CSC105, Introduction to Computer Science
Lab06: Analyzing Text Documents with Perl

[NOTE: This material assumes that you have reviewed Chapter 3, “Lists and Hashes,” in Beginning Perl, especially pages 104–9.]

Directions. In this lab, you will explore analyzing text using Perl programs in a series of separate activities. Create a submission folder named with your last name + Lab06. As before, this will serve as the home location for all of your programs or scripts. Upload this folder electronically when you complete the lab activities.

You will need the folder lab06Files, which is in the lab materials folder in the csc121 folder in your Box account. Download a copy to your desktop. It contains a text file that will be used in these activities. It is a good idea to move the test text file into your submission folder as before. Otherwise, you may have problems opening the file when executing your programs.

I. ACTIVITY: Understanding Hashes.

1. Hashes. A hash data structure is like an array or list in that it can hold any number of values that you can retrieve later when needed. But instead of indexing the values by number (i.e., position), you look up the values by name. Technically, a hash is an associative array. Think of it as two arrays that are linked side by side. The first array is a list of keys used to index another list of values. In Figure 1, the hash has five elements. The keys are always string values. Each one references a specific value associated with it. Hash values can be any legal scalar value. In Figure 1, there are both numeric and text instances.

Figure 1: Picturing a Perl hash as an associative array.

2. Keys. The keys are arbitrary. You can use any string expression as a key; moreover, they do not have to be organized in any particular order. The only constraint is that they must be unique. The reason for this restriction is that keys uniquely identify their associated values. Thus, the key acts like a tag.
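
As a minimal sketch of the key-as-tag idea, the following lines look a value up by its key and show what happens when the same key is assigned twice. The hash %heights, its names, and its numbers are made up for illustration; the % and $ notation is introduced in item 4 below.

```perl
#!/usr/bin/perl
# keys_demo.pl (hypothetical file name, not one of the lab's examples)

# a small hash: each key is a tag for exactly one value
%heights = ('Ann', 64, 'Bob', 70, 'Cal', 64);

# look a value up by its key; no position is needed
print "Bob: $heights{'Bob'}\n";

# assigning to an existing key replaces its value
# instead of adding a second entry for that key
$heights{'Bob'} = 71;
print "Bob: $heights{'Bob'}\n";

# in scalar context, keys gives the number of keys, which is still three
$count = keys %heights;
print "Number of keys: $count\n";
```

Because a key can appear only once, the second assignment to 'Bob' overwrites the earlier value rather than creating a duplicate tag.
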
Whenever we are searching for a particular value, we identify it by its tag (key). It is common to use a conventionally named variable for keys, e.g., $key. The advantage of using hashes is that you do not have to know where a value is stored, as you do with a list or array. You can find it just by indicating the key.

3. Values. As mentioned, values may be numbers, text, or mixtures of both. In addition, they do not have to be unique. For example, if you were storing the heights of a group of people, it is certainly possible that two or more might be the same height. It is also common to use a conventionally named variable for values, $value.

4. Here is a simple example that defines a hash called coins. Notice that the % sign is used to signify that the variable coins refers to a hash, just as $ denotes scalar values and @ refers to lists.

    #!/usr/bin/perl
    # example1.pl

    # define the hash
    %coins = ("Quarter", 25, "Dime", 10, "Nickel", 5);

    # print the hash
    print %coins;

Create your own version, call it example1.pl, and observe the output.

5. A more civilized version is given below as example2.pl.

    #!/usr/bin/perl
    # example2.pl

    # define hash
    %coins = ( 'Quarter' => 25,
               'Dime'    => 10,
               'Nickel'  => 5,
               'Penny'   => 1 );

    # loop through the hash printing each pair
    while (($key, $value) = each %coins) {
        $line = $key . ', ' . $value;
        print "$line\n";
    }

The big arrow operator (=>) can be used to pair each key with its value. The line-by-line formatting, however, is optional. Try this version and note the difference in output. The loop uses the each function, which makes it possible to iterate over the entire hash structure. In practice, each is most often used within a while loop, as in the example.

NOTE: The $line variable demonstrates how you can concatenate additional information or formatting along with the hash keys and values.

6. This next version, example3.pl, shows how to organize the keys using the sort( ) function.

    #!/usr/bin/perl
    # example3.pl

    # define hash
    %coins = ( 'Quarter' => 25,
               'Dime'    => 10,
               'Nickel'  => 5,
               'Penny'   => 1 );

    # loop through the hash printing each pair, in sorted key order
    foreach $x (sort keys %coins) {
        print "$x: $coins{$x}\n";
    }

Try this version and compare the output with example2.pl. The expression $coins{$x} acts as the variable for each element value. The Perl idiom would normally be written as $coins{$key}, but this example shows that there is no special significance in the name of the scalar variable. Whatever variable you designate in the foreach loop is what would be expected inside the braces.

7. If we wanted to sort the hash by values instead of keys, some changes would be necessary to the hash values. Examine this last version, example4.pl.

    #!/usr/bin/perl
    # example4.pl

    # define hash
    %coins = ( 'Quarter' => 0.25,
               'Dime'    => 0.10,
               'Nickel'  => 0.05,
               'Penny'   => 0.01 );

    # loop through the hash printing each pair, ordered by value
    foreach $value (sort {$coins{$a} cmp $coins{$b}} keys %coins) {
        print "$value: $coins{$value}\n";
    }

The numeric values have been changed so that the comparisons come out accurately. Try this version. Is the output sorted properly? How does it work? HINT: Recall the cmp operator?

8. Include copies of example1.pl, example2.pl, example3.pl, and example4.pl in your submission folder.

II. ACTIVITY: Counting Word Frequencies.

1. Create a program that opens a file, reads it line by line, processes the words in each line (counting the number of times that each word occurs), and outputs a list of the words with their total number of occurrences. Use the file gettysburg2.txt for testing purposes. A copy is available in the folder lab06files.

HINTS:
• you will need to create an empty hash initially. Here is an example:
    %words = ();
• to count the words properly, it will be necessary to convert every line to lowercase and strip all the punctuation from it. (These are tasks that you have done before.)
• finally, the hash structure can be used to count the frequency of words by using an assignment like
    $words{$x} = $words{$x} + 1;
  where the variable $x refers to the current word in the word list and serves as the key in the hash structure. Specifically, inside a loop that processes each word in the current line, the assignment statement either (a) adds a new word to the hash structure with a value of 1, or (b) increments the value of a key already found in the hash structure.

Name your program frequency.pl and include it in the submission folder. Here is a brief excerpt of how the output might look.

    shall 3
    should 1
    so 3
    struggled 1
    take 1
    task 1
    testing 1
    that 13
    the 11
    their 1
    these 2
    they 3
    this 4
    those 1
    thus 1
    to 8
    under 1
    unfinished 1
    us 3
    vain 1
    war 2
    we 10
    what 2
    whether 1
    which 2
    who 3
    will 1
    work 1
    world 1
    years 1

III. ACTIVITY: Building a Basic Concordance

1. A concordance is an alphabetical listing of the principal words in a text document along with their immediate contexts. Concordances are used for studying word usage, word frequencies, analyzing idioms, creating indices, and a host of other applications. Create a Perl program called concord.pl that constructs a basic concordance of its input text file. Use gettysburg2.txt for testing purposes. Here is an excerpt of how the output might look. (Color has been added for emphasis only.)

    can
        4 conceived and so dedicated can long endure. We are met on a great battlefield of
        10 The world will little note nor long remember what we say here, but it can never forget
    cannot
        7 fitting and proper that we should do this. But in a larger sense, we cannot dedicate,
        8 we cannot consecrate, we cannot hallow this ground. The brave men, living and dead
        8 we cannot consecrate, we cannot hallow this ground. The brave men, living and dead
    cause
        14 these honored dead we take increased devotion to that cause for which they gave the
    civil
        3 Now we are engaged in a great civil war, testing whether that nation or any nation so
    come
        5 that war. We have come to dedicate a portion of that field as a final resting-place
    conceived
        2 conceived in liberty and dedicated to the proposition that all men are created equal.
        4 conceived and so dedicated can long endure. We are met on a great battlefield of
    consecrate
        8 we cannot consecrate, we cannot hallow this ground. The brave men, living and dead
    consecrated
        9 who struggled here have consecrated it far above our poor power to add or detract.
    continent
        1 Fourscore and seven years ago our fathers brought forth on this continent a new nation,
    created
        2 conceived in liberty and dedicated to the proposition that all men are created equal.
    dead
        8 we cannot consecrate, we cannot hallow this ground. The brave men, living and dead
        14 these honored dead we take increased devotion to that cause for which they gave the
        15 last full measure of devotion--that we here highly resolve that these dead shall not

As you can see, each block lists
• the word (in alphabetical order), and
• each of the original lines in which it occurs;
• the lines are also numbered to signify where they occur in the file.
• the line is repeated if the word occurs in it more than once. (A smarter version might notice this, but we won’t.)

As before, the basic process involves reading in the file line by line and processing it. Afterwards, the results are displayed.
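
The per-line processing mentioned above (keep a copy, convert to lowercase, remove punctuation, split into words) might be sketched as follows. This is only one possible approach, not necessarily the one used earlier in the course. It assumes that punctuation is replaced with a space rather than deleted outright, so that a run such as "devotion--that" becomes two words instead of one fused word. The file name normalize_demo.pl is hypothetical, and the sample line is taken from the excerpt above.

```perl
#!/usr/bin/perl
# normalize_demo.pl (hypothetical file name)

# one line of text standing in for a line read from the file
$line = "last full measure of devotion--that we here highly resolve that these dead shall not";

$copy = $line;             # keep the original line for later display
$line = lc($line);         # convert to lowercase

# assumption: replace every character that is not a lowercase letter
# or whitespace with a space (this also removes digits)
$line =~ s/[^a-z\s]/ /g;

# split on runs of whitespace
@word = split(' ', $line);

print join('|', @word), "\n";
```

Replacing punctuation with a space has a cost of its own: a contraction such as "won't" would come out as the two words "won" and "t". Deleting punctuation instead avoids that but joins words separated only by dashes, so either choice is a trade-off.
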
Here is a pseudocode version of the algorithm.

    # create a hash structure of words in the text (keys) and the lines they occur in (values)
    while there are more lines to read
        increment the line number counter
        remove the newline from the line
        copy the original line so that it can be reused
        convert the line to lowercase
        strip out all punctuation
        split the line into individual words
        for each word in the line
            add it to a hash structure along with the line number and line copy
        end for
    end while (lines)

    # display the results
    for each key in the sorted hash structure
        split the values (i.e., line occurrences) into separate units in a list
        for each unit in the list
            print each unit (line occurrence)
        end for (units)
    end for (keys)

The trick is to add each line occurrence to the hash structure by concatenating it to the value. Here is one way to do it.

    foreach $x (@word) {
        $words{$x} = $words{$x} . $lineNum . ' ' . $copy . '#';
    }

Here @word is the list of words from the current line, $words{$x} is the variable signifying the value of the hash whose key is $x, $lineNum is the variable that tracks the current line’s number in the text, and $copy is the copy of the original line of text from the file. As before, the assignment either adds a new <key, value> pair to the hash, if $x is not yet in the hash structure, or appends the line number + a space + a copy of the original line + the # symbol to the existing hash value for $x.

Later, when it is time to print out these lines, the ‘#’ symbol can be used to separate the individual line occurrences (using the split( ) function) for printing. Specifically,

    @lines = split("#", $words{$x});

where @lines is a list of strings separated from the current hash value. So, the ‘#’ symbol is just a convenient marker for distinguishing one line occurrence from another.
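
To see the marker idea in isolation, here is a small sketch that appends occurrences to hash values and then splits them apart again. It is not a full concord.pl: the two lines in @text are made up, already lowercase and free of punctuation, and stand in for lines read from a file. The file name marker_demo.pl is hypothetical.

```perl
#!/usr/bin/perl
# marker_demo.pl (hypothetical file name)

# two made-up lines standing in for the text file
@text = ("the cat sat", "the dog ran");

%words = ();
$lineNum = 0;

foreach $copy (@text) {
    $lineNum = $lineNum + 1;
    @word = split(' ', $copy);
    foreach $x (@word) {
        # append "number line#" to whatever is already stored for this word
        $words{$x} = $words{$x} . $lineNum . ' ' . $copy . '#';
    }
}

# the value for "the" now holds two occurrences, each ending in '#'
print "$words{'the'}\n";

# split on '#' to recover the individual occurrences
@lines = split("#", $words{'the'});
foreach $unit (@lines) {
    print "$unit\n";
}
```

The trailing '#' on the stored value does not produce an extra empty entry, because split drops trailing empty fields by default. The scheme does depend on '#' never appearing in the text itself; if it did, a line would be cut apart at that point.
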
© Copyright 2024