HCI 574 - lecture 23 - glob, regex (Mar. 7, 2014)
● finish Python file system operations from lecture 22, open filesystem.py
● get and unzip lecture23.zip - will contain scripts and some play around files/folders
● scripts: using_glob.py and regex.py
● Python "demo" applications: folder_tree.py and redemo.py
● HW 6 - find files with the same name that live in different folders
● Optional: shutil module for shell commands on files/folders (copy, move, delete, etc.)
● Optional: Using the zipfile module to compress files
glob module - file/folder name pattern matching with wildcards:
● task: list all files in the current folder starting with a and ending in .txt
● use a "glob", a pattern that contains special pattern matching characters (wildcards) such as
*, ?, [0-9], [a-z], !
● http://docs.python.org/library/glob.html (Modeled after UNIX style wildcard pattern matching)
● *: matches any run of characters (letters, numbers, dots, ...), including none at all:
*.txt finds stuff.txt and bla.txt but not bla.xml
● ?: matches exactly one character: bl?.txt finds bla.txt and blo.txt
The Python glob module (using_glob.py inside lecture23 folder)
● import glob # glob module from the standard library
● glob.glob() function - filename pattern matching for current folder via special "wildcards"
● files = glob.glob("*.txt") # returns a list of the names that match the pattern
● glob() returns empty list [] if no matches are found
● pattern must be a single string: "*.txt" or r"..\*.*" or r"c:\temp\*.*"
● Note: you can use / for glob patterns, even in Windows (no need for \\)
● to glue together parts, use os.sep:
"stuff" + os.sep + "folderA" + os.sep + "*.jpg"
● glob("*.txt") returns files bla.txt and blo.txt but not bla.doc
● glob("f*") returns files and folders starting with f
● */*.txt finds all txt files in the direct sub-folders (one level down, not deeper)
● */*/* finds everything in the sub-folders' sub-folders (two levels down)
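To try these patterns without touching your own files, here is a minimal sketch that builds a throw-away folder first. The file and folder names are made up for illustration, and os.path.join is used as an alternative to gluing parts together with os.sep.

```python
import glob
import os
import shutil
import tempfile

# Build a small throw-away folder tree (all names here are made up).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folderA"))
for name in ["bla.txt", "blo.txt", "bla.doc", os.path.join("folderA", "deep.txt")]:
    open(os.path.join(root, name), "w").close()

# os.path.join glues the parts together with the right separator.
top_level = sorted(glob.glob(os.path.join(root, "*.txt")))
one_down = sorted(glob.glob(os.path.join(root, "*", "*.txt")))
f_names = sorted(glob.glob(os.path.join(root, "f*")))  # files AND folders
no_match = glob.glob(os.path.join(root, "*.jpg"))      # nothing matches

print([os.path.basename(p) for p in top_level])  # ['bla.txt', 'blo.txt']
print([os.path.basename(p) for p in one_down])   # ['deep.txt']
print([os.path.basename(p) for p in f_names])    # ['folderA']
print(no_match)                                  # []

shutil.rmtree(root)  # remove the throw-away folder again
```

Note that glob returns the names in arbitrary order, which is why the sketch sorts them.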
more complex glob patterns
● [0-9] means a single digit from 0 to 9 (the - sets up a range)
○ img[1-4].jpg finds img1.jpg, img2.jpg, ..., img4.jpg
○ img[135].jpg finds only img1.jpg, img3.jpg and img5.jpg (no - here!)
● [a-c]* finds all files starting with a, b or c
● [!a-c]* finds files NOT starting with a, b, or c (e.g. files starting with d-z, with a digit, or with any other character); ! means not
● brainteasers: (looking at files in lecture 23 folder):
- what does img[0-9][0-9].jpg return?
- what pattern returns all report files that have a 3-letter month and are from 2008 or 2009?
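The bracket patterns can also be tried on a plain list of names, without any files on disk. This is a small sketch using the standard fnmatch module, which follows the same wildcard rules as glob; the names in the list are made up and are not the files in the lecture 23 folder.

```python
import fnmatch

# A made-up list of file names to test patterns against.
names = ["img1.jpg", "img3.jpg", "img5.jpg", "img7.jpg",
         "apple.txt", "cat.txt", "dog.txt", "9lives.txt"]

range_hits = fnmatch.filter(names, "img[1-4].jpg")  # a range: 1, 2, 3 or 4
set_hits = fnmatch.filter(names, "img[135].jpg")    # a set: 1, 3 or 5
abc_hits = fnmatch.filter(names, "[a-c]*")          # starts with a, b or c
not_abc = fnmatch.filter(names, "[!a-c]*")          # starts with anything else
one_char = fnmatch.filter(names, "?og.txt")         # exactly one character + og.txt

print(range_hits)  # ['img1.jpg', 'img3.jpg']
print(set_hits)    # ['img1.jpg', 'img3.jpg', 'img5.jpg']
print(abc_hits)    # ['apple.txt', 'cat.txt']
print(not_abc)     # the img files, dog.txt and 9lives.txt
print(one_char)    # ['dog.txt']
```

The [!a-c]* result includes 9lives.txt, which shows that "not a, b or c" also covers names starting with a digit.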
Regular expressions (re) - complex pattern matching in Python (also called Perl-style regular expressions)
They use a pattern matching syntax that is different(!!!) from the glob() syntax shown above!
Regular expressions (re or regex) are a lot more powerful for pattern matching than glob(), but they are also quite
a bit more complex. I'll only go over a tiny fraction of what you can do with re, but here are some links:
● http://docs.python.org/2/library/re.html
● docs.python.org/dev/howto/regex.html
● https://developers.google.com/edu/python/regular-expressions
● http://www.noah.org/wiki/RegEx_Python
● http://effbot.org/librarybook/re.htm
First, let's play around with the more complex pattern matching syntax of Perl-style regular
expressions.
Run the script redemo_GUI.py (in your lecture23 folder). Paste this into the middle window (text is also in
Dear Grandson.txt) and make sure that MULTILINE is checked ON!
Dear Grandson,
My current email is grama.write@com. Or is is [email protected]?
Pa's email is [email protected]. Or maybe it's grumpy@old@[email protected]?
Sorry, those funny @ signs are confusing! Please write us soon!
We will extract all syntactically valid email addresses from this text. First manually in redemo, then in our
own script. The pattern describing a syntactically valid email address is this:
[A-Za-z0-9.]+@[A-Za-z0-9.]+com
Paste this into the first line of redemo (check: show all matches)
● A-Z : all letters from A to Z (a range)
● [A-Za-z0-9] : [] => glue together several ranges: A-Z or a-z or 0-9 - this gives the allowed characters
● [A-Za-z0-9.] : also allow the dot (but: no space => space acts as separator!)
● + : means that an allowed character must occur one or more times
● [A-Za-z0-9.]+ : defines a word (here: dot(s) are allowed, but spaces, dashes, etc. are not!)
● [A-Za-z0-9.]+@ : a literal @ that must be directly to the right of a word
● [A-Za-z0-9.]+com : the literal sequence of letters com that must be directly to the right of a word
● [A-Za-z0-9.]+@[A-Za-z0-9.]+com : a sequence of a word, the @, a word and the com
● shorthands: \w ("word" character) is short for [A-Za-z0-9_], \d ("decimal" digit) is short for [0-9]
Now let's use this inside Python (open reg_expr.py):
import re
s = """
Dear Grandson,
My current email is grama.write@com. Or is is [email protected]?
Pa's email is [email protected]. Or maybe it's grumpy@old@[email protected]?
Sorry, those funny @ signs are confusing! Please write us soon!
"""
# this string describes the pattern to match
pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+com"
all_matches = re.findall(pattern, s)
print(all_matches) # => ['[email protected]', '[email protected]', '[email protected]']
# replace matches with another string
new_s = re.sub(pattern, "[email protected]", s)
print(new_s)
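Here is a small self-contained variant of the same idea that can be run on its own. The text and the example.com addresses are made up for illustration; the pattern is the one from above, and the \w shorthand version is only a sketch of how the shorthands could be used.

```python
import re

# Made-up text with two valid addresses, one invalid one and one tricky one.
text = ("Write to anna.lee@example.com or bob99@mail.example.com. "
        "Not valid: nobody@com. Tricky: grumpy@old@example.com")

pattern = r"[A-Za-z0-9.]+@[A-Za-z0-9.]+com"
matches = re.findall(pattern, text)
print(matches)
# ['anna.lee@example.com', 'bob99@mail.example.com', 'old@example.com']

# Same pattern written with the \w shorthand (\w also allows the underscore).
shorthand = re.findall(r"[\w.]+@[\w.]+com", text)
print(shorthand == matches)  # True for this text

# \d+ finds runs of one or more digits.
digits = re.findall(r"\d+", "img12 and img7")
print(digits)  # ['12', '7']

# Replace every match with a placeholder string.
new_text = re.sub(pattern, "<hidden>", text)
print(new_text)
```

nobody@com is skipped because at least one allowed character must come between the @ and the com. For grumpy@old@example.com only the part starting at old is matched, because a word cannot contain an @.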
Optional:
shutil (shell utility) module - copying, moving, deleting files and folders (OS independent)
● shutil.copy("hey.txt", "folderA") # copy file hey.txt into folderA
● shutil.copy("hey.txt", "folderA/copy_of_hey.txt") # hey.txt -> folderA/copy_of_hey.txt
● http://docs.python.org/library/shutil.html
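A minimal sketch of copying, moving and deleting with shutil. It works inside a temporary folder so nothing of yours gets touched; moved.txt and the temporary folder are made-up names for illustration.

```python
import os
import shutil
import tempfile

# Set up a throw-away working folder with one file and one sub-folder.
work = tempfile.mkdtemp()
src = os.path.join(work, "hey.txt")
with open(src, "w") as f:
    f.write("hello")
folder_a = os.path.join(work, "folderA")
os.makedirs(folder_a)

shutil.copy(src, folder_a)                                   # -> folderA/hey.txt
shutil.copy(src, os.path.join(folder_a, "copy_of_hey.txt"))  # copy under a new name
shutil.move(src, os.path.join(work, "moved.txt"))            # move (rename) the original

in_folder_a = sorted(os.listdir(folder_a))
in_work = sorted(os.listdir(work))
print(in_folder_a)  # ['copy_of_hey.txt', 'hey.txt']
print(in_work)      # ['folderA', 'moved.txt']

shutil.rmtree(work)  # delete the folder and everything in it
still_there = os.path.exists(work)
print(still_there)   # False
```

Be careful with shutil.rmtree(): it deletes the whole folder tree without asking and without a recycle bin.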
Compressing files into a zip file archive
● http://docs.python.org/library/zipfile.html
● http://www.doughellmann.com/PyMOTW/zipfile/
● uses the ZipFile class from the zipfile module
● make an empty zip archive, add (write) files into archive, close archive
● actual file compression must be requested via ZIP_DEFLATED (this needs the zlib module to be available)
import zipfile
zf = zipfile.ZipFile("myzip.zip", mode="w") # make empty zip file object
zf.write('bla.txt', compress_type=zipfile.ZIP_DEFLATED) # put in zip file
zf.close() # closes write stream but object still exists!
● files (bla.txt) can have a path ("lecture23/bla.txt") but cannot be a folder
● write() caveat: does NOT automatically add sub-folders, only adds files
● Unzipping: create and open ZipFile object for read, extractall() to folder, close():
import os # needed for os.makedirs()
zf2 = zipfile.ZipFile("myzip.zip") # open same file for reading
os.makedirs("test") # make a test folder
zf2.extractall("test") # extract content of zf2 into folder test
zf2.close() # close the archive again
● zf.infolist() returns a list of ZipInfo objects, one for each file in the archive, which contain: date/time,
comment, compressed size, etc.
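Because write() only adds single files, one way to zip a whole folder is to walk through it with os.walk() and add each file yourself. This is just a sketch under that assumption; the folder stuff, its files and stuff.zip are made up and created in a temporary folder.

```python
import os
import shutil
import tempfile
import zipfile

# Build a made-up folder with a sub-folder and two text files.
work = tempfile.mkdtemp()
src_dir = os.path.join(work, "stuff")
os.makedirs(os.path.join(src_dir, "sub"))
for rel in ["a.txt", os.path.join("sub", "b.txt")]:
    with open(os.path.join(src_dir, rel), "w") as f:
        f.write("some text " * 100)

# Walk the folder and write every file into the archive.
zip_path = os.path.join(work, "stuff.zip")
zf = zipfile.ZipFile(zip_path, mode="w")
for folder, subfolders, files in os.walk(src_dir):
    for name in files:
        full = os.path.join(folder, name)
        arcname = os.path.relpath(full, work)  # path stored inside the archive
        zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
zf.close()

# Open for reading, look at the content, then extract everything.
zf2 = zipfile.ZipFile(zip_path)
names = sorted(zf2.namelist())
sizes = [(info.filename, info.file_size, info.compress_size)
         for info in zf2.infolist()]
out_dir = os.path.join(work, "test")
zf2.extractall(out_dir)  # creates the folder if needed
zf2.close()

print(names)  # ['stuff/a.txt', 'stuff/sub/b.txt']
print(sizes)  # original vs. compressed size of each file
extracted_ok = os.path.isfile(os.path.join(out_dir, "stuff", "sub", "b.txt"))
shutil.rmtree(work)  # clean up the temporary folder
```

Inside the archive the names always use / as separator, no matter which operating system created it.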