You write regular expressions (regex) to match patterns in strings. When you are processing text, you may want to extract a substring of some predictable structure: a phone number, an email address, or something more specific to your research or task. You may also want to clean your text of some kind of junk: maybe there are repetitive formatting errors due to some transcription process that you need to remove.
In these cases and in many others like them, writing the right regex will be better than working by hand or using a magical third-party library/software that claims to do what you want.
Please refer back to the slides to see the building blocks of regex.
import re #the regex module in the python standard library
#strings to be searched for matching regex patterns
str1 = "Aardvarks belong to the Captain"
str2 = "Albert's famous equation, E = mc^2."
str3 = "Located at 455 Serra Mall."
str4 = "Beware of the shape-shifters!"
test_strings = [str1, str2, str3, str4] #created a list of strings
for test_string in test_strings:
    print 'The test string is "' + test_string + '"'
    match = re.search(r'[A-Z]', test_string)
    if match:
        print 'The first possible match is: ' + match.group()
    else:
        print 'no match.'
The test string is "Aardvarks belong to the Captain"
The first possible match is: A
The test string is "Albert's famous equation, E = mc^2."
The first possible match is: A
The test string is "Located at 455 Serra Mall."
The first possible match is: L
The test string is "Beware of the shape-shifters!"
The first possible match is: B
Let's go through the code above line by line:
for test_string in test_strings:
test_strings is a list, and so it is iterable in a for loop. Every element in this list is a string. So for the rest of the for loop, we will be referring to the current element as test_string.
print 'The test string is "' + test_string + '"'
This just prints out the current string we're iterating over.
match = re.search(r'[A-Z]', test_string)
Remember the basic approach to using regex in Python: you give a searcher (in this case, the function re.search()) a pattern and a string in which to find matches. That's exactly what this line does. re.search() returns either an object of type SRE_Match or None.
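To see both return values side by side, here is a minimal sketch. The two sample strings and the variable names are made up for illustration, and the code calls print() with a single argument so that it should run under both Python 2 and Python 3.

```python
import re

# Hypothetical sample strings, not taken from the tutorial's test list.
with_capital = "hello World"
without_capital = "hello world"

found = re.search(r'[A-Z]', with_capital)
missing = re.search(r'[A-Z]', without_capital)

# A successful search gives back a match object...
print(found.group())    # the matched substring
print(found.start())    # the index where the match begins

# ...while an unsuccessful search gives back None.
print(missing is None)
```

Calling .group() on the None result would raise an AttributeError, which is why the tutorial code tests the result before using it.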
if match:
    print 'The first possible match is: ' + match.group()
else:
    print 'no match.'
match holds one of two things: an SRE_Match object or None. None is an object that evaluates as false in a logical test, while a match object evaluates as true. In this for loop, we've basically told the Python interpreter to check whether match is None or not. If it isn't, we print a string plus match.group(). group() is a method that SRE_Match objects have. By default, it returns the 0th group; we'll get to what that means later. For now, just know that it will return the substring that matched the pattern defined.
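The truth test and the default group can be checked directly. This is a small sketch with a made-up input string; print() is called with one argument so it should behave the same under Python 2 and Python 3.

```python
import re

# Hypothetical example string for illustration.
match = re.search(r'[0-9]', "Room 455")

# None is falsy and a match object is truthy, so these two tests agree.
is_truthy = bool(match)
is_not_none = match is not None
print(is_truthy)
print(is_not_none)

# Calling group() with no argument is the same as asking for group 0.
same_group = match.group() == match.group(0)
print(same_group)
print(match.group())
```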
Note that only a single character is returned. That's for two reasons: the pattern [A-Z] matches exactly one character, and re.search finds the first possible match and then doesn't look for any more. If you want to find all of the possible matches in a string, you can use re.findall(), which will return a list of all matches:
for string in test_strings:
    print re.findall(r'[A-Z]', string)
['A', 'C']
['A', 'E']
['L', 'S', 'M']
['B']
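The difference between the two functions is easiest to see on one string. The sketch below reuses the text of str3; the variable names are assumptions chosen for this illustration.

```python
import re

sentence = "Located at 455 Serra Mall."  # same text as str3 above

first = re.search(r'[A-Z]', sentence)
every = re.findall(r'[A-Z]', sentence)
nothing = re.findall(r'[0-9]', "no digits here")

print(first.group())   # a single string: the first match only
print(every)           # a list of strings: every non-overlapping match
print(nothing)         # an empty list (not None) when nothing matches
```

Note the asymmetry: a failed re.search gives None, while a failed re.findall gives an empty list.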
You can also compile your regex ahead of time. This will create SRE_Pattern objects. Compiling can help performance when the same pattern is reused many times, because the pattern string only has to be parsed once. Additionally, you can create lists of these objects and iterate over both strings and patterns more easily. Here's an example:
patterns = [re.compile(r'[ABC]'),
            re.compile(r'[^ABC]'),
            re.compile(r'[ABC^]'),
            re.compile(r'[0123456789]'),
            re.compile(r'[0-9]'),
            re.compile(r'[0-4]'),
            re.compile(r'[A-Z]'),
            re.compile(r'[A-Za-z]'),
            re.compile(r'[A-Za-z0-9]'),
            re.compile(r'[-a-z]'),
            re.compile(r'[- a-z]')]
def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
    for pattern in patterns:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)]
The first potential match for "[ABC]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[^ABC]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[ABC^]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[0123456789]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[0-9]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[0-4]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[A-Z]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[A-Za-z]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[A-Za-z0-9]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[-a-z]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[- a-z]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[ABC]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[^ABC]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[ABC^]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[0123456789]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[0-9]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[0-4]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[A-Z]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[A-Za-z]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[A-Za-z0-9]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[-a-z]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[- a-z]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[ABC]" in "Located at 455 Serra Mall." is: no match.
The first potential match for "[^ABC]" in "Located at 455 Serra Mall." is: L
The first potential match for "[ABC^]" in "Located at 455 Serra Mall." is: no match.
The first potential match for "[0123456789]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[0-9]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[0-4]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[A-Z]" in "Located at 455 Serra Mall." is: L
The first potential match for "[A-Za-z]" in "Located at 455 Serra Mall." is: L
The first potential match for "[A-Za-z0-9]" in "Located at 455 Serra Mall." is: L
The first potential match for "[-a-z]" in "Located at 455 Serra Mall." is: o
The first potential match for "[- a-z]" in "Located at 455 Serra Mall." is: o
The first potential match for "[ABC]" in "Beware of the shape-shifters!" is: B
The first potential match for "[^ABC]" in "Beware of the shape-shifters!" is: e
The first potential match for "[ABC^]" in "Beware of the shape-shifters!" is: B
The first potential match for "[0123456789]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[0-9]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[0-4]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[A-Z]" in "Beware of the shape-shifters!" is: B
The first potential match for "[A-Za-z]" in "Beware of the shape-shifters!" is: B
The first potential match for "[A-Za-z0-9]" in "Beware of the shape-shifters!" is: B
The first potential match for "[-a-z]" in "Beware of the shape-shifters!" is: e
The first potential match for "[- a-z]" in "Beware of the shape-shifters!" is: e
Let's go over this code line by line:
patterns = [re.compile(r'[ABC]'),
            re.compile(r'[^ABC]'),
            re.compile(r'[ABC^]'),
            re.compile(r'[0123456789]'),
            re.compile(r'[0-9]'),
            re.compile(r'[0-4]'),
            re.compile(r'[A-Z]'),
            re.compile(r'[A-Za-z]'),
            re.compile(r'[A-Za-z0-9]'),
            re.compile(r'[-a-z]'),
            re.compile(r'[- a-z]')]
This creates a list of SRE_Pattern objects.
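A compiled pattern can be passed to the re functions, as the tutorial does, or used through its own methods. The sketch below shows both routes giving the same result; the name capitals and the second sample string are assumptions for illustration.

```python
import re

capitals = re.compile(r'[A-Z]')   # hypothetical name for this illustration

text = "Beware of the shape-shifters!"

# The module-level function accepts a compiled pattern...
via_module = re.search(capitals, text)
# ...and the compiled object has a method of the same name.
via_method = capitals.search(text)

print(via_module.group() == via_method.group())
print(capitals.pattern)                 # the original regex string
print(capitals.findall("Serra Mall"))   # findall is available as a method too
```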
def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
I defined a function find_match that expects two arguments called pattern and string. Notice that this function is very similar to the if/else test from the code above. Note also that this function returns either match.group() or the string "no match."
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
By defining the find_match() function above, I can then call it from within a list comprehension. In words, for each string test_string that is in test_strings, I want to compare against the list of patterns and return matches. The resulting list of matches should be the same length as patterns; one match per pattern tested.
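If list comprehensions are unfamiliar, it may help to see one next to the explicit loop it replaces. This is a sketch using a shortened pattern list (few_patterns is a made-up name) and a copy of find_match so that the block runs on its own.

```python
import re

def find_match(pattern, string):
    """Return the first match as a string, or 'no match.' if there is none."""
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'

# A shortened, hypothetical pattern list to keep the output small.
few_patterns = [re.compile(r'[A-Z]'), re.compile(r'[0-9]'), re.compile(r'[-a-z]')]
test_string = "Located at 455 Serra Mall."

# The list comprehension...
matches = [find_match(pattern, test_string) for pattern in few_patterns]

# ...builds the same list as this explicit loop.
matches_loop = []
for pattern in few_patterns:
    matches_loop.append(find_match(pattern, test_string))

print(matches)
print(matches == matches_loop)
print(len(matches) == len(few_patterns))
```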
for pattern in patterns:
    print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)]
Because I wanted to print some diagnostic output, I need to iterate over each pattern in patterns (a list and thus iterable) and print it out, along with the test string. If you want to get the pattern out of an SRE_Pattern object, you can read its .pattern attribute, which holds the regex pattern as a string. Since we are nesting this loop within the bigger loop above, this loop will go over every pattern in the patterns list for each string, and then repeat for the next string in the list test_strings.
However, note that I am dynamically referring to the index of the matches list. By this, I mean the following code:
matches[patterns.index(pattern)]
Make sure this makes sense to you. Remember, matches and patterns are the same length. That means that if I want to return the match that corresponds to the current pattern, I have to ask for the element of matches at the same index that the current pattern has in patterns. Every list has an .index() method, which returns the index of the first element equal to the argument you pass in. So if I wanted to know where in patterns the regex r'[^ABC]' was, I could use patterns.index(re.compile(r'[^ABC]')). This will return an int, which corresponds to the position of r'[^ABC]' in patterns. (In CPython this lookup succeeds because re.compile caches compiled patterns, so compiling the same string again hands back the object that is already in the list.)
print patterns.index(re.compile(r'[^ABC]'))
1
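One other way to pair each pattern with its result is zip(), which walks two lists in step. This is offered as an alternative sketch, not as the approach the tutorial code uses; few_patterns and lines are made-up names, and find_match is restated in a compact form so the block runs on its own.

```python
import re

# A shortened, hypothetical pattern list to keep the output small.
few_patterns = [re.compile(r'[ABC]'), re.compile(r'[^ABC]'), re.compile(r'[0-9]')]
test_string = "Aardvarks belong to the Captain"

def find_match(pattern, string):
    """Return the first match as a string, or 'no match.' if there is none."""
    match = re.search(pattern, string)
    return match.group() if match else 'no match.'

matches = [find_match(pattern, test_string) for pattern in few_patterns]

# zip() pairs each pattern with its result, so no .index() lookup is needed.
lines = []
for pattern, result in zip(few_patterns, matches):
    lines.append('"' + pattern.pattern + '" gives: ' + result)

for line in lines:
    print(line)
```

A design note: .index() rescans the list on every call and always reports the first equal element, so a list that holds the same pattern twice would look up the first position both times. Pairing with zip() avoids both issues.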
patterns2 = [re.compile(r'.'),
             re.compile(r'\w'),
             re.compile(r'\W'),
             re.compile(r'\d'),
             re.compile(r'\D'),
             re.compile(r'\n'),
             re.compile(r'\r'),
             re.compile(r'\t'),
             re.compile(r'\f'),
             re.compile(r'\s')]
test_strings.append('Aardvarks belong to the Captain, capt_hook')
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns2]
    for pattern in patterns2:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns2.index(pattern)]
The first potential match for "." in "Aardvarks belong to the Captain" is: A
The first potential match for "\w" in "Aardvarks belong to the Captain" is: A
The first potential match for "\W" in "Aardvarks belong to the Captain" is:  
The first potential match for "\d" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\D" in "Aardvarks belong to the Captain" is: A
The first potential match for "\n" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\r" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\t" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\f" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\s" in "Aardvarks belong to the Captain" is:  
The first potential match for "." in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\w" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\W" in "Albert's famous equation, E = mc^2." is: '
The first potential match for "\d" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "\D" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\n" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\r" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\t" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\f" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\s" in "Albert's famous equation, E = mc^2." is:  
The first potential match for "." in "Located at 455 Serra Mall." is: L
The first potential match for "\w" in "Located at 455 Serra Mall." is: L
The first potential match for "\W" in "Located at 455 Serra Mall." is:  
The first potential match for "\d" in "Located at 455 Serra Mall." is: 4
The first potential match for "\D" in "Located at 455 Serra Mall." is: L
The first potential match for "\n" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\r" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\t" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\f" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\s" in "Located at 455 Serra Mall." is:  
The first potential match for "." in "Beware of the shape-shifters!" is: B
The first potential match for "\w" in "Beware of the shape-shifters!" is: B
The first potential match for "\W" in "Beware of the shape-shifters!" is:  
The first potential match for "\d" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\D" in "Beware of the shape-shifters!" is: B
The first potential match for "\n" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\r" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\t" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\f" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\s" in "Beware of the shape-shifters!" is:  
The first potential match for "." in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\w" in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\W" in "Aardvarks belong to the Captain, capt_hook" is:  
The first potential match for "\d" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\D" in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\n" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\r" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\t" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\f" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\s" in "Aardvarks belong to the Captain, capt_hook" is:  
test_strings2 = ["The Aardvarks belong to the Captain.",
                 "Bitter butter won't make the batter better.",
                 "Hark, the pitter patter of little feet!"]
patterns3 = [re.compile(r'Aa'),
             re.compile(r'[Aa][Aa]'),
             re.compile(r'[aeiou][aeiou]'),
             re.compile(r'[AaEeIiOoUu][aeiou]'),
             re.compile(r'[Tt]he'),
             re.compile(r'^[Tt]he'),
             re.compile(r'n.'),
             re.compile(r'n.$'),
             re.compile(r'\W\w'),
             re.compile(r'\w[aeiou]tter'),
             re.compile(r'\w[aeiou]tter'),
             re.compile(r'..tt..')]
for test_string in test_strings2:
    matches = [find_match(pattern, test_string) for pattern in patterns3]
    for pattern in patterns3:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns3.index(pattern)]
The first potential match for "Aa" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[Aa][Aa]" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[aeiou][aeiou]" in "The Aardvarks belong to the Captain." is: ai
The first potential match for "[AaEeIiOoUu][aeiou]" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[Tt]he" in "The Aardvarks belong to the Captain." is: The
The first potential match for "^[Tt]he" in "The Aardvarks belong to the Captain." is: The
The first potential match for "n." in "The Aardvarks belong to the Captain." is: ng
The first potential match for "n.$" in "The Aardvarks belong to the Captain." is: n.
The first potential match for "\W\w" in "The Aardvarks belong to the Captain." is:  A
The first potential match for "\w[aeiou]tter" in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "\w[aeiou]tter" in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "..tt.." in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "Aa" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[Aa][Aa]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[aeiou][aeiou]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[AaEeIiOoUu][aeiou]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[Tt]he" in "Bitter butter won't make the batter better." is: the
The first potential match for "^[Tt]he" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "n." in "Bitter butter won't make the batter better." is: n'
The first potential match for "n.$" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "\W\w" in "Bitter butter won't make the batter better." is:  b
The first potential match for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "..tt.." in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "Aa" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "[Aa][Aa]" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "[aeiou][aeiou]" in "Hark, the pitter patter of little feet!" is: ee
The first potential match for "[AaEeIiOoUu][aeiou]" in "Hark, the pitter patter of little feet!" is: ee
The first potential match for "[Tt]he" in "Hark, the pitter patter of little feet!" is: the
The first potential match for "^[Tt]he" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "n." in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "n.$" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "\W\w" in "Hark, the pitter patter of little feet!" is:  t
The first potential match for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is: pitter
The first potential match for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is: pitter
The first potential match for "..tt.." in "Hark, the pitter patter of little feet!" is: pitter
def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None
for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
            print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)])
        else:
            print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'
All potential matches for "Aa" in "The Aardvarks belong to the Captain." is/are: Aa
All potential matches for "[Aa][Aa]" in "The Aardvarks belong to the Captain." is/are: Aa
All potential matches for "[aeiou][aeiou]" in "The Aardvarks belong to the Captain." is/are: ai
All potential matches for "[AaEeIiOoUu][aeiou]" in "The Aardvarks belong to the Captain." is/are: Aa, ai
All potential matches for "[Tt]he" in "The Aardvarks belong to the Captain." is/are: The, the
All potential matches for "^[Tt]he" in "The Aardvarks belong to the Captain." is/are: The
All potential matches for "n." in "The Aardvarks belong to the Captain." is/are: ng, n.
All potential matches for "n.$" in "The Aardvarks belong to the Captain." is/are: n.
All potential matches for "\W\w" in "The Aardvarks belong to the Captain." is/are:  A,  b,  t,  t,  C
There were no matches for "\w[aeiou]tter" in "The Aardvarks belong to the Captain.".
There were no matches for "\w[aeiou]tter" in "The Aardvarks belong to the Captain.".
There were no matches for "..tt.." in "The Aardvarks belong to the Captain.".
There were no matches for "Aa" in "Bitter butter won't make the batter better.".
There were no matches for "[Aa][Aa]" in "Bitter butter won't make the batter better.".
There were no matches for "[aeiou][aeiou]" in "Bitter butter won't make the batter better.".
There were no matches for "[AaEeIiOoUu][aeiou]" in "Bitter butter won't make the batter better.".
All potential matches for "[Tt]he" in "Bitter butter won't make the batter better." is/are: the
There were no matches for "^[Tt]he" in "Bitter butter won't make the batter better.".
All potential matches for "n." in "Bitter butter won't make the batter better." is/are: n'
There were no matches for "n.$" in "Bitter butter won't make the batter better.".
All potential matches for "\W\w" in "Bitter butter won't make the batter better." is/are:  b,  w, 't,  m,  t,  b,  b
All potential matches for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
All potential matches for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
All potential matches for "..tt.." in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
There were no matches for "Aa" in "Hark, the pitter patter of little feet!".
There were no matches for "[Aa][Aa]" in "Hark, the pitter patter of little feet!".
All potential matches for "[aeiou][aeiou]" in "Hark, the pitter patter of little feet!" is/are: ee
All potential matches for "[AaEeIiOoUu][aeiou]" in "Hark, the pitter patter of little feet!" is/are: ee
All potential matches for "[Tt]he" in "Hark, the pitter patter of little feet!" is/are: the
There were no matches for "^[Tt]he" in "Hark, the pitter patter of little feet!".
There were no matches for "n." in "Hark, the pitter patter of little feet!".
There were no matches for "n.$" in "Hark, the pitter patter of little feet!".
All potential matches for "\W\w" in "Hark, the pitter patter of little feet!" is/are:  t,  p,  p,  o,  l,  f
All potential matches for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is/are: pitter, patter
All potential matches for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is/are: pitter, patter
All potential matches for "..tt.." in "Hark, the pitter patter of little feet!" is/are: pitter, patter, little
We have a new function and some new code. Let's go over it:
First, I wrote a function called find_all_matches:
def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None
There are only two differences between find_match and find_all_matches. First, find_all_matches uses re.findall, not re.search, so matches is a list of all possible matches. Second, instead of returning a single string in either condition, find_all_matches returns either a list of strings or None. (re.findall itself returns an empty list when nothing matches; the function turns that empty list into None.)
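Because an empty list is falsy, the same behavior can be written more compactly. The variant below is a sketch under that assumption, with the hypothetical name find_all_matches_short so it is not confused with the function above.

```python
import re

def find_all_matches_short(pattern, string):
    """Return a list of all matches, or None if there are none."""
    matches = re.findall(pattern, string)
    # An empty list is falsy, so "or None" swaps it for None.
    return matches or None

raw_empty = re.findall(r'[0-9]', "no digits")
converted = find_all_matches_short(r'[0-9]', "no digits")
some = find_all_matches_short(r'[0-9]', "455 Serra")

print(raw_empty)    # what re.findall gives on its own
print(converted)    # what the wrapper gives instead
print(some)
```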
for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
Remember the use of .index() from the previous code walkthrough. Also, remember that None evaluates as false in a logical condition test. In this if statement, I'm testing to see if there were any matches for the current pattern in the loop. If there were any matches, the code will execute the next line. Otherwise, it will go to the else block.
print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)])
If matches at the index of the current pattern is not None, it will be a list of strings. Because I'm printing these results, I wanted to format them nicely for diagnostic purposes. So we use the standard list-to-string Python expression separator.join(list). In this case, I wanted the results to be comma-separated, so the separator is ', '.
else:
    print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'
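For readers new to join, here is a short sketch of how the separator string controls the result. The list reuses matches printed above; the variable names are made up for illustration.

```python
found = ['Bitter', 'butter', 'batter', 'better']

comma_separated = ', '.join(found)   # separator between items is ", "
run_together = ''.join(found)        # empty separator: items run together
piped = ' | '.join(found)            # any string can act as the separator

print(comma_separated)
print(run_together)
print(piped)
```

join is a method of the separator string, not of the list, and every item in the list has to be a string already.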
test_strings3 = ['Now Mr. N said, "Nooooooo!"',
                 'Then she told him he had to be quiet.']
patterns4 = [re.compile(r'No*'),
             re.compile(r'No+'),
             re.compile(r'No?'),
             re.compile(r'No{7}'),
             re.compile(r's?he'),
             re.compile(r'(she|he)')]
for test_string in test_strings3:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns4]
    for pattern in patterns4:
        if matches[patterns4.index(pattern)]:
            print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns4.index(pattern)])
        else:
            print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'
All potential matches for "No*" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, Nooooooo
All potential matches for "No+" in "Now Mr. N said, "Nooooooo!"" is/are: No, Nooooooo
All potential matches for "No?" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, No
All potential matches for "No{7}" in "Now Mr. N said, "Nooooooo!"" is/are: Nooooooo
There were no matches for "s?he" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "(she|he)" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "No*" in "Then she told him he had to be quiet.".
There were no matches for "No+" in "Then she told him he had to be quiet.".
There were no matches for "No?" in "Then she told him he had to be quiet.".
There were no matches for "No{7}" in "Then she told him he had to be quiet.".
All potential matches for "s?he" in "Then she told him he had to be quiet." is/are: he, she, he
All potential matches for "(she|he)" in "Then she told him he had to be quiet." is/are: he, she, he
In Python, SRE_Match objects have .groups and .group methods. These correspond to the capturing groups established in the regex, if you chose to indicate groups. By default, the 0th group is the entire match to the whole regex. To access the result for a capturing group, you pass the capturing group index to the .group method.
test_strings4 = ['The benefit is being held for Mr. Kite and Mr. Henderson.',
                 'Tickets cost $5.00 for adults, $3.50 for children.',
                 'Over 9000 attendees are expected, up from 900 attendees last year.',
                 'Over 9,000 attendees are expected, up from 900 attendees last year.']
patterns5 = [re.compile(r'Mr\. (\w+)'),
             re.compile(r'\$(\d+\.\d\d)'),
             re.compile(r'(\d+) attendees'),
             re.compile(r'((\d+,)*\d+) attendees')]
# simple example
matches = re.search(patterns5[3], test_strings4[3])
print 'Group 0: ' + matches.group(0)
print 'Group 1: ' + matches.group(1)
print 'Group 2: ' + matches.group(2)
#print 'Group 3: ' + matches.group(3) # what happens if you uncomment this?
Group 0: 9,000 attendees
Group 1: 9,000
Group 2: 9,
This example searched for r'((\d+,)*\d+) attendees' in the string "Over 9,000 attendees are expected, up from 900 attendees last year." There are two groups, one nested inside the other. Groups are numbered by the position of their opening parenthesis, counting from the left, so the outer group comes first. This is why Group 1 is 9,000 and Group 2 is 9, (with its trailing comma).
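The numbering rule, and the answer to the commented-out Group 3 line, can be checked with a short sketch. The input sentence is shortened from the tutorial's test string, and the variable outcome is a made-up name for recording what happens.

```python
import re

pattern = re.compile(r'((\d+,)*\d+) attendees')
match = pattern.search('Over 9,000 attendees are expected.')

print(match.group(0))   # the whole match
print(match.group(1))   # outer group: its "(" comes first
print(match.group(2))   # inner group: its "(" comes second
print(match.groups())   # all capturing groups as a tuple (group 0 is not included)

# Asking for a group number the pattern does not define raises IndexError.
try:
    match.group(3)
    outcome = 'no error'
except IndexError:
    outcome = 'IndexError'
print(outcome)
```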
for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):
            for i in range(pattern.groups+1):
                print 'In "' + test_string + '", ' + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i))
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 0 match is Mr. Kite
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 1 match is Kite
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 0 match is Mr. Henderson
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 1 match is Henderson
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 0 match is $5.00
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 1 match is 5.00
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 0 match is $3.50
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 1 match is 3.50
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 9000 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 9000
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 900 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 900
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 9000 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 9000
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 900 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 900
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 000 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 000
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 900 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 900
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 9,000 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 9,000
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is 9,
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 900 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 900
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None
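Before we go over this code block, let's establish the purpose of the code. I wanted to return all the matches for each group. But there are a few concerns with re.findall. When a pattern contains more than one capturing group, re.findall returns a list of tuples holding only the groups, so the full match (group 0) is missing, and a group that did not take part in a match shows up as an empty string rather than None: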
matches = re.findall(patterns5[3], test_strings4[3])
matches
[('9,000', '9,'), ('900', '')]
You can reach an element of a tuple within a list of tuples by adding a second index:
matches[0][0]
'9,000'
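How re.findall shapes its result depends on the number of capturing groups in the pattern. The sketch below walks through zero, one, and two groups on the ticket-price string; the two-group pattern is a variation invented for this illustration and is not one of the tutorial's patterns.

```python
import re

text = 'Tickets cost $5.00 for adults, $3.50 for children.'

# No capturing group: findall returns the full matches.
full = re.findall(r'\$\d+\.\d\d', text)

# One capturing group: findall returns only what the group captured.
one_group = re.findall(r'\$(\d+\.\d\d)', text)

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r'\$(\d+)\.(\d\d)', text)

print(full)
print(one_group)
print(pairs)
print(pairs[1][0])   # first group of the second match
```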
But there are other ways of constructing this kind of loop.
for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):
re.finditer returns an iterator, which is a new Python concept to you. This loop means that for every pattern and for each string we're testing, instead of creating a list of matches, we're going to create an iterator object that produces the results one at a time. Each result is an SRE_Match object, so it still has the .group method.
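A small sketch may make the iterator behavior more concrete. It uses the first pattern and string from the lists above; summary and leftover are made-up names.

```python
import re

pattern = re.compile(r'Mr\. (\w+)')
text = 'The benefit is being held for Mr. Kite and Mr. Henderson.'

results = pattern.finditer(text)

# Each item the iterator yields is a match object, so group(0) and
# group(1) are both available, unlike the plain strings from findall.
summary = []
for result in results:
    summary.append((result.group(0), result.group(1)))

print(summary)

# An iterator is used up after one pass; looping again yields nothing.
leftover = list(results)
print(leftover)
```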
for i in range(pattern.groups+1):
The .groups attribute of a compiled pattern holds the number of capturing groups in the regular expression. range is a function that will return a list of integers running from a start value to a stop value by a step value. If you just give it an int, by default it will treat that value as the stopping value and start from 0. Now, we add 1 to this value because the end point is omitted in range. If we want to return all the groups, we have to add that end point back.
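The off-by-one reasoning can be checked directly. In this sketch range is wrapped in list() so the numbers are visible under Python 3 as well, where range no longer returns a list; the variable names are made up for illustration.

```python
import re

pattern = re.compile(r'((\d+,)*\d+) attendees')

group_count = pattern.groups
without_plus_one = list(range(group_count))
with_plus_one = list(range(group_count + 1))

print(group_count)        # number of capturing groups in the pattern
print(without_plus_one)   # stops before 2, so group 2 would be skipped
print(with_plus_one)      # group 0 plus both capturing groups
```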
print 'In "' + test_string + '", ' + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i))
Because i runs over the group numbers, from 0 up to the number of capturing groups in the pattern, we can use i as the index of the group we'd like to return from the current match. That's why we can call result.group(i). We wrap it in str() because an unmatched group gives None, which cannot be added to a string.
In no way was this the only way to accomplish this task! I wanted to show you a few different functions in this tutorial, as well as introduce you to more examples where typical "Pythonic" code constructions are useful, such as list comprehensions and join. There are many ways of replicating all of these diagnostic printout examples.
Let's see how much you've learned. We're going to give you three strings that have a phone number in them. Your job is to write a regex that will return the full form of all of them.
phone_strings = ['Call Empire Carpets at 588-2300',
                 'Does Jenny live at 867 5309?',
                 'You can reach Mr. Plow at 636-555-3226']