Regex tutorial

You write regular expressions (regex) to match patterns in strings. When you are processing text, you may want to extract a substring of some predictable structure: a phone number, an email address, or something more specific to your research or task. You may also want to clean your text of some kind of junk: maybe there are repetitive formatting errors due to some transcription process that you need to remove.

In these cases and in many others like them, writing the right regex will be better than working by hand or using a magical third-party library/software that claims to do what you want.

Please refer back to the slides to see the building blocks of regex.

Character classes

Used to match any one of a specific set of characters
Defined using the [ and ] metacharacters
Within a character class, ^ and - can have special meaning (complement and range), depending on their position in the class
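
As a minimal sketch of the position rules above (the sample string is made up for illustration, and print() is called as a function so the snippet also runs on Python 3):

```python
import re

sample = "a^b-c"  # hypothetical sample string, not one of the tutorial's test strings

# ^ first in the class: complement, so match anything that is not a, b, or c
negated = re.findall(r'[^abc]', sample)
# ^ anywhere else in the class: a literal caret
literal_caret = re.findall(r'[abc^]', sample)
# - between two characters: a range (here a through c)
in_range = re.findall(r'[a-c]', sample)
# - first in the class: a literal hyphen
literal_hyphen = re.findall(r'[-ac]', sample)

print(negated)         # ['^', '-']
print(literal_caret)   # ['a', '^', 'b', 'c']
print(in_range)        # ['a', 'b', 'c']
print(literal_hyphen)  # ['a', '-', 'c']
```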

In [28]:

import re #the regex module in the python standard library

#strings to be searched for matching regex patterns
str1 = "Aardvarks belong to the Captain"
str2 = "Albert's famous equation, E = mc^2."
str3 = "Located at 455 Serra Mall."
str4 = "Beware of the shape-shifters!"
test_strings = [str1, str2, str3, str4] #created a list of strings

In [29]:

for test_string in test_strings:
    print 'The test string is "' + test_string + '"'
    match = re.search(r'[A-Z]', test_string)
    if match:
        print 'The first possible match is: ' + match.group()
    else:
        print 'no match.'

The test string is "Aardvarks belong to the Captain"
The first possible match is: A
The test string is "Albert's famous equation, E = mc^2."
The first possible match is: A
The test string is "Located at 455 Serra Mall."
The first possible match is: L
The test string is "Beware of the shape-shifters!"
The first possible match is: B

Let's go through the code above line by line:

for test_string in test_strings:

test_strings is a list, and so it is iterable in a for loop. Every element in this list is a string. So for the rest of the for loop, we will be referring to the current element as test_string.

print 'The test string is "' + test_string + '"'

This just prints out the current string we're iterating over.

match = re.search(r'[A-Z]', test_string)

Remember the basic approach to using regex in Python. You give a searcher (in this case, the function re.search()) a pattern and a string in which to find matches. That's exactly what this line does. re.search() returns either an object of type SRE_Match or None.

if match:
    print 'The first possible match is: ' + match.group()
else:
    print 'no match.'

match holds one of two things: an SRE_Match object or None. None evaluates to false in a logical test. In this if statement, we've basically told the Python interpreter to check whether match is None or not. If it isn't, we print a string plus match.group(). group() is a method that SRE_Match objects have. By default, it returns the 0th group; we'll get to what that means later. For now, just know that it returns the substring that matched the pattern.
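
The two outcomes of re.search() can be checked directly. This is a small sketch with one string that matches and one made-up string that does not; print() is called as a function so it also runs on Python 3:

```python
import re

hit = re.search(r'[A-Z]', "Located at 455 Serra Mall.")
miss = re.search(r'[A-Z]', "no capitals here")  # hypothetical string with no match

# A match object counts as true in a logical test; None counts as false.
print(bool(hit))     # True
print(miss is None)  # True

# group() with no argument is the same as group(0): the whole matched substring.
first = hit.group()
print(first)                   # L
print(first == hit.group(0))   # True
```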

Note that since we are using re.search, only a single character is returned. That's because of the following:

We only defined a single character pattern and
re.search finds the first possible match and then doesn't look for any more.

If you wanted to find all of the possible matches in a string, you can use re.findall(), which will return a list of all matches:

In [30]:

for string in test_strings:
    print re.findall(r'[A-Z]', string)

['A', 'C']
['A', 'E']
['L', 'S', 'M']
['B']

You can also compile your regex ahead of time with re.compile(). This creates SRE_Pattern objects. Compiling once and reusing the object avoids looking the pattern up again on every call, which can help when the same regex is applied to many strings. Additionally, you can create lists of these objects and iterate over both strings and patterns more easily. Here's an example:

In [31]:

patterns = [re.compile(r'[ABC]'),
re.compile(r'[^ABC]'),
re.compile(r'[ABC^]'),
re.compile(r'[0123456789]'),
re.compile(r'[0-9]'),
re.compile(r'[0-4]'),
re.compile(r'[A-Z]'),
re.compile(r'[A-Za-z]'),
re.compile(r'[A-Za-z0-9]'),
re.compile(r'[-a-z]'),
re.compile(r'[- a-z]')]

def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
    
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
  
    for pattern in patterns:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)]

The first potential match for "[ABC]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[^ABC]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[ABC^]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[0123456789]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[0-9]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[0-4]" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "[A-Z]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[A-Za-z]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[A-Za-z0-9]" in "Aardvarks belong to the Captain" is: A
The first potential match for "[-a-z]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[- a-z]" in "Aardvarks belong to the Captain" is: a
The first potential match for "[ABC]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[^ABC]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[ABC^]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[0123456789]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[0-9]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[0-4]" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "[A-Z]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[A-Za-z]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[A-Za-z0-9]" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "[-a-z]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[- a-z]" in "Albert's famous equation, E = mc^2." is: l
The first potential match for "[ABC]" in "Located at 455 Serra Mall." is: no match.
The first potential match for "[^ABC]" in "Located at 455 Serra Mall." is: L
The first potential match for "[ABC^]" in "Located at 455 Serra Mall." is: no match.
The first potential match for "[0123456789]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[0-9]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[0-4]" in "Located at 455 Serra Mall." is: 4
The first potential match for "[A-Z]" in "Located at 455 Serra Mall." is: L
The first potential match for "[A-Za-z]" in "Located at 455 Serra Mall." is: L
The first potential match for "[A-Za-z0-9]" in "Located at 455 Serra Mall." is: L
The first potential match for "[-a-z]" in "Located at 455 Serra Mall." is: o
The first potential match for "[- a-z]" in "Located at 455 Serra Mall." is: o
The first potential match for "[ABC]" in "Beware of the shape-shifters!" is: B
The first potential match for "[^ABC]" in "Beware of the shape-shifters!" is: e
The first potential match for "[ABC^]" in "Beware of the shape-shifters!" is: B
The first potential match for "[0123456789]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[0-9]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[0-4]" in "Beware of the shape-shifters!" is: no match.
The first potential match for "[A-Z]" in "Beware of the shape-shifters!" is: B
The first potential match for "[A-Za-z]" in "Beware of the shape-shifters!" is: B
The first potential match for "[A-Za-z0-9]" in "Beware of the shape-shifters!" is: B
The first potential match for "[-a-z]" in "Beware of the shape-shifters!" is: e
The first potential match for "[- a-z]" in "Beware of the shape-shifters!" is: e

Let's go over this code line by line:

patterns = [re.compile(r'[ABC]'),
re.compile(r'[^ABC]'),
re.compile(r'[ABC^]'),
re.compile(r'[0123456789]'),
re.compile(r'[0-9]'),
re.compile(r'[0-4]'),
re.compile(r'[A-Z]'),
re.compile(r'[A-Za-z]'),
re.compile(r'[A-Za-z0-9]'),
re.compile(r'[-a-z]'),
re.compile(r'[- a-z]')]

This creates a list of SRE_Patterns.
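
A compiled pattern also carries its own search and findall methods, so it can be used directly without going through the module-level functions. A small sketch (the variable names are illustrative):

```python
import re

capital = re.compile(r'[A-Z]')  # compiled once, reused below

text = "Located at 455 Serra Mall."
first = capital.search(text).group()  # same result as re.search(r'[A-Z]', text).group()
every = capital.findall(text)         # same result as re.findall(r'[A-Z]', text)
source = capital.pattern              # the original regex, as a string

print(first)   # L
print(every)   # ['L', 'S', 'M']
print(source)  # [A-Z]
```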

def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'

I defined a function find_match that takes two arguments, pattern and string. Notice that this function is very similar to the logical condition testing from the code above. Note also that this function returns either match.group() or the string "no match."

for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]

By defining the find_match() function above, I can then call it from within a list comprehension. In words, for each string test_string that is in test_strings, I want to compare against the list of patterns and return matches. The resulting list of matches should be the same length as patterns; one match per pattern tested.

    for pattern in patterns:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)]

Because I wanted to print some diagnostic output, I need to iterate over each pattern in patterns (a list and thus iterable) and print it out, along with the test string. If you want to get the pattern out of an SRE_Pattern object, you can read its .pattern attribute, which holds the regex pattern as a string. Since we are nesting this loop within the bigger loop above, this loop will go over every pattern in the patterns list for each string, and then repeat for the next string in the list test_strings.

However, note that I am dynamically referring to the index of the matches list. By this, I mean the following code:

matches[patterns.index(pattern)]

Make sure this makes sense to you. Remember, matches and patterns are the same length. That means that to get the match that corresponds to the current pattern, I look up the element of matches at the same index that the pattern has in patterns. Every list has an .index() method, which returns the position of the first element equal to the argument you pass. So to find where the regex r'[^ABC]' sits in patterns, I could use patterns.index(re.compile(r'[^ABC]')). This returns an int, which is the position of r'[^ABC]' in patterns. (This lookup relies on re.compile handing back a pattern object that compares equal to the one already in the list, which the re module's internal cache provides for recently compiled patterns.)

In [26]:

print patterns.index(re.compile(r'[^ABC]'))
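
The .index() lookup is not the only way to pair each pattern with its match. As one alternative sketch, zip walks both lists in step. The helper below mirrors find_match from the walkthrough, and the two-pattern list is a shortened, illustrative stand-in for patterns:

```python
import re

def find_match(pattern, string):
    # Same behaviour as the find_match helper in the walkthrough
    match = re.search(pattern, string)
    return match.group() if match else 'no match.'

sample_patterns = [re.compile(r'[ABC]'), re.compile(r'[0-9]')]  # illustrative subset
test_string = "Located at 455 Serra Mall."

matches = [find_match(p, test_string) for p in sample_patterns]

# zip pairs the first pattern with the first match, the second with the second, and so on
pairs = [(p.pattern, m) for p, m in zip(sample_patterns, matches)]
for source, found in pairs:
    print('"' + source + '" -> ' + found)
```

Pairing by position this way does not depend on looking a pattern up again, so it also behaves as expected if the same pattern appears in the list twice.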

Pre-defined character classes: shorthand

In [32]:

patterns2 = [re.compile(r'.'),
re.compile(r'\w'),
re.compile(r'\W'),
re.compile(r'\d'),
re.compile(r'\D'),
re.compile(r'\n'),
re.compile(r'\r'),
re.compile(r'\t'),
re.compile(r'\f'),
re.compile(r'\s')]

test_strings.append('Aardvarks belong to the Captain, capt_hook')

for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns2]

    for pattern in patterns2:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns2.index(pattern)]

The first potential match for "." in "Aardvarks belong to the Captain" is: A
The first potential match for "\w" in "Aardvarks belong to the Captain" is: A
The first potential match for "\W" in "Aardvarks belong to the Captain" is:  
The first potential match for "\d" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\D" in "Aardvarks belong to the Captain" is: A
The first potential match for "\n" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\r" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\t" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\f" in "Aardvarks belong to the Captain" is: no match.
The first potential match for "\s" in "Aardvarks belong to the Captain" is:  
The first potential match for "." in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\w" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\W" in "Albert's famous equation, E = mc^2." is: '
The first potential match for "\d" in "Albert's famous equation, E = mc^2." is: 2
The first potential match for "\D" in "Albert's famous equation, E = mc^2." is: A
The first potential match for "\n" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\r" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\t" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\f" in "Albert's famous equation, E = mc^2." is: no match.
The first potential match for "\s" in "Albert's famous equation, E = mc^2." is:  
The first potential match for "." in "Located at 455 Serra Mall." is: L
The first potential match for "\w" in "Located at 455 Serra Mall." is: L
The first potential match for "\W" in "Located at 455 Serra Mall." is:  
The first potential match for "\d" in "Located at 455 Serra Mall." is: 4
The first potential match for "\D" in "Located at 455 Serra Mall." is: L
The first potential match for "\n" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\r" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\t" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\f" in "Located at 455 Serra Mall." is: no match.
The first potential match for "\s" in "Located at 455 Serra Mall." is:  
The first potential match for "." in "Beware of the shape-shifters!" is: B
The first potential match for "\w" in "Beware of the shape-shifters!" is: B
The first potential match for "\W" in "Beware of the shape-shifters!" is:  
The first potential match for "\d" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\D" in "Beware of the shape-shifters!" is: B
The first potential match for "\n" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\r" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\t" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\f" in "Beware of the shape-shifters!" is: no match.
The first potential match for "\s" in "Beware of the shape-shifters!" is:  
The first potential match for "." in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\w" in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\W" in "Aardvarks belong to the Captain, capt_hook" is:  
The first potential match for "\d" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\D" in "Aardvarks belong to the Captain, capt_hook" is: A
The first potential match for "\n" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\r" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\t" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\f" in "Aardvarks belong to the Captain, capt_hook" is: no match.
The first potential match for "\s" in "Aardvarks belong to the Captain, capt_hook" is:
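
For plain ASCII text like the test strings above, the shorthands behave like explicit character classes. A small sketch comparing them on an illustrative string follows. Note the assumption: on Python 3, \w and \d also match non-ASCII letters and digits unless re.ASCII is passed, so the equivalence is only claimed for ASCII input.

```python
import re

sample = "capt_hook 42!"  # illustrative ASCII-only string

word_short = re.findall(r'\w', sample)
word_long = re.findall(r'[A-Za-z0-9_]', sample)  # letters, digits, underscore
digit_short = re.findall(r'\d', sample)
digit_long = re.findall(r'[0-9]', sample)
nonword = re.findall(r'\W', sample)              # everything \w does not match

print(word_short == word_long)    # True
print(digit_short == digit_long)  # True
print(nonword)                    # [' ', '!']
```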

Matching character sequences

In [33]:

test_strings2 = ["The Aardvarks belong to the Captain.",
                 "Bitter butter won't make the batter better.",
                 "Hark, the pitter patter of little feet!"]

patterns3 = [re.compile(r'Aa'),
re.compile(r'[Aa][Aa]'),
re.compile(r'[aeiou][aeiou]'),
re.compile(r'[AaEeIiOoUu][aeiou]'),
re.compile(r'[Tt]he'),
re.compile(r'^[Tt]he'),
re.compile(r'n.'),
re.compile(r'n.$'),
re.compile(r'\W\w'),
re.compile(r'\w[aeiou]tter'),
re.compile(r'\w[aeiou]tter'),
re.compile(r'..tt..')]

for test_string in test_strings2:
    matches = [find_match(pattern, test_string) for pattern in patterns3]

    for pattern in patterns3:
        print 'The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns3.index(pattern)]

The first potential match for "Aa" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[Aa][Aa]" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[aeiou][aeiou]" in "The Aardvarks belong to the Captain." is: ai
The first potential match for "[AaEeIiOoUu][aeiou]" in "The Aardvarks belong to the Captain." is: Aa
The first potential match for "[Tt]he" in "The Aardvarks belong to the Captain." is: The
The first potential match for "^[Tt]he" in "The Aardvarks belong to the Captain." is: The
The first potential match for "n." in "The Aardvarks belong to the Captain." is: ng
The first potential match for "n.$" in "The Aardvarks belong to the Captain." is: n.
The first potential match for "\W\w" in "The Aardvarks belong to the Captain." is:  A
The first potential match for "\w[aeiou]tter" in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "\w[aeiou]tter" in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "..tt.." in "The Aardvarks belong to the Captain." is: no match.
The first potential match for "Aa" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[Aa][Aa]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[aeiou][aeiou]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[AaEeIiOoUu][aeiou]" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "[Tt]he" in "Bitter butter won't make the batter better." is: the
The first potential match for "^[Tt]he" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "n." in "Bitter butter won't make the batter better." is: n'
The first potential match for "n.$" in "Bitter butter won't make the batter better." is: no match.
The first potential match for "\W\w" in "Bitter butter won't make the batter better." is:  b
The first potential match for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "..tt.." in "Bitter butter won't make the batter better." is: Bitter
The first potential match for "Aa" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "[Aa][Aa]" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "[aeiou][aeiou]" in "Hark, the pitter patter of little feet!" is: ee
The first potential match for "[AaEeIiOoUu][aeiou]" in "Hark, the pitter patter of little feet!" is: ee
The first potential match for "[Tt]he" in "Hark, the pitter patter of little feet!" is: the
The first potential match for "^[Tt]he" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "n." in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "n.$" in "Hark, the pitter patter of little feet!" is: no match.
The first potential match for "\W\w" in "Hark, the pitter patter of little feet!" is:  t
The first potential match for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is: pitter
The first potential match for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is: pitter
The first potential match for "..tt.." in "Hark, the pitter patter of little feet!" is: pitter
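
Several of the patterns above differ only by an anchor. As a short sketch of what ^ and $ change, using one of the test strings (the pattern r'.t!$' is an extra example, not one of the tutorial's patterns):

```python
import re

text = "Hark, the pitter patter of little feet!"

anywhere = re.search(r'[Tt]he', text)   # matches "the" wherever it occurs
at_start = re.search(r'^[Tt]he', text)  # only matches at the start of the string
at_end = re.search(r'.t!$', text)       # must end exactly at the end of the string

print(anywhere.group())  # the
print(at_start)          # None
print(at_end.group())    # et!
```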

Matching character sequences

In [35]:

def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None

for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
    
    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
            print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)])
        else:
            print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'

All potential matches for "Aa" in "The Aardvarks belong to the Captain." is/are: Aa
All potential matches for "[Aa][Aa]" in "The Aardvarks belong to the Captain." is/are: Aa
All potential matches for "[aeiou][aeiou]" in "The Aardvarks belong to the Captain." is/are: ai
All potential matches for "[AaEeIiOoUu][aeiou]" in "The Aardvarks belong to the Captain." is/are: Aa, ai
All potential matches for "[Tt]he" in "The Aardvarks belong to the Captain." is/are: The, the
All potential matches for "^[Tt]he" in "The Aardvarks belong to the Captain." is/are: The
All potential matches for "n." in "The Aardvarks belong to the Captain." is/are: ng, n.
All potential matches for "n.$" in "The Aardvarks belong to the Captain." is/are: n.
All potential matches for "\W\w" in "The Aardvarks belong to the Captain." is/are:  A,  b,  t,  t,  C
There were no matches for "\w[aeiou]tter" in "The Aardvarks belong to the Captain.".
There were no matches for "\w[aeiou]tter" in "The Aardvarks belong to the Captain.".
There were no matches for "..tt.." in "The Aardvarks belong to the Captain.".
There were no matches for "Aa" in "Bitter butter won't make the batter better.".
There were no matches for "[Aa][Aa]" in "Bitter butter won't make the batter better.".
There were no matches for "[aeiou][aeiou]" in "Bitter butter won't make the batter better.".
There were no matches for "[AaEeIiOoUu][aeiou]" in "Bitter butter won't make the batter better.".
All potential matches for "[Tt]he" in "Bitter butter won't make the batter better." is/are: the
There were no matches for "^[Tt]he" in "Bitter butter won't make the batter better.".
All potential matches for "n." in "Bitter butter won't make the batter better." is/are: n'
There were no matches for "n.$" in "Bitter butter won't make the batter better.".
All potential matches for "\W\w" in "Bitter butter won't make the batter better." is/are:  b,  w, 't,  m,  t,  b,  b
All potential matches for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
All potential matches for "\w[aeiou]tter" in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
All potential matches for "..tt.." in "Bitter butter won't make the batter better." is/are: Bitter, butter, batter, better
There were no matches for "Aa" in "Hark, the pitter patter of little feet!".
There were no matches for "[Aa][Aa]" in "Hark, the pitter patter of little feet!".
All potential matches for "[aeiou][aeiou]" in "Hark, the pitter patter of little feet!" is/are: ee
All potential matches for "[AaEeIiOoUu][aeiou]" in "Hark, the pitter patter of little feet!" is/are: ee
All potential matches for "[Tt]he" in "Hark, the pitter patter of little feet!" is/are: the
There were no matches for "^[Tt]he" in "Hark, the pitter patter of little feet!".
There were no matches for "n." in "Hark, the pitter patter of little feet!".
There were no matches for "n.$" in "Hark, the pitter patter of little feet!".
All potential matches for "\W\w" in "Hark, the pitter patter of little feet!" is/are:  t,  p,  p,  o,  l,  f
All potential matches for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is/are: pitter, patter
All potential matches for "\w[aeiou]tter" in "Hark, the pitter patter of little feet!" is/are: pitter, patter
All potential matches for "..tt.." in "Hark, the pitter patter of little feet!" is/are: pitter, patter, little

We have a new function and some new code. Let's go over it:

First, I wrote a function called find_all_matches:

def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None

There are only two differences between find_match and find_all_matches. First, find_all_matches uses re.findall, not re.search, so matches is a list of all non-overlapping matches. Second, instead of returning a single string in either condition, find_all_matches returns either a list of strings or None.

for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]

    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:

Remember the use of .index() from the previous code walkthrough. Also, remember that None evaluates to false in a logical condition test. In this if statement, I'm testing to see if there were any matches for the current pattern in the loop. If there were any matches, the code will execute the next line. Otherwise, it will go to the else block.

            print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)])

If matches at the index of the current pattern is not None, it will be a list of strings. Because I'm printing these results, I wanted to nicely format them for diagnostic purposes. So we use the standard list-to-string Python expression separator.join(list). In this case, I wanted the results to be comma-separated, so the separator is ', '.
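
A quick sketch of how the separator changes the result of join (the word list is copied from one of the outputs above):

```python
words = ['Bitter', 'butter', 'batter', 'better']

# The separator is the string that join is called on.
comma_separated = ', '.join(words)
# An empty separator concatenates the items directly.
run_together = ''.join(words)

print(comma_separated)  # Bitter, butter, batter, better
print(run_together)     # Bitterbutterbatterbetter
```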

        else:
            print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'

Quantification and grouping

In [8]:

test_strings3 = ['Now Mr. N said, "Nooooooo!"',
                 'Then she told him he had to be quiet.']

patterns4 = [re.compile(r'No*'),
re.compile(r'No+'),
re.compile(r'No?'),
re.compile(r'No{7}'),
re.compile(r's?he'),
re.compile(r'(she|he)')]

for test_string in test_strings3:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns4]
    
    for pattern in patterns4:
        if matches[patterns4.index(pattern)]:
            print 'All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns4.index(pattern)])
        else:
            print 'There were no matches for "' + pattern.pattern + '" in "' + test_string + '".'

All potential matches for "No*" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, Nooooooo
All potential matches for "No+" in "Now Mr. N said, "Nooooooo!"" is/are: No, Nooooooo
All potential matches for "No?" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, No
All potential matches for "No{7}" in "Now Mr. N said, "Nooooooo!"" is/are: Nooooooo
There were no matches for "s?he" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "(she|he)" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "No*" in "Then she told him he had to be quiet.".
There were no matches for "No+" in "Then she told him he had to be quiet.".
There were no matches for "No?" in "Then she told him he had to be quiet.".
There were no matches for "No{7}" in "Then she told him he had to be quiet.".
All potential matches for "s?he" in "Then she told him he had to be quiet." is/are: he, she, he
All potential matches for "(she|he)" in "Then she told him he had to be quiet." is/are: he, she, he
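
To see each quantifier in isolation, here is a small sketch on a made-up string in which the number of o's grows by one each time. The pattern r'No{2}' is a shorter variant of the r'No{7}' used above.

```python
import re

sample = "N No Noo Nooo"  # hypothetical string: zero, one, two, three o's

star = re.findall(r'No*', sample)      # N followed by zero or more o's
plus = re.findall(r'No+', sample)      # N followed by one or more o's
optional = re.findall(r'No?', sample)  # N followed by zero or one o
exact = re.findall(r'No{2}', sample)   # N followed by exactly two o's

print(star)      # ['N', 'No', 'Noo', 'Nooo']
print(plus)      # ['No', 'Noo', 'Nooo']
print(optional)  # ['N', 'No', 'No', 'No']
print(exact)     # ['Noo', 'Noo']
```

Note that r'No{2}' still finds a match inside "Nooo": the regex takes the first two o's and simply leaves the third one unmatched.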

Capturing groups

In Python, SRE_Match objects have .groups and .group methods. These correspond to the capturing groups established in the regex, if you chose to indicate groups. By default, the 0th group is the entire match to the whole regex. To access the result for a capturing group, you pass the capturing group index to the .group method.
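
A minimal sketch of the two methods on a single match, using a shortened, illustrative version of one of the test strings below:

```python
import re

match = re.search(r'Mr\. (\w+)', 'The benefit is being held for Mr. Kite.')

whole = match.group(0)       # the entire match
name = match.group(1)        # the first (and only) capturing group
all_groups = match.groups()  # tuple of all capturing groups, without group 0

print(whole)       # Mr. Kite
print(name)        # Kite
print(all_groups)  # ('Kite',)
```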

In [37]:

test_strings4 = ['The benefit is being held for Mr. Kite and Mr. Henderson.',
                 'Tickets cost $5.00 for adults, $3.50 for children.',
                 'Over 9000 attendees are expected, up from 900 attendees last year.',
                 'Over 9,000 attendees are expected, up from 900 attendees last year.']

patterns5 = [re.compile(r'Mr\. (\w+)'),
re.compile(r'\$(\d+\.\d\d)'),
re.compile(r'(\d+) attendees'),
re.compile(r'((\d+,)*\d+) attendees')]

In [10]:

# simple example

matches = re.search(patterns5[3], test_strings4[3])
print 'Group 0: ' + matches.group(0)
print 'Group 1: ' + matches.group(1)
print 'Group 2: ' + matches.group(2)
#print 'Group 3: ' + matches.group(3) # what happens if you uncomment this?

Group 0: 9,000 attendees
Group 1: 9,000
Group 2: 9,

This example searched for r'((\d+,)*\d+) attendees' in the string "Over 9,000 attendees are expected, up from 900 attendees last year." There are two groups, one nested inside the other. Groups are numbered by the position of their opening parenthesis, counting from the left, so the outer group comes first. This is why Group 1 is 9,000 and Group 2 is 9,.
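
The numbering rule, and what happens when you ask for a group that does not exist, can be sketched as follows. The ticket string and the three-group pattern are made up for illustration.

```python
import re

# Opening parentheses, left to right: group 1 is the outer pair,
# group 2 and group 3 are the two inner pairs.
match = re.search(r'((\d+)-(\d+)) tickets', 'Rows 12-15 tickets')
outer = match.group(1)
left = match.group(2)
right = match.group(3)

# Asking for a group number the pattern does not define raises IndexError.
try:
    match.group(4)
    missing_group_error = False
except IndexError:
    missing_group_error = True

# A group that repeats keeps only its last repetition.
repeated = re.search(r'((\d+,)*\d+)', '1,234,567')
last_repetition = repeated.group(2)

print(outer, left, right)   # 12-15 12 15
print(missing_group_error)  # True
print(last_repetition)      # 234,
```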

In [46]:

for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):
            for i in range(pattern.groups+1):
                
                print 'In "' + test_string + '", '  + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i))

In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 0 match is Mr. Kite
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 1 match is Kite
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 0 match is Mr. Henderson
In "The benefit is being held for Mr. Kite and Mr. Henderson.", given pattern "Mr\. (\w+)", the group 1 match is Henderson
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 0 match is $5.00
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 1 match is 5.00
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 0 match is $3.50
In "Tickets cost $5.00 for adults, $3.50 for children.", given pattern "\$(\d+\.\d\d)", the group 1 match is 3.50
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 9000 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 9000
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 900 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 900
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 9000 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 9000
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 900 attendees
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 900
In "Over 9000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 000 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 000
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 0 match is 900 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "(\d+) attendees", the group 1 match is 900
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 9,000 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 9,000
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is 9,
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 0 match is 900 attendees
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 1 match is 900
In "Over 9,000 attendees are expected, up from 900 attendees last year.", given pattern "((\d+,)*\d+) attendees", the group 2 match is None

Before we go over this code block, let's establish the purpose of the code. I wanted to return all the matches for each group. But there are a few concerns:

The number of groups is different for each pattern, so I can't hardcode the number of times to loop. In other words, the number of iterations has to be determined dynamically, depending on which regex pattern is currently being tested.
re.findall returns a list of matches. If the pattern has a single capturing group, it returns a list of that group's strings; if it has several groups, it returns a list of tuples, where each tuple has one element per capturing group.

In [47]:

matches = re.findall(patterns5[3], test_strings4[3])
matches

Out[47]:

[('9,000', '9,'), ('900', '')]

You can pick out one element of one tuple in a list of tuples by indexing twice: first the position in the list, then the position in the tuple:

In [48]:

matches[0][0]

Out[48]:

'9,000'
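
How re.findall shapes its result depends on the number of capturing groups in the pattern. A short sketch on an illustrative string (the first two patterns are simplified variants written for this comparison):

```python
import re

text = 'Over 9,000 attendees'

no_groups = re.findall(r'\d+,\d+ attendees', text)        # list of whole matches
one_group = re.findall(r'(\d+,\d+) attendees', text)      # list of group 1 strings
two_groups = re.findall(r'((\d+,)*\d+) attendees', text)  # list of tuples

print(no_groups)   # ['9,000 attendees']
print(one_group)   # ['9,000']
print(two_groups)  # [('9,000', '9,')]
```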

But there are other ways of constructing this kind of loop.

for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):

re.finditer returns an iterator, which may be a new Python concept to you. This loop means that for every pattern and for each string we're testing, instead of creating a list of matches, we're going to create an iterator object that yields the match objects one at a time.
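
As a rough sketch of what working with that iterator looks like, using one of the test strings (the names found and leftover are illustrative):

```python
import re

text = 'Tickets cost $5.00 for adults, $3.50 for children.'
results = re.finditer(r'\$(\d+\.\d\d)', text)

# Each item the iterator yields is a full match object, so group() is available.
found = []
for result in results:
    found.append((result.group(0), result.group(1)))

# An iterator can only be walked once; a second pass finds nothing left.
leftover = list(results)

print(found)     # [('$5.00', '5.00'), ('$3.50', '3.50')]
print(leftover)  # []
```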

            for i in range(pattern.groups+1):

The .groups attribute of a compiled pattern holds the number of capturing groups in the regular expression. range is a function that returns a list of integers running from a start value up to a stop value in increments of a step value. If you just give it an int, it treats that value as the stopping value and starts from 0. We add 1 to this value because the end point is omitted in range. If we want to return all the groups, we have to add that end point back.
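
The arithmetic can be checked on its own. In this sketch, list() is wrapped around range so the values print the same way on Python 3, where range no longer returns a list:

```python
import re

pattern = re.compile(r'((\d+,)*\d+) attendees')

n_groups = pattern.groups                  # number of capturing groups in the pattern
without_plus_one = list(range(n_groups))   # stops before the last group number
indices = list(range(n_groups + 1))        # 0 (whole match) through the last group

print(n_groups)          # 2
print(without_plus_one)  # [0, 1]
print(indices)           # [0, 1, 2]
```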

                print 'In "' + test_string + '", '  + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i))

Because i runs over every group number of the current pattern, from 0 up to pattern.groups, we can use i as the index for which group we'd like to return from the current match produced by the iterator. That's why we can call result.group(i).

In no way was this the only way to accomplish this task! I wanted to show you a few different functions in this tutorial, as well as introduce you to more examples where typical "Pythonic" code constructions are useful, such as list comprehensions and join. There are many ways of replicating all of these diagnostic printout examples.

Final exercise

Let's see how much you've learned. We're going to give you three strings that each have a phone number in them. Your job is to write a regex that will return the full phone number from each of them.

In [12]:

phone_strings = ['Call Empire Carpets at 588-2300',
'Does Jenny live at 867 5309?',
'You can reach Mr. Plow at 636-555-3226']