Week 3: Regular expressions

In [1]:
from tock import *

Monday reading

Read Section 1.3, but you can save Lemma 1.60 and its proof for next time.

Tuesday class

Regular expressions

Regular expressions were invented by Stephen Kleene (pronounced clay-knee) back in the 1950s as a characterization of the languages recognized by the earliest neural networks. But they became widely used through various Unix tools, like grep, which is where you have likely encountered them.

Regex Golf

There are some differences between Unix regular expressions and true regular expressions. True regular expressions have three operations: concatenation, union ($\cup$ in the book, | in Unix), and Kleene star (*). The order of operations is star, then union, then concatenation. Use parentheses to change the order of operations, just as in arithmetic expressions.

Unix regular expressions use a dot (.) to match any symbol; the book uses $\Sigma$ for this purpose.

Another difference is that Unix regular expressions usually match anywhere in a string, whereas true regular expressions usually must match the entire string. When using grep, use the -Ex flags to approximate true regular expressions (-E to get the union operator, -x to match the whole line).

In [2]:
%%file movies.txt
the phantom menace
attack of the clones
revenge of the sith
a new hope
the empire strikes back
return of the jedi
the force awakens
rogue one
the last jedi
the rise of skywalker
the motion picture
the wrath of khan
the search for spock
the voyage home
the final frontier
the undiscovered country
first contact
into darkness
In [1]:
!grep -Ex '.*(m | (t|n)|b).*' movies.txt
the phantom menace
attack of the clones
revenge of the sith
a new hope
the empire strikes back
return of the jedi

Almost right.

Question. Can you fix the regular expression so that it also accepts the force awakens, rogue one, the last jedi, solo, and the rise of skywalker, but not beyond?

Converting regular expressions to NFAs

Regular expressions are the third "model of computation" of this course, and they, too, are equivalent to both DFAs and NFAs. The proof involves algorithms to convert between regular expressions and NFAs. Today we're converting regular expressions to NFAs, which is not as difficult as the subset construction from last time.

The algorithm in the book is a slight variation on the algorithm from a paper by Ken Thompson, one of the co-developers of Unix, for the QED editor, from which are descended sed, grep, and vi. (But many modern regular expression engines actually do not use this algorithm; they use one that is asymptotically much slower!)

Here's a step by step visualization of the example from the book. You can try different regular expressions to see how they get converted. (Note that in Tock, unlike in Unix tools, symbols in regular expressions are separated by spaces.)

In [4]:
m = from_regexp('(a b|a)*', display_steps=True)
subexpression: a
%3 _START 0 q1 _START->0 1 q2 0->1 a
subexpression: b
%3 _START 1 q3 _START->1 0 q4 1->0 b
subexpression: a b
%3 _START 2 q1 _START->2 0 q2 1 q3 0->1 ε 3 q4 1->3 b 2->0 a
subexpression: a
%3 _START 1 q5 _START->1 0 q6 1->0 a
subexpression: a b|a
%3 _START 4 q7 _START->4 0 q1 1 q2 0->1 a 2 q3 1->2 ε 5 q4 2->5 b 3 q5 6 q6 3->6 a 4->0 ε 4->3 ε
subexpression: (a b|a)*
%3 _START 3 q8 _START->3 0 q1 1 q2 0->1 a 2 q3 1->2 ε 7 q4 2->7 b 5 q7 3->5 ε 4 q5 6 q6 4->6 a 5->0 ε 5->4 ε 6->5 ε 7->5 ε
In [5]:
m = from_regexp('(a|b)* a b a', display_steps=True)
subexpression: a
%3 _START 1 q01 _START->1 0 q02 1->0 a
subexpression: b
%3 _START 0 q03 _START->0 1 q04 0->1 b
subexpression: a|b
%3 _START 1 q05 _START->1 0 q03 3 q04 0->3 b 1->0 ε 4 q01 1->4 ε 2 q02 4->2 a
subexpression: (a|b)*
%3 _START 3 q06 _START->3 0 q03 1 q04 0->1 b 4 q05 1->4 ε 2 q02 2->4 ε 3->4 ε 4->0 ε 5 q01 4->5 ε 5->2 a
subexpression: a
%3 _START 1 q07 _START->1 0 q08 1->0 a
subexpression: b
%3 _START 0 q09 _START->0 1 q10 0->1 b
subexpression: a
%3 _START 1 q11 _START->1 0 q12 1->0 a
subexpression: (a|b)* a b a
%3 _START 6 q06 _START->6 0 q04 2 q05 0->2 ε 4 q07 0->4 ε 1 q03 1->0 b 2->1 ε 10 q01 2->10 ε 3 q11 11 q12 3->11 a 5 q08 4->5 a 7 q09 5->7 ε 6->2 ε 6->4 ε 8 q10 7->8 b 8->3 ε 9 q02 9->2 ε 9->4 ε 10->9 a

Question. If a regular expression has $n$ symbols, how big will the resulting NFA be?

For a very cool "real-time" visualization of (a slightly different version of) this construction, see Debuggex.

Wednesday reading

Read or reread Lemma 1.60 and its proof, which (in my opinion) is the most complicated one so far.

Thursday class

Please have your teams for the programming project registered by the end of this week.

NFAs to regular expressions

Today we're completing the proof from last time by converting NFAs to regular expressions. The algorithm for this is known as the state elimination algorithm, because it eliminates the states of the NFA one by one.

To make this construction simpler, the new concept of a generalized NFA (GNFA) is introduced. In this course, GNFAs are a "throwaway" formalism; they exist just to make this proof possible. But elsewhere they might be useful in their own right. A GNFA is a NFA whose transitions can be labeled with regular expressions instead of symbols. A GNFA can follow a transition if the next $k \geq 0$ symbols match the regular expression; when it follows the transition, it consumes those $k$ symbols.

The algorithm first changes the NFA into a GNFA that has a single start state with no incoming edges and a single accept state with no outgoing edges. This makes the final step of the algorithm simpler. The book also creates a lot of transitions labeled $\emptyset$; these are not shown in the figures below, and they're not shown in the book, either. In fact, as far as I'm concerned, you can pretend they don't exist.

The algorithm then eliminates the (non-start, non-accept) states one by one. When there is just one transition left, its label is the final answer. (If you're not using $\emptyset$ transitions, then there is a special case: if there are no transitions left, then the answer is $\emptyset$.) The order of elimination is arbitrary, although some orders may lead to more compact regular expressions than others. Tock eliminates states in reverse alphabetical order, for no particularly good reason.

Let's start with a simpler example than the one in the book, a NFA whose state diagram is acyclic.

In [6]:
m = read_csv('acyclic.csv')
%3 _START 0 1 _START->0 2 2 0->2 a 3 4 0->3 a 1 3 1->3 c 2->1 b
In [7]:
to_regexp(m, display_steps=True)
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 4 2 2->4 a 5 4 2->5 a 3 3 3->5 c 4->3 b 5->1 ε
eliminate 4
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 2->1 a 4 2 2->4 a 3 3 3->1 c 4->3 b
eliminate 3
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 2->1 a 3 2 2->3 a 3->1 b c
eliminate 2
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 2->1 a|a b c
eliminate 1
%3 _START 0 start _START->0 1 accept 0->1 a|a b c
a|a b c

Eliminating 4 is not very interesting; let's skip it for now. Eliminating 3 concatenates the b and c transitions into a single bc transition. Then, eliminating 2 concatenates the a and bc transitions into a single abc transition. But now there are two parallel transitions, from state 1 to the accept state. So these are unioned into a single a|abc transition. Finally, eliminating 1 is not very interesting.

The general procedure for eliminating a state is shown in Figure 1.63. Now here's an example with a cycle in it; these give rise to Kleene stars in the resulting expression.

In [8]:
m66 = read_csv('sipser-1-66.csv')
%3 _START 0 1 _START->0 0->0 a 1 2 0->1 b 1->1 a b
In [9]:
to_regexp(m66, display_steps=True)
%3 _START 0 start _START->0 3 1 0->3 ε 1 accept 2 2 2->1 ε 2->2 a|b 3->2 b 3->3 a
eliminate 2
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 2->1 b (a|b)* 2->2 a
eliminate 1
%3 _START 0 start _START->0 1 accept 0->1 a* b (a|b)*
a* b (a|b)*

Eliminating state 2 looks similar to before, but the self-loops turn into subexpressions inside Kleene stars.

One more example:

In [10]:
m68 = read_csv('sipser-1-68.csv')
%3 _START 0 1 _START->0 1 3 0->1 b 2 2 0->2 a 1->0 b 1->2 a 2->0 a 2->2 b
In [11]:
to_regexp(m68, display_steps=True)
%3 _START 0 start _START->0 4 1 0->4 ε 1 accept 2 2 2->1 ε 2->2 b 2->4 a 3 3 3->1 ε 3->2 a 3->4 b 4->2 a 4->3 b
eliminate 3
%3 _START 0 start _START->0 3 1 0->3 ε 1 accept 2 2 2->1 ε 2->2 b 2->3 a 3->1 b 3->2 a|b a 3->3 b b
eliminate 2
%3 _START 0 start _START->0 2 1 0->2 ε 1 accept 2->1 b|(a|b a) b* 2->2 b b|(a|b a) b* a
eliminate 1
%3 _START 0 start _START->0 1 accept 0->1 (b b|(a|b a) b* a)* (b|(a|b a) b*)
(b b|(a|b a) b* a)* (b|(a|b a) b*)

(The answer is different from the book's because the elimination order was different.) The thing to note here is that if there are multiple incoming transitions and/or multiple outgoing transitions, you must consider all possible combinations. For example, when eliminating state 3, there are two outgoing transitions (to states 1 and 2), so a self-loop from 1 to 1 is created, and the transition from 1 to 2 is modified.