from tock import *
Read Section 2.1, but skip "Chomsky Normal Form."
We're beginning the second unit of the course, in which we look at context-free grammars and pushdown automata. Recall that in the first class we briefly introduced Turing machines, and then introduced finite automata as being like Turing machines but restricted to have only a one-way, read-only tape. Pushdown automata are a little bit less restricted: they have a one-way, read-only tape, but they also have a stack. But the book starts with context-free grammars, probably because they're a little more familiar and/or easier to understand.
Context-free grammars were invented in the late 1950s by Noam Chomsky as a way of describing the syntax of human languages (he called them phrase-structure grammars). Later, they were appropriate by the inventors of the programming language ALGOL-60 as a way to describe that language's syntax (which came to be called Backus-Naur form).
There is some variation in terminology that the book doesn't mention. Most importantly, variables are also called nonterminal symbols or nonterminals. You'll hear me call them nonterminals almost exclusively -- I can't get myself to call them variables.
We can start with a simpler example than the book's. \begin{align*} S &\rightarrow \mathtt{a}~S~\mathtt{b} \\ S &\rightarrow \varepsilon \end{align*} Question. What language does this CFG generate?
And here's a version of Example 2.3 that is rewritten using parentheses: \begin{align*} S &\rightarrow \mathtt{(}~S~\mathtt{)} \\ S &\rightarrow S~S \\ S &\rightarrow \varepsilon \end{align*}
You can see a CFG for C in the draft standard, starting on page 458, or as a YACC grammar.
Generalized Phrase Structure Grammar was an attempt to write a CFG for English. The Berkeley Parser builds a CFG for English partly using human input and partly automatically.
L-systems are a kind of visual grammar; context-free L-systems can be used to draw surprisingly natural-looking images, especially of trees:
A language is context-free iff it can be generated by a context-free grammar.
Question. Show that context-free languages are closed under union, concatenation, and Kleene star.
We will see later that context-free languagse are not closed under intersection or complement.
If a string can be generated by a CFG, we can write the rules used to generate it as a tree (also called a derivation, a syntax tree, a parse tree, or a phrase-structure tree). This abstracts away from variations in the ordering of rules that we don't care about.
However, it can be that a string has two or more trees. This is known as ambiguity. Ambiguity is pervasive in natural language, but in programming languages is considered a very bad thing.
To a certain extent, we can eliminate ambiguity by rewriting the grammar. Example 2.4 shows how to implement operator precedence in a CFG. In CP2, you'll write a CFG for regular expressions using the same technique.
Parsing is the problem of deciding whether a string can be generated by CFG, and if so, what is its tree (or trees). This is a huge subject. In CP2, you'll write a simple parser, but we leave a full treatment for other courses (Compilers for programming languages, NLP for natural languages).
Read Section 2.2, up to but not including "Equivalence with Context-Free Grammars".
Pushdown automata are equipped with, in addition to the input tape, a pushdown store, said to have been inspired by the tray dispenser in a university dining hall. Nowadays, a pushdown store is more commonly known as a stack, which you should be quite familiar with. Every transition can read an input symbol (or $\varepsilon$), but it can also pop a symbol from the stack and/or push a symbol onto the stack.
The transition labels are always of the form $$a, x \rightarrow y$$ where $a$ is the input symbol read, $x$ is the stack symbol popped, and $y$ is the stack symbol pushed.
Here's an example (2.14):
m1 = read_csv("pda-m1.csv")
m1
The $\varepsilon, \varepsilon \rightarrow \texttt{\$}$ transition initializes the stack to $\texttt{\$}$. (The vast majority of PDAs that you write will start with this step.) The $0,\varepsilon \rightarrow 0$ transition means "read a 0
and push a 0
." The $1,0\rightarrow \varepsilon$ transitions mean "read a 1
and pop a 0
". Finally, the $\varepsilon, \texttt{\$} \rightarrow \varepsilon$ transition checks that the stack is empty (save for the bottom marker $
) at the end of the string.
Note that we didn't have to pop the $
at the end; we could have left it on there or replaced it with anything, and the machine would still work the same. (There are alternative definitions of PDAs that do require that the stack be empty at the end; we never use this definition, but be careful if you consult other sources.)
Let's look at how it works on an example string:
run(m1, "0 0 0 1 1 1")
This graph represents all possible runs of the PDA on the string 000111
. Each node shows both the state and the stack at each time step. The stacks are shown with the top on the left. The square brackets are an artifact of how Tock words and always mark the top of the stack. Note that when the stack gets deeper, ellipses (…) are used to keep the node labels compact.
This happens to be a deterministic PDA: at every time step, there's only one possibility for the next step.
More examples from the book:
m2 = read_csv("pda-m2.csv")
m2
Question. Does the above PDA accept the strings $\mathtt{aabb}$? $\mathtt{aabc}$? $\mathtt{aacc}$?
m3 = read_csv("pda-m3.csv")
m3
run(m3, "0 1 0 0 1 0")
Question. Design a PDA that recognizes the language of matching left and right parentheses (like Example 2.3).
Question. Design a PDA that recognizes the language over $\Sigma = \{\mathtt{a}, \mathtt{u}, \mathtt{c}, \mathtt{g}\}$ such that every symbol is paired with exactly one other symbol -- $\mathtt{a}$ with $\mathtt{u}$ and $\mathtt{c}$ with $\mathtt{g}$, and the pairings are nested like parentheses in the previous question.
Question. Do you think a queue automaton would be more or less powerful than a pushdown (stack) automaton?