1. NLTK Syntax Trees

In this notebook, I'll play with NLTK syntax trees. Actually, they are general trees, and they allow an interesting set of operations. The complete reference can be found here

Each Tree represents a single hierarchical grouping of leaves and subtrees. For example, each constituent in a syntax tree is represented by a single Tree. Let's load into nltk some trees, from the WSJ corpus.

In [1]:
import nltk
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print t
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))

A tree's children are encoded as a list of leaves and subtrees, where a leaf is a basic (non-tree) value; and a subtree is a nested Tree. For syntax trees, the node property of the list holds the phrase tag of the constituent, and words are at the tree's leaves.

In [2]:
print t.label()
S

Trees are lists, and the list elements are the children of the root node. They can be Trees o basic values (leafs). Let's show the first child:

In [3]:
print t[0]
(NP-SBJ
  (NP (NNP Pierre) (NNP Vinken))
  (, ,)
  (ADJP (NP (CD 61) (NNS years)) (JJ old))
  (, ,))

.... and now the second and third

In [4]:
print t[1]
(VP
  (MD will)
  (VP
    (VB join)
    (NP (DT the) (NN board))
    (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
    (NP-TMP (NNP Nov.) (CD 29))))
In [5]:
print t[2]
(. .)

We can this way access arbitrarly deep nodes, just indexing lists:

In [6]:
print t[0][0]
print t[0][0][1]
print t[0][0][1][0]
(NP (NNP Pierre) (NNP Vinken))
(NNP Vinken)
Vinken

From the NLTK book: Several Tree methods use tree positions to specify children or descendants of a tree. Tree positions are defined as follows:

  • The tree position $i$ specifies a $Tree$'s $i$th child.
  • The tree position $()$ specifies the $Tree$ itself.
  • If $p$ is the tree position of descendant $d$, then $p+(i)$ specifies the $i$th child of $d$.

I.e., every tree position is either a single index $i$, specifying $self[i]$; or a sequence ($i_1, i_2, \ldots, i_N$), specifying $self[i_1][i_2]...[i_N]$.

In [7]:
print t.treepositions(order='preorder')
[(), (0,), (0, 0), (0, 0, 0), (0, 0, 0, 0), (0, 0, 1), (0, 0, 1, 0), (0, 1), (0, 1, 0), (0, 2), (0, 2, 0), (0, 2, 0, 0), (0, 2, 0, 0, 0), (0, 2, 0, 1), (0, 2, 0, 1, 0), (0, 2, 1), (0, 2, 1, 0), (0, 3), (0, 3, 0), (1,), (1, 0), (1, 0, 0), (1, 1), (1, 1, 0), (1, 1, 0, 0), (1, 1, 1), (1, 1, 1, 0), (1, 1, 1, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1, 0), (1, 1, 2), (1, 1, 2, 0), (1, 1, 2, 0, 0), (1, 1, 2, 1), (1, 1, 2, 1, 0), (1, 1, 2, 1, 0, 0), (1, 1, 2, 1, 1), (1, 1, 2, 1, 1, 0), (1, 1, 2, 1, 2), (1, 1, 2, 1, 2, 0), (1, 1, 3), (1, 1, 3, 0), (1, 1, 3, 0, 0), (1, 1, 3, 1), (1, 1, 3, 1, 0), (2,), (2, 0)]

Let's see some methods.

In [8]:
print t.leaves()
[u'Pierre', u'Vinken', u',', u'61', u'years', u'old', u',', u'will', u'join', u'the', u'board', u'as', u'a', u'nonexecutive', u'director', u'Nov.', u'29', u'.']
In [9]:
print t.flatten() # A tree with only the root and the leaves directly connected to it
(S
  Pierre
  Vinken
  ,
  61
  years
  old
  ,
  will
  join
  the
  board
  as
  a
  nonexecutive
  director
  Nov.
  29
  .)

We can generate all the tree's subtrees;

In [10]:
for subtree in t.subtrees(): # Generate all subtrees
    print subtree
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
(NP-SBJ
  (NP (NNP Pierre) (NNP Vinken))
  (, ,)
  (ADJP (NP (CD 61) (NNS years)) (JJ old))
  (, ,))
(NP (NNP Pierre) (NNP Vinken))
(NNP Pierre)
(NNP Vinken)
(, ,)
(ADJP (NP (CD 61) (NNS years)) (JJ old))
(NP (CD 61) (NNS years))
(CD 61)
(NNS years)
(JJ old)
(, ,)
(VP
  (MD will)
  (VP
    (VB join)
    (NP (DT the) (NN board))
    (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
    (NP-TMP (NNP Nov.) (CD 29))))
(MD will)
(VP
  (VB join)
  (NP (DT the) (NN board))
  (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
  (NP-TMP (NNP Nov.) (CD 29)))
(VB join)
(NP (DT the) (NN board))
(DT the)
(NN board)
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
(IN as)
(NP (DT a) (JJ nonexecutive) (NN director))
(DT a)
(JJ nonexecutive)
(NN director)
(NP-TMP (NNP Nov.) (CD 29))
(NNP Nov.)
(CD 29)
(. .)

And (this is awesome!) we can filter according to certain criteria. For example, let's filter NP constituents

In [11]:
def filt(x):
    return x.label()=='NP'

for subtree in t.subtrees(filter =  filt): # Generate all subtrees
    print subtree
(NP (NNP Pierre) (NNP Vinken))
(NP (CD 61) (NNS years))
(NP (DT the) (NN board))
(NP (DT a) (JJ nonexecutive) (NN director))

Or just use a lambda function, to filter singular nouns:

In [12]:
for subtree in t.subtrees(filter = lambda st: st.label()=='NN'):
    print subtree
(NN board)
(NN director)

We can show POS (they are simply preterminals in the tree)

In [13]:
t.pos()
Out[13]:
[(u'Pierre', u'NNP'),
 (u'Vinken', u'NNP'),
 (u',', u','),
 (u'61', u'CD'),
 (u'years', u'NNS'),
 (u'old', u'JJ'),
 (u',', u','),
 (u'will', u'MD'),
 (u'join', u'VB'),
 (u'the', u'DT'),
 (u'board', u'NN'),
 (u'as', u'IN'),
 (u'a', u'DT'),
 (u'nonexecutive', u'JJ'),
 (u'director', u'NN'),
 (u'Nov.', u'NNP'),
 (u'29', u'CD'),
 (u'.', u'.')]

Before continuing, let's try to draw the tree. I will use two methods I have developed: pln_inco.syntax_trees.tree_to_dot to convert NLTK tree rep to a graphviz spec, and pln_inco.graphviz.generate to actually call dot and generate the image.

In [14]:
import pln_inco.syntax_trees
import pln_inco.graphviz as gv
from IPython.display import Image
from IPython.display import display
from IPython.display import display_png
In [15]:
tree_dot=pln_inco.syntax_trees.tree_to_dot(t)

tree_png=Image(data=gv.generate(tree_dot,format='png'),width=1000)
display_png(tree_png)

Let's look at treepositions again, since they are interesting. Let's find the tree position of "board", the 11th word (leave) in the sentence (tree)

In [16]:
print t.leaf_treeposition(10) #Note that leaves are numbered starting from 0!
(1, 1, 1, 1, 0)

We should read this treeposition as: Start from the root, select the second child for times, and select the first child. We will come back later with treepositions.

There another, interesting, function: suppose we want to find the node the covers "join the board as a nonexecutive director". We can use:

In [17]:
print t.treeposition_spanning_leaves(8,14)
(1, 1)

That is the t[1][1] tree (observe that the included span is larger than ours, because our span did not matched a complete constituent). The definiion of the function says: the tree position of the lowest descendant of this tree that dominates self.leaves()[start:end].

In [18]:
print t[1][1]
(VP
  (VB join)
  (NP (DT the) (NN board))
  (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
  (NP-TMP (NNP Nov.) (CD 29)))

References