In this notebook, I'll play with NLTK syntax trees. Actually, they are general trees, and they allow an interesting set of operations. The complete reference can be found here
Each Tree represents a single hierarchical grouping of leaves and subtrees. For example, each constituent in a syntax tree is represented by a single Tree. Let's load into nltk some trees, from the WSJ corpus.
import nltk
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print t
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .))
A tree's children are encoded as a list of leaves and subtrees, where a leaf is a basic (non-tree) value; and a subtree is a nested Tree. For syntax trees, the node property of the list holds the phrase tag of the constituent, and words are at the tree's leaves.
print t.label()
S
Trees are lists, and the list elements are the children of the root node. They can be Trees o basic values (leafs). Let's show the first child:
print t[0]
(NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
.... and now the second and third
print t[1]
(VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29))))
print t[2]
(. .)
We can this way access arbitrarly deep nodes, just indexing lists:
print t[0][0]
print t[0][0][1]
print t[0][0][1][0]
(NP (NNP Pierre) (NNP Vinken)) (NNP Vinken) Vinken
From the NLTK book: Several Tree methods use tree positions to specify children or descendants of a tree. Tree positions are defined as follows:
I.e., every tree position is either a single index i, specifying self[i]; or a sequence (i1,i2,…,iN), specifying self[i1][i2]...[iN].
print t.treepositions(order='preorder')
[(), (0,), (0, 0), (0, 0, 0), (0, 0, 0, 0), (0, 0, 1), (0, 0, 1, 0), (0, 1), (0, 1, 0), (0, 2), (0, 2, 0), (0, 2, 0, 0), (0, 2, 0, 0, 0), (0, 2, 0, 1), (0, 2, 0, 1, 0), (0, 2, 1), (0, 2, 1, 0), (0, 3), (0, 3, 0), (1,), (1, 0), (1, 0, 0), (1, 1), (1, 1, 0), (1, 1, 0, 0), (1, 1, 1), (1, 1, 1, 0), (1, 1, 1, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1, 0), (1, 1, 2), (1, 1, 2, 0), (1, 1, 2, 0, 0), (1, 1, 2, 1), (1, 1, 2, 1, 0), (1, 1, 2, 1, 0, 0), (1, 1, 2, 1, 1), (1, 1, 2, 1, 1, 0), (1, 1, 2, 1, 2), (1, 1, 2, 1, 2, 0), (1, 1, 3), (1, 1, 3, 0), (1, 1, 3, 0, 0), (1, 1, 3, 1), (1, 1, 3, 1, 0), (2,), (2, 0)]
Let's see some methods.
print t.leaves()
[u'Pierre', u'Vinken', u',', u'61', u'years', u'old', u',', u'will', u'join', u'the', u'board', u'as', u'a', u'nonexecutive', u'director', u'Nov.', u'29', u'.']
print t.flatten() # A tree with only the root and the leaves directly connected to it
(S Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .)
We can generate all the tree's subtrees;
for subtree in t.subtrees(): # Generate all subtrees
print subtree
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .)) (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (NP (NNP Pierre) (NNP Vinken)) (NNP Pierre) (NNP Vinken) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (NP (CD 61) (NNS years)) (CD 61) (NNS years) (JJ old) (, ,) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29))) (VB join) (NP (DT the) (NN board)) (DT the) (NN board) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (IN as) (NP (DT a) (JJ nonexecutive) (NN director)) (DT a) (JJ nonexecutive) (NN director) (NP-TMP (NNP Nov.) (CD 29)) (NNP Nov.) (CD 29) (. .)
And (this is awesome!) we can filter according to certain criteria. For example, let's filter NP constituents
def filt(x):
return x.label()=='NP'
for subtree in t.subtrees(filter = filt): # Generate all subtrees
print subtree
(NP (NNP Pierre) (NNP Vinken)) (NP (CD 61) (NNS years)) (NP (DT the) (NN board)) (NP (DT a) (JJ nonexecutive) (NN director))
Or just use a lambda function, to filter singular nouns:
for subtree in t.subtrees(filter = lambda st: st.label()=='NN'):
print subtree
(NN board) (NN director)
We can show POS (they are simply preterminals in the tree)
t.pos()
[(u'Pierre', u'NNP'), (u'Vinken', u'NNP'), (u',', u','), (u'61', u'CD'), (u'years', u'NNS'), (u'old', u'JJ'), (u',', u','), (u'will', u'MD'), (u'join', u'VB'), (u'the', u'DT'), (u'board', u'NN'), (u'as', u'IN'), (u'a', u'DT'), (u'nonexecutive', u'JJ'), (u'director', u'NN'), (u'Nov.', u'NNP'), (u'29', u'CD'), (u'.', u'.')]
Before continuing, let's try to draw the tree. I will use two methods I have developed: pln_inco.syntax_trees.tree_to_dot to convert NLTK tree rep to a graphviz spec, and pln_inco.graphviz.generate to actually call dot and generate the image.
import pln_inco.syntax_trees
import pln_inco.graphviz as gv
from IPython.display import Image
from IPython.display import display
from IPython.display import display_png
tree_dot=pln_inco.syntax_trees.tree_to_dot(t)
tree_png=Image(data=gv.generate(tree_dot,format='png'),width=1000)
display_png(tree_png)
Let's look at treepositions again, since they are interesting. Let's find the tree position of "board", the 11th word (leave) in the sentence (tree)
print t.leaf_treeposition(10) #Note that leaves are numbered starting from 0!
(1, 1, 1, 1, 0)
We should read this treeposition as: Start from the root, select the second child for times, and select the first child. We will come back later with treepositions.
There another, interesting, function: suppose we want to find the node the covers "join the board as a nonexecutive director". We can use:
print t.treeposition_spanning_leaves(8,14)
(1, 1)
That is the t[1][1] tree (observe that the included span is larger than ours, because our span did not matched a complete constituent). The definiion of the function says: the tree position of the lowest descendant of this tree that dominates self.leaves()[start:end].
print t[1][1]
(VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))