The idea of this section is to analyze a sentence using the Stanford Parser (both for constituency and dependency parsing) and display the results
import pln_inco.syntax_trees
import pln_inco.graphviz as gv
from IPython.display import *
import nltk.tree
import pln_inco.stanford_parser
I've prepared a small method to call the Stanford Parser (it assumes you have the parser installed and the CLASSPATH set, so you can call the parser from any directory). So far it uses the LexicalizedParser, trained with an English probabilistic context-free grammar. Let's call it on a sentence and display the resulting constituent tree.
my_sentence='Bills on ports and immigration were submitted by Senator Brownback.'
# You can change this sentence, if you don't like it!
pr=pln_inco.stanford_parser.lexicalized_parser_parse([my_sentence],model='englishPCFG')
print pr
for p in pr:
t=nltk.tree.Tree.fromstring(p)
tree_dot=pln_inco.syntax_trees.tree_to_dot(t)
tree_png=Image(data=gv.generate(tree_dot,format='png'))
display_png(tree_png)
['(ROOT\n (S\n (NP\n (NP (NNS Bills))\n (PP (IN on)\n (NP (NNS ports)\n (CC and)\n (NN immigration))))\n (VP (VBD were)\n (VP (VBN submitted)\n (PP (IN by)\n (NP (NNP Senator) (NNP Brownback.)))))))']
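As an aside, once a bracketed parse has been loaded with nltk.tree.Tree.fromstring, the words and their preterminal (POS) labels can be read directly off the tree. A minimal sketch, using a fragment of the parse above:

```python
import nltk.tree

# A fragment of the bracketed parse returned by the Stanford Parser
parse = '(ROOT (S (NP (NNS Bills)) (VP (VBD were) (VP (VBN submitted)))))'
t = nltk.tree.Tree.fromstring(parse)

print(t.leaves())  # the words of the sentence
print(t.pos())     # (word, POS tag) pairs, read off the preterminals
```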
The Stanford parser can also return typed dependencies, generated from the constituent tree (an introduction can be found in this paper, further explained here). I've added a new argument to the method that allows specifying the output format (the original, default value is 'penn'). Using 'basicDependencies' as the option, it returns the basic dependencies, which we will later incorporate into NLTK (this is possible because each word has exactly one head).
pr=pln_inco.stanford_parser.lexicalized_parser_parse(
[my_sentence],
model='englishPCFG',output='basicDependencies')
for p in pr:
print p,'\n'
nsubjpass(submitted-7, Bills-1)
prep(Bills-1, on-2)
pobj(on-2, ports-3)
cc(ports-3, and-4)
conj(ports-3, immigration-5)
auxpass(submitted-7, were-6)
root(ROOT-0, submitted-7)
prep(submitted-7, by-8)
nn(Brownback.-10, Senator-9)
pobj(by-8, Brownback.-10)
The only format that the dependency grammar module from NLTK accepts is Malt-TAB, the format of the MaltParser, equivalent to the CoNLL format. I will try to create a method to convert from the Stanford representation to the Malt-TAB representation. I need two pieces of information I do not yet have: the POS tags, and the missing words (punctuation symbols). I will call the parser again, just to obtain the POS tags and tokens (I should probably use the Stanford POS Tagger instead, but it is easier to call the parser).
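To make the target format concrete, here is a minimal hand-written Malt-TAB fragment loaded into NLTK (the module path and the 'ROOT' relation label are as in NLTK 3; older versions may differ). Each line holds the word, its POS tag, its head index (1-based, 0 for the root) and the relation, separated by tabs:

```python
from nltk.parse.dependencygraph import DependencyGraph

# A tiny hand-written Malt-TAB fragment: word, POS, head index, relation
malt_tab = 'John\tNNP\t2\tnsubj\nsaw\tVBD\t0\tROOT\nMary\tNNP\t2\tdobj\n'
dg = DependencyGraph(malt_tab)
print(dg.tree())  # (saw John Mary)
```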
import nltk.tag
tagged_sentences=pln_inco.stanford_parser.lexicalized_parser_tag(
[my_sentence],model='englishPCFG')
words_and_pos=[nltk.tag.str2tuple(t) for t in tagged_sentences[0].split()]
print words_and_pos
[('Bills', 'NNS'), ('on', 'IN'), ('ports', 'NNS'), ('and', 'CC'), ('immigration', 'NN'), ('were', 'VBD'), ('submitted', 'VBN'), ('by', 'IN'), ('Senator', 'NNP'), ('Brownback', 'NNP'), ('.', '.')]
Now I can build a Malt-TAB specification from the dependencies generated by the parser, merged with the words and POS tags emerging from the POS-tagging step (I am sure they are tokenized identically because I have used the same tool).
malt_tab_rep=pln_inco.syntax_trees.stanford_dependency_to_malt_tab(pr[0],words_and_pos)
print malt_tab_rep
['Bills', 'on', 'ports', 'and', 'immigration', 'were', 'submitted', 'by', 'Senator', 'Brownback', '.']
['NNS', 'IN', 'NNS', 'CC', 'NN', 'VBD', 'VBN', 'IN', 'NNP', 'NNP', '.']
Bills	NNS	7	nsubjpass
on	IN	1	prep
ports	NNS	2	pobj
and	CC	3	cc
immigration	NN	3	conj
were	VBD	7	auxpass
submitted	VBN	0	root
by	IN	7	prep
Senator	NNP	10	nn
Brownback	NNP	8	pobj
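For reference, a hypothetical sketch of how such a conversion might be implemented (the actual pln_inco.syntax_trees.stanford_dependency_to_malt_tab may well differ): a regular expression pulls the head and dependent indices out of each `rel(head-i, dep-j)` triple, and the words absent from the dependencies (the punctuation symbols) are hung from the root word with a 'punct' relation:

```python
import re

def stanford_deps_to_malt_tab(dep_str, words_and_pos):
    """Hypothetical sketch of a Stanford-to-Malt-TAB conversion.
    dep_str holds triples like 'nsubjpass(submitted-7, Bills-1)';
    words_and_pos is a list of (word, POS) pairs, 1-indexed by position."""
    heads = {}  # dependent index -> (head index, relation)
    for rel, head, dep in re.findall(r'(\w+)\(\S+-(\d+), \S+-(\d+)\)', dep_str):
        heads[int(dep)] = (int(head), rel)
    # words absent from the dependencies (punctuation) hang from the root word
    root = [d for d, (h, _) in heads.items() if h == 0][0]
    lines = []
    for i, (word, pos) in enumerate(words_and_pos, start=1):
        head, rel = heads.get(i, (root, 'punct'))
        lines.append('%s\t%s\t%d\t%s' % (word, pos, head, rel))
    return '\n'.join(lines)
```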
We can now create an NLTK representation, as we did in the previous section:
import nltk.parse.dependencygraph
dg = nltk.parse.dependencygraph.DependencyGraph(malt_tab_rep)
print dg.tree().pprint()
(submitted (Bills (on (ports and immigration))) were (by (Brownback Senator)))
And, of course, display it!
dep_tree=pln_inco.syntax_trees.dependency_to_dot(dg)
tree_png=Image(data=gv.generate(dep_tree,format='png'))
display_png(tree_png)
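For completeness, here is a minimal sketch of what a dependency-to-DOT conversion along the lines of pln_inco.syntax_trees.dependency_to_dot might look like (hypothetical; it uses the NLTK 3 DependencyGraph.nodes API): one Graphviz node per word, and one labelled edge per head-to-dependent relation.

```python
from nltk.parse.dependencygraph import DependencyGraph

def dependency_graph_to_dot(dg):
    """Hypothetical sketch: render a DependencyGraph as Graphviz DOT,
    one node per word and one labelled edge per head -> dependent relation."""
    lines = ['digraph dependencies {']
    for node in sorted(dg.nodes.values(), key=lambda n: n['address']):
        if node['word'] is None:  # skip the artificial TOP/ROOT node
            continue
        lines.append('  n%d [label="%s"];' % (node['address'], node['word']))
        if node['head']:  # head 0 is the artificial root: no incoming edge
            lines.append('  n%d -> n%d [label="%s"];'
                         % (node['head'], node['address'], node['rel']))
    lines.append('}')
    return '\n'.join(lines)

# Hypothetical usage on a tiny hand-written Malt-TAB fragment
dg = DependencyGraph('John\tNNP\t2\tnsubj\nsaw\tVBD\t0\tROOT\nMary\tNNP\t2\tdobj\n')
print(dependency_graph_to_dot(dg))
```

The resulting DOT string can then be fed to Graphviz (as gv.generate does above) to obtain a PNG.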