We show the embedding of nodes annotated as sentences, clauses, phrases, subphrases and words. Eventually we put them in a format that can be read by TGREP. Rens Bod and Andreas van Cranenburgh can then do interesting things with it.
The method of tree construction has improved significantly since the writing of this notebook. See the notebook trees_etcbc4.
This notebook (trees) is preserved because a DOP parser by Andreas van Cranenburgh has been based on its output.
We walk through all words and follow them upwards, along parents edges until there are no more outgoing edges. We then have the starting points for our sentences.
We walk through the starting points, and for each starting point we assemble the tree hanging off that point. This we do by walking the parents edges in the opposite direction.
We use the monad numbers (word numbers) to maintain word order.
More details will follow below, when we deal with them.
import sys
import collections
import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()
0.00s This is LAF-Fabric 4.2.8 http://laf-fabric.readthedocs.org/texts/API-reference.html
We use the features in our data source for establishing the trees. This is what we need:
otype: The type of a node: word, phrase, sentence, etc.
parents: The edge feature by which we can identify the edges that correspond to the parents relationship. These edges go from lower level to higher level (word => ... => sentence).
N.B. There are two linguistic hierarchies interwoven in this database. Sometimes nodes have more than one parent!
text_plain: The unvocalized text of a word.
part_of_speech: The part of speech of a word.
verse_label: Passage information: book, chapter, verse all in one feature.
fabric.load('bhs3', '--', 'trees', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads
        text_plain
        part_of_speech
        clause_constituent_relation phrase_type
        verse_label
    ''','''
        parents.
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ...
0.00s INFO: DATA COMPILED AT: 2014-04-18T18-24-37
5.06s LOGFILE=/Users/dirk/laf-fabric-data/bhs3/tasks/trees/__log__trees.txt
5.76s INFO: DATA LOADED FROM SOURCE bhs3 AND ANNOX -- FOR TASK trees
Here we define the formatting of the trees.
Not all nodes will be shown in the output.
The nodes that are shown, have abbreviated names.
Nodes with True will be shown, nodes with False will be suppressed.
Suppressing a node leaves its children in place. Another way of looking at it: we replace a node by its children.
Exception: when a node is visited twice, the second visit refers to the tree built by the first visit. In that case, we do not suppress the node.
N.B. It turns out that the -atom nodes are never visited twice.
We abbreviate the part-of-speech tags. We include the pos-info by inserting a unary node right above each word.
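The suppression rule and the unary part-of-speech nodes can be illustrated on a toy tree. This is only a sketch and not the code of this notebook: the nested-tuple representation, the `SHOWN` table and the function `bracket` are assumptions made for the illustration, and the toy hierarchy is not taken from the data.

```python
# Hypothetical toy illustration; names and data are not from the notebook.
# Which tags are shown (True) and which are suppressed (False).
SHOWN = {"S": True, "Sa": False, "C": True, "Ca": False, "P": True, "Pa": False}

def bracket(node):
    """Serialize a (tag, children) tree; a leaf is (pos, word_index)."""
    tag, body = node
    if not isinstance(body, list):
        # Leaf: a unary part-of-speech node right above the word number.
        return "({} {})".format(tag, body)
    inner = "".join(bracket(child) for child in body)
    # A suppressed node is replaced by its (serialized) children.
    return "({}{})".format(tag, inner) if SHOWN.get(tag, True) else inner

toy = ("S", [("Sa", [("C", [("Ca", [
    ("P", [("Pa", [("vb", 0)])]),
    ("P", [("Pa", [("n", 1)])]),
])])])])

result = bracket(toy)
print(result)  # (S(C(P(vb 0))(P(n 1))))
```

The atom nodes disappear from the output, while their children move up to the nearest shown ancestor.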
relevant_nodes = [
    ("word", '', True),
    ("subphrase", 'SU', True),
    ("phrase_atom", 'Pa', False),
    ("phrase", 'P', True),
    ("clause_atom", 'Ca', False),
    ("clause", 'C', True),
    ("sentence_atom", 'Sa', False),
    ("sentence", 'S', True),
]
pos_table = {
    'adjective': 'aj',
    'adverb': 'av',
    'article': 'dt',
    'conjunction': 'cj',
    'interjection': 'ij',
    'interrogative': 'ir',
    'negative': 'ng',
    'noun': 'n',
    'preposition': 'pp',
    'pronoun': 'pr',
    'verb': 'vb',
}
select_node = set()
select_tag = collections.defaultdict(lambda: None)
abbrev_node = collections.defaultdict(lambda: None)
for (otype, abb, relevant) in relevant_nodes:
    if relevant:
        select_node.add(abb)
    abbrev_node[otype] = abb if abb is not None else otype
Currently it is not clear to me how to use the mother edges for the syntax trees.
The phrases are a bit undifferentiated, and the clauses too.
We could add the clause_constituent_relation to the syntax trees, and also the phrase_type and phrase_function.
The clause_constituent_relation is explored in the notebook clause_constituent_relation.ipynb, and the phrase features in the notebook phrase_typology.
We walk through the words. From each word we walk along the parent edges until we cannot get further. All end points are top nodes. We put the end nodes in a set. We join all sets of end nodes that we have found above each word.
N.B. In this way we encounter the top nodes many times, but it does not matter, because we put them all in a set, without duplicates.
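The search for top nodes can be sketched on a toy graph in plain Python. This is a minimal illustration under assumptions: the dictionary `parents`, the node names and the function `end_nodes` are hypothetical, and the notebook itself relies on the `endnodes` method of LAF-Fabric rather than on this code.

```python
# Hypothetical toy illustration; names and data are not from the notebook.
def end_nodes(parents, start):
    """Follow parent edges upward from the start nodes and
    collect every node that has no outgoing parent edge."""
    tops = set()
    seen = set()
    stack = list(start)
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # reached before via another word: nothing new above it
        seen.add(node)
        ups = parents.get(node, [])
        if ups:
            stack.extend(ups)
        else:
            tops.add(node)  # a set, so repeated encounters leave no duplicates
    return tops

parents = {
    "w1": ["p1"], "w2": ["p1"], "w3": ["p2"],
    "p1": ["c1"], "p2": ["c1", "c2"],  # p2 has two parents
    "c1": ["s1"], "c2": ["s2"],
}
tops = end_nodes(parents, ["w1", "w2", "w3"])
print(sorted(tops))  # ['s1', 's2']
```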
msg("Looking for top nodes")
top_node_types = collections.defaultdict(lambda: 0)
top_nodes = set(C.parents_.endnodes(NN(test=F.otype.v, value='word')))
msg("Top nodes found: {}".format(len(top_nodes)))
8.43s Looking for top nodes
14s Top nodes found: 71354
Let us see what the types are of all the top nodes we have found.
We would like to see that they are all sentences.
And are all sentences top nodes?
top_node_types = collections.defaultdict(lambda: 0)
msg("Looking up tags for topnodes")
# Count the top nodes per (abbreviated) node type
for node in NN(nodes=top_nodes):
    tag = abbrev_node[F.otype.v(node)]
    top_node_types[tag] += 1
for (otype, tag, relevant) in relevant_nodes:
    if top_node_types[tag]:
        msg("{:<2} {} x at the top".format(tag, top_node_types[tag]))
17s Looking up tags for topnodes
17s S 71354 x at the top
msg("Non top nodes of type S ... ")
nt = 0
for node in NN():
    # A sentence with an outgoing parents edge is not a top node
    if F.otype.v(node) == "sentence" and C.parents_.e(node):
        msg("{} ".format(node), newline=False, withtime=False)
        nt += 1
        break  # stop at the first counterexample
msg("Non top nodes of type S found: {}".format(nt))
20s Non top nodes of type S ...
22s Non top nodes of type S found: 0
For each top node, we serialize its tree along the inverse parents edges. There are confluences, meaning that sometimes one node has two parents. We detect that, and the second time we reach a node, we output a token that refers to the tree we constructed in the first visit to that node.
We cannot write the string immediately, because after tree creation we want to renumber referenced trees and word occurrences.
We output the trees in the order in which their texts occur in the Hebrew Bible. The output is a text file, and every line corresponds to exactly one tree.
Every line has three tab-separated fields: the bracketed tree, the words of the sentence in text order, and the verse label.
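The renumbering of word occurrences mentioned above can be sketched in isolation. The tree walk may meet the words in a different order than the text order, so each word is first stored with its monad number and afterwards mapped to its position in the sorted sequence. The sample data below is made up for the illustration; only the tuple layout `(monad, text, pos)` follows the code in this notebook.

```python
# Hypothetical sample data: (monad, text, pos) in the order the tree walk met the words.
words = [(12, "c", "n"), (10, "a", "cj"), (11, "b", "vb")]

# Sort by monad, remembering each word's original position in the walk.
ordered = sorted(enumerate(words), key=lambda item: item[1][0])

# Map: position in the walk -> position in the text.
word_perm = {old: new for new, (old, _) in enumerate(ordered)}

# The sentence text, in text order.
sentence = " ".join(text for _, (_, text, _) in ordered)

print(word_perm)  # {1: 0, 2: 1, 0: 2}
print(sentence)   # a b c
```

The tree then carries the new numbers, so that number n in the tree refers to the n-th word of the second field.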
trees = outfile("trees-simple.txt")
nodes_seen = set()
words = []
sequential = []

def write_tree(node):
    # Visit every node at most once per tree
    if node in nodes_seen:
        return
    nodes_seen.add(node)
    otype = F.otype.v(node)
    tag = abbrev_node[otype]
    relevant = tag in select_node
    # Refine clause and phrase tags with more specific features
    if tag == 'C':
        crr = F.clause_constituent_relation.v(node)
        if crr != 'none':
            tag = crr
    elif tag == 'P':
        tag = F.phrase_type.v(node)
    is_word = otype == 'word'
    if is_word:
        text = F.text_plain.v(node)
        pos = pos_table[F.part_of_speech.v(node)]
        monad = int(F.monads.v(node))
        sequential.append(("W", len(words)))
        words.append((monad, text, pos))
    else:
        # "O" opens a shown node, "N" marks a suppressed one
        sequential.append(("O" if relevant else "N", tag))
    for child in Ci.parents_.v(node, sort=True):
        write_tree(child)
    if not is_word:
        # "C" closes a shown node
        sequential.append(("C" if relevant else "N", tag))

def do_sequential():
    # Renumber the words of this tree by monad, so that they follow text order
    word_perm = {}
    new_words = sorted(enumerate(words), key=lambda x: x[1][0])
    word_reps = []
    for (nn, (on, (monad, text, pos))) in enumerate(new_words):
        word_perm[on] = nn
        word_reps.append(text)
    word_rep = ' '.join(word_reps)
    # Write the bracketed tree, then the words as second field
    for (code, info) in sequential:
        if code == 'O' or code == 'C':
            if code == 'O':
                trees.write('({}'.format(info))
            else:
                trees.write(')')
        elif code == 'W':
            nn = word_perm[info]
            pos = words[info][2]
            trees.write('({} {})'.format(pos, nn))
    trees.write("\t{}".format(word_rep))

msg("Writing trees ...")
verse_label = ''
s = 0
chunk = 10000
sc = 0
msg("making nodeset")
msg("processing topnodes")
for node in NN(nodes=top_nodes | set(NN(test=F.otype.v, value='verse'))):
    otype = F.otype.v(node)
    if otype == 'verse':
        # Remember the current verse; it labels the trees that follow
        verse_label = F.verse_label.v(node)
        continue
    nodes_seen = set()
    sequential = []
    words = []
    write_tree(node)
    do_sequential()
    trees.write("\t{}\n".format(verse_label))
    s += 1
    sc += 1
    if sc == chunk:
        msg("{} trees written".format(s))
        sc = 0
msg("{} trees written".format(s))
1m 13s Writing trees ...
1m 13s making nodeset
1m 13s processing topnodes
1m 17s 10000 trees written
1m 21s 20000 trees written
1m 24s 30000 trees written
1m 27s 40000 trees written
1m 30s 50000 trees written
1m 32s 60000 trees written
1m 34s 70000 trees written
1m 35s 71354 trees written
close()
1m 38s Results directory: /Users/dirk/laf-fabric-data/bhs3/tasks/trees
.DS_Store 6148 Wed Apr 30 16:30:02 2014
.trees2014-04-30.txt.swp 110592 Tue May 27 15:52:45 2014
__log__trees.txt 686 Tue May 27 15:54:22 2014
coor.txt 70830 Wed Apr 30 20:57:45 2014
depths.txt 714284 Wed Apr 30 20:57:44 2014
objects-deut-new.txt 966 Fri Apr 25 09:53:16 2014
objects-deut-old.txt 966 Fri Apr 25 09:51:15 2014
objects.txt 15416 Fri Apr 25 09:47:18 2014
tgrep_result.txt 4916101 Wed Apr 30 20:57:37 2014
tree_notabene-2014-04-30.txt 111037 Wed Apr 30 20:51:21 2014
tree_notabene.txt 114212 Tue May 27 12:27:23 2014
trees-nocoor.txt 13914 Mon Apr 28 09:59:30 2014
trees-nosisters.txt 13928 Mon Apr 28 09:59:30 2014
trees-notransform.txt 501 Mon Apr 28 09:47:29 2014
trees-simple.txt 8310229 Tue May 27 15:54:22 2014
trees-transformed.txt 371 Mon Apr 28 09:47:43 2014
trees.t2c 12361153 Wed Apr 30 20:57:22 2014
trees.txt 10621788 Tue May 27 12:28:15 2014
trees2014-04-30.txt 10626155 Wed Apr 30 20:56:26 2014
trees_fixed_20.txt 150015 Tue May 27 12:27:23 2014
trees_random_20-2014-04-30.txt 150015 Wed Apr 30 20:51:21 2014
trees_random_20.txt 108048 Tue May 27 12:27:23 2014
Here are the first lines of the output.
!head -n 25 {my_file('trees-simple.txt')}
(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(SU(pp 4)(dt 5)(n 6))(cj 7)(SU(pp 8)(dt 9)(n 10)))))	ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ	GEN 01,01
(S(C(CP(cj 0))(NP(dt 1)(n 2))(VP(vb 3))(NP(SU(n 4))(cj 5)(SU(n 6)))))	ו ה ארץ היתה תהו ו בהו	GEN 01,02
(S(C(CP(cj 0))(NP(n 1))(PP(pp 2)(SU(n 3))(SU(n 4)))))	ו חשׁך על פני תהום	GEN 01,02
(S(C(CP(cj 0))(NP(SU(n 1))(SU(n 2)))(VP(vb 3))(PP(pp 4)(SU(n 5))(SU(dt 6)(n 7)))))	ו רוח אלהים מרחפת על פני ה מים	GEN 01,02
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	GEN 01,03
(S(C(VP(vb 0))(NP(n 1))))	יהי אור	GEN 01,03
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי אור	GEN 01,03
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5)))(Objc(CP(cj 6))(VP(vb 7))))	ו ירא אלהים את ה אור כי טוב	GEN 01,04
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(SU(n 3)(dt 4)(n 5))(cj 6)(SU(n 7)(dt 8)(n 9)))))	ו יבדל אלהים בין ה אור ו בין ה חשׁך	GEN 01,04
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))(NP(n 6))))	ו יקרא אלהים ל אור יום	GEN 01,05
(S(C(CP(cj 0))(PP(pp 1)(dt 2)(n 3))(VP(vb 4))(NP(n 5))))	ו ל חשׁך קרא לילה	GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי ערב	GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי בקר	GEN 01,05
(S(C(NP(SU(n 0))(SU(n 1)))))	יום אחד	GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	GEN 01,06
(S(C(VP(vb 0))(NP(n 1))(PP(pp 2)(SU(n 3))(SU(dt 4)(n 5)))))	יהי רקיע ב תוך ה מים	GEN 01,06
(S(C(CP(cj 0))(VP(vb 1))(VP(vb 2))(PP(n 3)(n 4)(pp 5)(n 6))))	ו יהי מבדיל בין מים ל מים	GEN 01,06
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))))	ו יעשׂ אלהים את ה רקיע	GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(PP(n 2)(dt 3)(n 4))(CP(cj 11))(PP(n 12)(dt 13)(n 14)))(Attr(CP(cj 5))(PP(pp 6)(n 7)(pp 8)(dt 9)(n 10)))(Attr(CP(cj 15))(PP(pp 16)(pp 17)(pp 18)(dt 19)(n 20))))	ו יבדל בין ה מים אשׁר מ תחת ל רקיע ו בין ה מים אשׁר מ על ל רקיע	GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(AdvP(av 2))))	ו יהי כן	GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))(NP(n 6))))	ו יקרא אלהים ל רקיע שׁמים	GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי ערב	GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי בקר	GEN 01,08
(S(C(NP(SU(n 0))(SU(aj 1)))))	יום שׁני	GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	GEN 01,09
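A line of this output can be read back with a few lines of Python. The sketch below is not part of the notebook: the function `parse_line` is hypothetical, and it assumes that the three fields are separated by tabs, as the writing code above does with "\t". It pairs every part-of-speech leaf with the word that its number refers to.

```python
import re

def parse_line(line):
    """Split one output line into its fields and resolve the word numbers.
    Hypothetical helper, assuming tab-separated fields."""
    tree, text, label = line.rstrip("\n").split("\t")
    tokens = text.split(" ")
    # A leaf looks like "(pos number)"; inner nodes have no space after the tag.
    leaves = re.findall(r"\((\w+) (\d+)\)", tree)
    return label, [(pos, tokens[int(number)]) for pos, number in leaves]

line = "(S(C(VP(vb 0))(NP(n 1))))\tיהי אור\tGEN 01,03\n"
label, tagged = parse_line(line)
print(label)
for pos, word in tagged:
    print(pos, word)
```

The leaves come out in tree order, which can differ from text order when a clause is interrupted by an embedded clause, as in the line for GEN 01,07 with two Attr clauses.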