Trees - the smooth path¶

We show how nodes annotated as sentences, clauses, phrases, subphrases and words are embedded in one another. Eventually we write them in a format that TGREP can read, so that Rens Bod and Andreas van Cranenburgh can do interesting business with them.

The method of tree construction has improved significantly since the writing of this notebook:

we use more information from the ETCBC database;
more sanity checks have been done;
there is an attempt to define the steps from ETCBC data to trees in a precise manner.

See the notebook trees_etcbc4.

This notebook (trees) is preserved because a DOP parser by Andreas van Cranenburgh has been based on its output.

Method¶

We walk through all words and follow them upwards, along parents edges until there are no more outgoing edges. We then have the starting points for our sentences.

We walk through the starting points, and for each starting point we assemble the tree hanging off that point. This we do by walking the parents edges in the opposite direction.

We use the monad numbers (word numbers) to maintain word order.

The method thus relies on two kinds of data:

parents links from words to phrases to clauses to sentences;
word numbers (monads).

More details will follow below, when we deal with them.
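The two passes described above can be sketched on a toy graph. This is only an illustration, not LAF-Fabric code: the `parents` dictionary, the node names and all helper functions are hypothetical, whereas the notebook takes its edges from the parents feature and its word numbers from monads.

```python
# Toy data (hypothetical): child -> list of parents, and word -> word number.
parents = {
    "w3": ["p2"], "w1": ["p1"], "w2": ["p1"],
    "p2": ["c1"], "p1": ["c1"],
    "c1": ["s1"],
}
monad = {"w1": 1, "w2": 2, "w3": 3}

def top_nodes(words, parents):
    """Follow parent edges upwards from every word; collect nodes without parents."""
    tops = set()
    for word in words:
        frontier = [word]
        while frontier:
            node = frontier.pop()
            ups = parents.get(node, [])
            if ups:
                frontier.extend(ups)
            else:
                tops.add(node)
    return tops

def invert(parents):
    """Turn child -> parents into parent -> children."""
    children = {}
    for child, ups in parents.items():
        for up in ups:
            children.setdefault(up, []).append(child)
    return children

def first_monad(node, children, monad):
    """Smallest word number below a node; used to keep text order."""
    if node in monad:
        return monad[node]
    return min(first_monad(c, children, monad) for c in children[node])

def build(node, children, monad):
    """Assemble the tree hanging off a node as nested (label, children) pairs."""
    kids = sorted(children.get(node, []),
                  key=lambda c: first_monad(c, children, monad))
    return (node, [build(k, children, monad) for k in kids])

children = invert(parents)
tops = top_nodes(monad, parents)
trees = {top: build(top, children, monad) for top in tops}
print(tops)
print(trees)
```

The toy edges are listed out of text order on purpose: sorting the children by their smallest word number is what restores the order of the text.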

Starting LAF-Fabric¶

In [1]:

import sys
import collections
import laf
from laf.fabric import LafFabric
from etcbc.preprocess import prepare
fabric = LafFabric()

  0.00s This is LAF-Fabric 4.2.8
http://laf-fabric.readthedocs.org/texts/API-reference.html

Declaring the features¶

We use the features in our data source for establishing the trees. This is what we need:

db.otype¶

The type of a node: word, phrase, sentence, etc.

parents¶

This is a feature by which we can identify the edges that correspond to the parents relationship. parents goes from lower level to higher level (word => ... => sentence).

N.B. There are two linguistic hierarchies interwoven in this database. Sometimes nodes have more than one parent!

ft.text_plain¶

The unvocalized text of a word.

ft.part_of_speech¶

The part of speech of a word.

sft.verse_label¶

Passage information: book, chapter, verse all in one feature.

In [2]:

fabric.load('bhs3', '--', 'trees', {
    "xmlids": {"node": False, "edge": False},
    "features": ('''
        otype monads
        text_plain
        part_of_speech
        clause_constituent_relation phrase_type
        verse_label
    ''','''
        parents.
    '''),
    "prepare": prepare,
}, verbose='NORMAL')
exec(fabric.localnames.format(var='fabric'))

  0.00s LOADING API: please wait ... 
  0.00s INFO: DATA COMPILED AT: 2014-04-18T18-24-37
  5.06s LOGFILE=/Users/dirk/laf-fabric-data/bhs3/tasks/trees/__log__trees.txt
  5.76s INFO: DATA LOADED FROM SOURCE bhs3 AND ANNOX -- FOR TASK trees

Configuration¶

Here we define the formatting of the trees.

Relevant nodes¶

Not all nodes will be shown in the output. The nodes that are shown have abbreviated names. In the table below, node types marked True are shown and node types marked False are suppressed.

Suppressing a node leaves its children in place. Put differently: we replace the node by its children.

Exception: when a node is visited twice, the second visit refers to the tree built by the first visit. In that case, we do not suppress the node.

N.B. It turns out that the -atom nodes are never visited twice.
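As an illustration of what "replacing a node by its children" means, here is a minimal sketch on a nested `(label, children)` structure. The function `suppress` and the toy tree are hypothetical, and the nesting of the atom nodes in the toy tree is an assumption made for the example; only the abbreviations `Sa`, `Ca` and `Pa` come from the table below.

```python
def suppress(tree, hidden):
    """Return a list of trees in which every node whose label is in `hidden`
    has been replaced by its (recursively processed) children."""
    label, kids = tree
    new_kids = []
    for kid in kids:
        new_kids.extend(suppress(kid, hidden))
    if label in hidden:
        # the node disappears, its children move up one level
        return new_kids
    return [(label, new_kids)]

# Hypothetical nesting, for illustration only.
tree = ("S", [("Sa", [("C", [("Ca", [("P", [("Pa", [("w", [])])])])])])])
result = suppress(tree, {"Sa", "Ca", "Pa"})
print(result)
```

The function returns a list because a suppressed node may hand more than one child up to its parent.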

pos_table¶

We abbreviate the part-of-speech tags. We include the pos-info by inserting a unary node right above each word.

In [3]:

relevant_nodes = [
    ("word", '', True),
    ("subphrase", 'SU', True),
    ("phrase_atom", 'Pa', False),
    ("phrase", 'P', True),
    ("clause_atom", 'Ca', False),
    ("clause", 'C', True),
    ("sentence_atom", 'Sa', False),
    ("sentence", 'S', True),
]

pos_table = {
 'adjective': 'aj',
 'adverb': 'av',
 'article': 'dt',
 'conjunction': 'cj',
 'interjection': 'ij',
 'interrogative': 'ir',
 'negative': 'ng',
 'noun': 'n',
 'preposition': 'pp',
 'pronoun': 'pr',
 'verb': 'vb',
}

select_node = set()
select_tag = collections.defaultdict(lambda: None)
abbrev_node = collections.defaultdict(lambda: None)

for (otype, abb, relevant) in relevant_nodes:
    if relevant:
        select_node.add(abb)
    abbrev_node[otype] = abb if abb is not None else otype

Exploration¶

Currently it is not clear to me how to use the mother edges for the syntax trees.

The phrases are a bit undifferentiated, and the clauses too.

We could add the clause_constituent_relation to the syntax trees, and also the phrase_type and phrase_function.

The clause_constituent_relation is explored in the notebook clause_constituent_relation.ipynb, and the phrase features in the notebook phrase_typology.

Find the top nodes¶

We walk through the words. From each word we walk along the parent edges until we cannot get further. All end points are top nodes. We put the end nodes in a set. We join all sets of end nodes that we have found above each word.

N.B. In this way we encounter the top nodes many times, but it does not matter, because we put them all in a set, without duplicates.

In [4]:

msg("Looking for top nodes")
top_node_types = collections.defaultdict(lambda: 0)
top_nodes = set(C.parents_.endnodes(NN(test=F.otype.v, value='word')))

msg("Top nodes found: {}".format(len(top_nodes)))

  8.43s Looking for top nodes
    14s Top nodes found: 71354

Checking¶

Let us see what the types are of all the top nodes we have found.

We would like to see that they are all sentences.

And are all sentences top nodes?

In [5]:

top_node_types = collections.defaultdict(lambda: 0)

msg("Looking up tags for topnodes")
for node in NN(nodes=top_nodes):
    tag = abbrev_node[F.otype.v(node)]
    top_node_types[tag] += 1

for (otype, tag, relevant) in relevant_nodes:
    if top_node_types[tag]:
        msg("{:<2} {} x at the top".format(tag, top_node_types[tag]))

    17s Looking up tags for topnodes
    17s S  71354 x at the top

In [6]:

msg("Non top nodes of type S ... ")
nt = 0
for node in NN():
    # a sentence that still has a parent would not be a top node
    if F.otype.v(node) == "sentence" and C.parents_.e(node):
        msg("{} ".format(node), newline=False, withtime=False)
        nt += 1
        # one counterexample is enough, so nt ends up as 0 or 1
        break
msg("Non top nodes of type S found: {}".format(nt))

    20s Non top nodes of type S ... 
    22s Non top nodes of type S found: 0

Serializing¶

For each top node, we serialize its tree along the inverse parents edges. There are confluences: sometimes one node has two parents. We detect that. The idea is that the second time we reach a node, we output a token that refers to the tree constructed on the first visit; in the code below, a node that has already been seen is simply skipped.

We cannot write the string immediately, because after tree creation we want to renumber referenced trees and word occurrences.
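The renumbering of word occurrences can be sketched as follows. The values in `words` are made up for illustration, and the variable names `by_monad` and `placeholder` are hypothetical; the idea mirrors the word permutation computed in the cell below.

```python
# Words as collected during the tree walk: (word number, text, part of speech).
# The values are made up; the walk order differs from the text order here.
words = [(12, "c", "n"), (10, "a", "cj"), (11, "b", "vb")]

# Positions in `words`, sorted by word number (monad).
by_monad = sorted(range(len(words)), key=lambda i: words[i][0])

# Map each walk position to its rank in text order: the placeholder in the tree.
placeholder = {old: new for new, old in enumerate(by_monad)}

# Leaves appear in walk order, but carry text-order placeholders.
leaves = ["({} {})".format(words[i][2], placeholder[i]) for i in range(len(words))]

# The word sequence itself is written in text order.
text = " ".join(words[i][1] for i in by_monad)
print("".join(leaves), text)
```

This is why the string cannot be written during the walk: a placeholder is only known once all words of the tree have been collected. The line for GEN 01,07 in the preview at the end shows such a case, where placeholders 11 to 14 precede 5 to 10 in the tree.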

Output format¶

We output the trees in the order in which their texts occur in the Hebrew Bible. The output is a text file, and every line corresponds to exactly one tree.

Every line has three tab-separated fields, in this order:

tree structure with placeholders for the words
word sequence (linking words to placeholders), in the order found in the text
passage label
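A reader of the output file might put the words back into the tree along these lines. This is a sketch, not part of the notebook: the function `fill_words` is hypothetical, and it assumes the field order in which the code below writes a line (tree, word sequence, passage label). The sample line is taken from the preview at the end of this notebook.

```python
import re

def fill_words(line):
    """Split one output line and substitute the words for their placeholders."""
    tree, words, passage = line.rstrip("\n").split("\t")
    # Split on single spaces: a word with empty text shows up as an empty token
    # and must keep its position.
    tokens = words.split(" ")
    filled = re.sub(
        r" (\d+)\)",
        lambda m: " {})".format(tokens[int(m.group(1))]),
        tree,
    )
    return passage.strip(), tokens, filled

line = ("(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))(NP(n 6))))"
        "\tו יקרא אלהים ל  אור יום\t GEN 01,05\n")
passage, tokens, filled = fill_words(line)
print(passage)
print(filled)
```

Splitting with `split(" ")` rather than `split()` matters here: the sample line contains two consecutive spaces, which correspond to a word whose text is empty, and `split()` would drop that token and shift all later placeholders.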

In [7]:

trees = outfile("trees-simple.txt")

nodes_seen = set()
words = []
sequential = []

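def write_tree(node):
    # Walk down from node, recording open/close events and words in `sequential`;
    # a node reached a second time (via another parent) is skipped.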
    if node in nodes_seen:
        return
    
    nodes_seen.add(node)

    otype = F.otype.v(node)
    tag = abbrev_node[otype]
    relevant = tag in select_node

    if tag == 'C':
        crr = F.clause_constituent_relation.v(node)
        if crr != 'none':
            tag = crr
    elif tag == 'P':
        tag = F.phrase_type.v(node)
    is_word = otype == 'word'
    if is_word:
        text = F.text_plain.v(node)
        pos = pos_table[F.part_of_speech.v(node)]
        monad = int(F.monads.v(node))
        sequential.append(("W", len(words)))
        words.append((monad, text, pos))
    else:
        sequential.append(("O" if relevant else "N", tag))
    
    for child in Ci.parents_.v(node, sort=True):
        write_tree(child)
    
    if not is_word:
        sequential.append(("C" if relevant else "N", tag))

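def do_sequential():
    # Renumber the words by monad so that placeholders follow text order,
    # then write the bracketed tree followed by the word sequence.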
    word_perm = {}
    new_words = sorted(enumerate(words), key=lambda x: x[1][0])
    word_reps = []
    for (nn, (on, (monad, text, pos))) in enumerate(new_words):
        word_perm[on] = nn
        word_reps.append(text)
    word_rep = ' '.join(word_reps)
                    
    for (code, info) in sequential:
        if code == 'O' or code == 'C':
            if code == 'O':
                trees.write('({}'.format(info))
            else:
                trees.write(')')
        elif code == 'W':
            nn = word_perm[info]
            pos = words[info][2]
            trees.write('({} {})'.format(pos, nn))
    
    trees.write("\t{}".format(word_rep))
    
msg("Writing trees ...")
verse_label = ''

s = 0
chunk = 10000
sc = 0

msg("making nodeset")
msg("processing topnodes")
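# Verse nodes are mixed in with the top nodes so that verse_label
# tracks the verse we are currently in.
for node in NN(nodes=top_nodes | set(NN(test=F.otype.v, value='verse'))):
    otype = F.otype.v(node)
    if otype == 'verse':
        verse_label = F.verse_label.v(node)
        continue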
    nodes_seen = set()
    sequential = []
    words = []
    write_tree(node)
    do_sequential()
    trees.write("\t{}\n".format(verse_label))
    s += 1
    sc += 1
    if sc == chunk:
        msg("{} trees written".format(s))
        sc = 0
    
msg("{} trees written".format(s))

 1m 13s Writing trees ...
 1m 13s making nodeset
 1m 13s processing topnodes
 1m 17s 10000 trees written
 1m 21s 20000 trees written
 1m 24s 30000 trees written
 1m 27s 40000 trees written
 1m 30s 50000 trees written
 1m 32s 60000 trees written
 1m 34s 70000 trees written
 1m 35s 71354 trees written

In [8]:

close()

 1m 38s Results directory:
/Users/dirk/laf-fabric-data/bhs3/tasks/trees

.DS_Store                              6148 Wed Apr 30 16:30:02 2014
.trees2014-04-30.txt.swp             110592 Tue May 27 15:52:45 2014
__log__trees.txt                        686 Tue May 27 15:54:22 2014
coor.txt                              70830 Wed Apr 30 20:57:45 2014
depths.txt                           714284 Wed Apr 30 20:57:44 2014
objects-deut-new.txt                    966 Fri Apr 25 09:53:16 2014
objects-deut-old.txt                    966 Fri Apr 25 09:51:15 2014
objects.txt                           15416 Fri Apr 25 09:47:18 2014
tgrep_result.txt                    4916101 Wed Apr 30 20:57:37 2014
tree_notabene-2014-04-30.txt         111037 Wed Apr 30 20:51:21 2014
tree_notabene.txt                    114212 Tue May 27 12:27:23 2014
trees-nocoor.txt                      13914 Mon Apr 28 09:59:30 2014
trees-nosisters.txt                   13928 Mon Apr 28 09:59:30 2014
trees-notransform.txt                   501 Mon Apr 28 09:47:29 2014
trees-simple.txt                    8310229 Tue May 27 15:54:22 2014
trees-transformed.txt                   371 Mon Apr 28 09:47:43 2014
trees.t2c                          12361153 Wed Apr 30 20:57:22 2014
trees.txt                          10621788 Tue May 27 12:28:15 2014
trees2014-04-30.txt                10626155 Wed Apr 30 20:56:26 2014
trees_fixed_20.txt                   150015 Tue May 27 12:27:23 2014
trees_random_20-2014-04-30.txt       150015 Wed Apr 30 20:51:21 2014
trees_random_20.txt                  108048 Tue May 27 12:27:23 2014

Preview¶

Here are the first lines of the output.

In [9]:

!head -n 25 {my_file('trees-simple.txt')}

(S(C(PP(pp 0)(n 1))(VP(vb 2))(NP(n 3))(PP(SU(pp 4)(dt 5)(n 6))(cj 7)(SU(pp 8)(dt 9)(n 10)))))	ב ראשׁית ברא אלהים את ה שׁמים ו את ה ארץ	 GEN 01,01
(S(C(CP(cj 0))(NP(dt 1)(n 2))(VP(vb 3))(NP(SU(n 4))(cj 5)(SU(n 6)))))	ו ה ארץ היתה תהו ו בהו	 GEN 01,02
(S(C(CP(cj 0))(NP(n 1))(PP(pp 2)(SU(n 3))(SU(n 4)))))	ו חשׁך על פני תהום	 GEN 01,02
(S(C(CP(cj 0))(NP(SU(n 1))(SU(n 2)))(VP(vb 3))(PP(pp 4)(SU(n 5))(SU(dt 6)(n 7)))))	ו רוח אלהים מרחפת על פני ה מים	 GEN 01,02
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	 GEN 01,03
(S(C(VP(vb 0))(NP(n 1))))	יהי אור	 GEN 01,03
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי אור	 GEN 01,03
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5)))(Objc(CP(cj 6))(VP(vb 7))))	ו ירא אלהים את ה אור כי טוב	 GEN 01,04
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(SU(n 3)(dt 4)(n 5))(cj 6)(SU(n 7)(dt 8)(n 9)))))	ו יבדל אלהים בין ה אור ו בין ה חשׁך	 GEN 01,04
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))(NP(n 6))))	ו יקרא אלהים ל  אור יום	 GEN 01,05
(S(C(CP(cj 0))(PP(pp 1)(dt 2)(n 3))(VP(vb 4))(NP(n 5))))	ו ל  חשׁך קרא לילה	 GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי ערב	 GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי בקר	 GEN 01,05
(S(C(NP(SU(n 0))(SU(n 1)))))	יום אחד	 GEN 01,05
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	 GEN 01,06
(S(C(VP(vb 0))(NP(n 1))(PP(pp 2)(SU(n 3))(SU(dt 4)(n 5)))))	יהי רקיע ב תוך ה מים	 GEN 01,06
(S(C(CP(cj 0))(VP(vb 1))(VP(vb 2))(PP(n 3)(n 4)(pp 5)(n 6))))	ו יהי מבדיל בין מים ל מים	 GEN 01,06
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))))	ו יעשׂ אלהים את ה רקיע	 GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(PP(n 2)(dt 3)(n 4))(CP(cj 11))(PP(n 12)(dt 13)(n 14)))(Attr(CP(cj 5))(PP(pp 6)(n 7)(pp 8)(dt 9)(n 10)))(Attr(CP(cj 15))(PP(pp 16)(pp 17)(pp 18)(dt 19)(n 20))))	ו יבדל בין ה מים אשׁר מ תחת ל  רקיע ו בין ה מים אשׁר מ על ל  רקיע	 GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(AdvP(av 2))))	ו יהי כן	 GEN 01,07
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))(PP(pp 3)(dt 4)(n 5))(NP(n 6))))	ו יקרא אלהים ל  רקיע שׁמים	 GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי ערב	 GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יהי בקר	 GEN 01,08
(S(C(NP(SU(n 0))(SU(aj 1)))))	יום שׁני	 GEN 01,08
(S(C(CP(cj 0))(VP(vb 1))(NP(n 2))))	ו יאמר אלהים	 GEN 01,09
