It is possible to make your own annotations to the existing nodes and edges of a LAF resource, and use them later on.
This notebook shows how that can be done.
The original LAF source will not be changed.
You make your own annotations and they will be saved in a file. By putting this file in a location where LAF-Fabric can find it, and by instructing LAF-Fabric to include this file, you can do analysis on the original source plus your new annotations.
Annotations are organized by annotation spaces. If you choose a space that differs from the annotation spaces in the main source, your own annotations will be distinguishable from the original ones.
But you can also choose to override original annotations with your own.
In that case you have to create your annotations in the shebanq space.
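The difference between adding and overriding can be pictured with a small sketch. It is not LAF-Fabric code: the dictionaries, the `lookup` helper and the sample values are assumptions made for illustration only. The sketch assumes that a feature is identified by its annotation space, label and name, and that values from your own annotations are consulted before those of the main source.

```python
# Conceptual sketch only: plain dictionaries stand in for compiled feature data.
# Keys are (annotation space, label, name); values map node ids to feature values.
main_source = {
    ("shebanq", "ft", "sp"): {"n20": "prep", "n27": "prep"},
}
extra_annotations = {
    # Same space, label and name as the main source: this value overrides.
    ("shebanq", "ft", "sp"): {"n20": "particle"},
    # A space of your own: this value lives next to the original ones.
    ("dirk", "part", "intro"): {"n20": "aap"},
}


def lookup(feature, node):
    """Return the extra value if there is one, else the main value, else None."""
    extra = extra_annotations.get(feature, {})
    if node in extra:
        return extra[node]
    return main_source.get(feature, {}).get(node)


overridden = lookup(("shebanq", "ft", "sp"), "n20")
untouched = lookup(("shebanq", "ft", "sp"), "n27")
added = lookup(("dirk", "part", "intro"), "n20")
```

With a space of your own, original and new values never collide; with the space of the main source, your value replaces the original one for the nodes you annotate.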
This notebook is not honed by practice yet. There are clumsy steps, such as manually copying files into directories and editing certain header files.
That said, this notebook performs the LAF-specific steps, and further adaptations do not require deep dives into LAF-Fabric.
In order to run this notebook, you need an extra annotation package called testparticipants on your system.
If you have downloaded the data from the given link, you have that directory.
import sys
import collections
import shutil
import pandas
from IPython.display import display
pandas.set_option('display.notebook_repr_html', True)
from laf.fabric import LafFabric
from etcbc.annotating import GenForm
fabric = LafFabric()
0.00s This is LAF-Fabric 4.3.3 http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html
API = fabric.load('etcbc4', '--', 'annox_workflow', {
    "primary": True,
    "xmlids": {"node": True, "edge": False},
    "features": ("otype oid monads typ sp book chapter verse", ""),
}, verbose='DETAIL')
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ...
0.00s DETAIL: COMPILING m: UP TO DATE
0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08
0.00s DETAIL: COMPILING a: UP TO DATE
0.01s DETAIL: load main: P.node_anchor
0.06s DETAIL: load main: P.node_anchor_items
0.30s DETAIL: load main: G.node_anchor_min
0.36s DETAIL: load main: G.node_anchor_max
0.41s DETAIL: load main: P.node_events
0.49s DETAIL: load main: P.node_events_items
0.77s DETAIL: load main: P.node_events_k
0.85s DETAIL: load main: P.node_events_n
0.99s DETAIL: load main: G.node_sort
1.04s DETAIL: load main: G.node_sort_inv
1.44s DETAIL: load main: G.edges_from
1.51s DETAIL: load main: G.edges_to
1.58s DETAIL: load main: P.primary_data
1.63s DETAIL: load main: X. [node] ->
2.80s DETAIL: load main: X. [node] <-
3.51s DETAIL: load main: F.etcbc4_db_monads [node]
4.44s DETAIL: load main: F.etcbc4_db_oid [node]
5.30s DETAIL: load main: F.etcbc4_db_otype [node]
5.99s DETAIL: load main: F.etcbc4_ft_sp [node]
6.20s DETAIL: load main: F.etcbc4_ft_typ [node]
6.56s DETAIL: load main: F.etcbc4_sft_book [node]
6.58s DETAIL: load main: F.etcbc4_sft_chapter [node]
6.59s DETAIL: load main: F.etcbc4_sft_verse [node]
6.61s LOGFILE=/Users/dirk/laf-fabric-output/etcbc4/annox_workflow/__log__annox_workflow.txt
6.61s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX -- FOR TASK annox_workflow AT 2014-07-15T13-57-51
This is the workflow for new annotation data:
In the configuration dictionary below, you specify a spreadsheet with rows and columns.
On each row you see the textual representation of the objects you specify, from the passages you specify. You can also ask for other columns with feature information for reference. There are extra columns for new features, with names that you specify.
The new features in this example are dirk_part_intro and dirk_part_role.
form = GenForm(API, "dirk_intro_role", {
    'target_types': [
        'word',
        'phrase',
    ],
    'show_features': {
        'etcbc4': {
            'node': [
                "ft.typ,sp",
            ],
        },
    },
    'new_features': {
        'dirk': {
            'node': [
                "part.intro,role",
            ],
        },
    },
    'passages': {
        'Genesis': '1-3',
        'Jesaia': '40,66',
    },
})
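The compact feature strings in this configuration combine one label with several feature names. The sketch below shows how such a string relates to the full feature names that appear in the loading log, such as `etcbc4_ft_typ` and `dirk_part_intro`. The helper `expand_feature_spec` is hypothetical and not part of LAF-Fabric; it only illustrates the naming pattern, assuming the form `label.name1,name2`.

```python
def expand_feature_spec(space, spec):
    """Expand 'label.name1,name2' into full names of the form space_label_name."""
    label, names = spec.split(".", 1)
    return ["{}_{}_{}".format(space, label, name) for name in names.split(",")]


# The two specifications used in the form configuration.
shown_features = expand_feature_spec("etcbc4", "ft.typ,sp")
new_features = expand_feature_spec("dirk", "part.intro,role")
```

Here `shown_features` corresponds to the reference columns of the form, and `new_features` to the empty columns you fill in.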
Run the make_form function:
form.make_form()
8.02s Reading the data ... Genesis1,2,3,**********Jesaia40,66,*************************** 9.79s Done
Look at the generated form as a table:
form_data = pandas.read_csv(my_file("form_{}.csv".format(form.name)), sep='\t', na_filter=False)
form_data.head(15)
| | passage | word | phrase | typ | sp | dirk:part.intro | dirk:part.role |
|---|---|---|---|---|---|---|---|
| 0 | #Genesis 1:1 | | | | | | |
| 1 | n7 | | בְּרֵאשִׁ֖ית | PP | | | |
| 2 | n5 | בְּ | | | prep | | |
| 3 | n12 | רֵאשִׁ֖ית | | | subs | | |
| 4 | n13 | בָּרָ֣א | | | verb | | |
| 5 | n15 | | בָּרָ֣א | VP | | | |
| 6 | n16 | אֱלֹהִ֑ים | | | subs | | |
| 7 | n18 | | אֱלֹהִ֑ים | NP | | | |
| 8 | n22 | | אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ | PP | | | |
| 9 | n20 | אֵ֥ת | | | prep | | |
| 10 | n23 | הַ | | | art | | |
| 11 | n24 | שָּׁמַ֖יִם | | | subs | | |
| 12 | n26 | וְ | | | conj | | |
| 13 | n27 | אֵ֥ת | | | prep | | |
| 14 | n28 | הָ | | | art | | |

15 rows × 7 columns
It is time to fill in your form. First we rename it, so that a new blank form created later will not overwrite the data you have already filled in.
We replace the form prefix in the file name by data.
my_form = my_file("form_{}.csv".format(form.name))
my_data = my_file("data_{}.csv".format(form.name))
shutil.move(my_form, my_data)
'/Users/dirk/laf-fabric-output/etcbc4/annox_workflow/data_dirk_intro_role.csv'
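Note that the rename itself can replace an older data file: on many systems `shutil.move` silently overwrites an existing destination file. A minimal sketch of a more careful rename follows. The helper `move_without_overwriting` and the file names in the demonstration are hypothetical, not part of LAF-Fabric.

```python
import os
import shutil
import tempfile


def move_without_overwriting(src, dst):
    """Move src to dst, but refuse to replace an existing dst."""
    if os.path.exists(dst):
        raise FileExistsError("{} already exists; move it aside first".format(dst))
    return shutil.move(src, dst)


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as workdir:
    form_path = os.path.join(workdir, "form_demo.csv")
    data_path = os.path.join(workdir, "data_demo.csv")

    with open(form_path, "w", encoding="utf-8") as handle:
        handle.write("blank form\n")
    move_without_overwriting(form_path, data_path)
    first_move_ok = os.path.exists(data_path) and not os.path.exists(form_path)

    # A fresh blank form must not silently replace the filled-in data.
    with open(form_path, "w", encoding="utf-8") as handle:
        handle.write("blank form\n")
    try:
        move_without_overwriting(form_path, data_path)
        second_move_refused = False
    except FileExistsError:
        second_move_refused = True
```

In the notebook you could call such a helper with `my_form` and `my_data` instead of calling `shutil.move` directly.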
Open the data file as a spreadsheet.
OpenOffice is recommended for that, because it handles Unicode well.
Fill in your feature values as desired and save.
On Mac OS X we can open the spreadsheet straight away:
!open /Applications/OpenOffice.app --args {my_data}
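If you prefer not to use a spreadsheet program, the values can also be filled in by a script. The following is a minimal sketch using the standard `csv` module. It assumes that the data file is tab-separated and that the first column is called `passage` and holds the node identifiers, as the table preview suggests. The sample content and the values are stand-ins, not real data.

```python
import csv
import io

# A tiny stand-in for the data file: tab-separated, first column holds node ids.
sample = (
    "passage\tsp\tdirk:part.intro\tdirk:part.role\n"
    "#Genesis 1:1\t\t\t\n"
    "n20\tprep\t\t\n"
    "n27\tprep\t\t\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
columns = reader.fieldnames
rows = list(reader)

# Hypothetical values to fill in, keyed by node id and column name.
values = {
    ("n20", "dirk:part.intro"): "aap",
    ("n27", "dirk:part.role"): "noot",
}
for row in rows:
    for (node_id, column), value in values.items():
        if row["passage"] == node_id:
            row[column] = value

# Write the table back in the same tab-separated layout.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
filled = buffer.getvalue()
```

For a real file, open it with `encoding="utf-8"` and `newline=""` instead of using `io.StringIO`, so that the Hebrew text and the line endings survive unchanged.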
Here you see the latest content of the form:
form_data = pandas.read_csv(my_file("data_{}.csv".format(form.name)), sep='\t', na_filter=False)
form_data.head(15)
| | passage | word | phrase | typ | sp | dirk:part.intro | dirk:part.role |
|---|---|---|---|---|---|---|---|
| 0 | #Genesis 1:1 | | | | | | |
| 1 | n7 | | בְּרֵאשִׁ֖ית | PP | | | |
| 2 | n5 | בְּ | | | prep | | |
| 3 | n12 | רֵאשִׁ֖ית | | | subs | | |
| 4 | n13 | בָּרָ֣א | | | verb | | |
| 5 | n15 | | בָּרָ֣א | VP | | | |
| 6 | n16 | אֱלֹהִ֑ים | | | subs | | |
| 7 | n18 | | אֱלֹהִ֑ים | NP | | | |
| 8 | n22 | | אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ | PP | | | |
| 9 | n20 | אֵ֥ת | | | prep | aap | |
| 10 | n23 | הַ | | | art | | |
| 11 | n24 | שָּׁמַ֖יִם | | | subs | | |
| 12 | n26 | וְ | | | conj | | |
| 13 | n27 | אֵ֥ת | | | prep | | noot |
| 14 | n28 | הָ | | | art | | |

15 rows × 7 columns
The data file is now turned into an XML file with annotations to the original LAF resource.
The following method does that.
form.make_annots()
my_annots = my_file("annot_{}.xml".format(form.name))
Look at the freshly created annotations.
!cat {my_annots}
<?xml version="1.0" encoding="UTF-8"?>
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/" xmlns:graf="http://www.xces.org/ns/GrAF/1.0/">
<graphHeader>
    <labelsDecl/>
    <dependencies/>
    <annotationSpaces/>
</graphHeader>
<a xml:id="a1" as="dirk" label="part" ref="n27"><fs>
    <f name="role" value="noot"/>
</fs></a>
<a xml:id="a2" as="dirk" label="part" ref="n47"><fs>
    <f name="intro" value="mies"/>
    <f name="role" value="karel"/>
</fs></a>
<a xml:id="a3" as="dirk" label="part" ref="n20"><fs>
    <f name="intro" value="aap"/>
</fs></a>
</graph>
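To check the generated annotations without reading the XML by eye, you could parse the file with the standard library. The sketch below is independent of LAF-Fabric. The helper `read_annotations` is hypothetical, and the embedded sample simply repeats the output shown above.

```python
import xml.etree.ElementTree as ET

# Elements live in the GrAF namespace declared on the <graph> element.
GRAF = "{http://www.xces.org/ns/GrAF/1.0/}"

sample = """<?xml version="1.0" encoding="UTF-8"?>
<graph xmlns="http://www.xces.org/ns/GrAF/1.0/" xmlns:graf="http://www.xces.org/ns/GrAF/1.0/">
<graphHeader><labelsDecl/><dependencies/><annotationSpaces/></graphHeader>
<a xml:id="a1" as="dirk" label="part" ref="n27"><fs>
    <f name="role" value="noot"/>
</fs></a>
<a xml:id="a2" as="dirk" label="part" ref="n47"><fs>
    <f name="intro" value="mies"/>
    <f name="role" value="karel"/>
</fs></a>
<a xml:id="a3" as="dirk" label="part" ref="n20"><fs>
    <f name="intro" value="aap"/>
</fs></a>
</graph>"""


def read_annotations(xml_bytes):
    """Return (target node, space, label, features) for every <a> element."""
    root = ET.fromstring(xml_bytes)
    result = []
    for annot in root.iter(GRAF + "a"):
        features = {f.get("name"): f.get("value") for f in annot.iter(GRAF + "f")}
        result.append((annot.get("ref"), annot.get("as"), annot.get("label"), features))
    return result


annotations = read_annotations(sample.encode("utf-8"))
```

Each annotation targets one node through its `ref` attribute, and the `as` and `label` attributes together with the feature name give the full feature name, for example `dirk_part_intro`.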
LAF-Fabric looks for extra annotations in a dedicated directory, which we call the annox directory. Here it is:
fabric.lafapi.names.env['a_source_dir'][0:-3]
'/Users/dirk/laf-fabric-data/etcbc4/annotations'
Extra annotations are organized in packages. You can place multiple new annotation files in a package.
Let us add this file to the existing package called testparticipants.
A package is a subdirectory of the annox directory.
We put the new file in this subdirectory:
shutil.copy(
    my_annots,
    "{}/testparticipants".format(fabric.lafapi.names.env['a_source_dir'][0:-3])
)
'/Users/dirk/laf-fabric-data/etcbc4/annotations/testparticipants/annot_dirk_intro_role.xml'
If you have more files, you can place them in the same directory. For each file you have to add a line to the header file in that directory, like the existing line, resulting in:
<annotation f.id="f_dirk1" loc="annot_dirk_intro_role.xml"/>
<annotation f.id="f_dirk2" loc="annot_dirk_other.xml"/>
The point is that every file in the package must be mentioned in the header file.
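Writing these lines by hand is error-prone when a package grows. The following sketch generates one line per annotation file in a directory. It assumes that annotation files are named `annot_*.xml` and that identifiers follow the `f_dirk1`, `f_dirk2` pattern shown above. The helper `header_lines` is hypothetical, and you still have to paste the result into the header file yourself.

```python
import tempfile
from pathlib import Path


def header_lines(package_dir, prefix="f_dirk"):
    """Build one <annotation .../> line for every annot_*.xml file in package_dir."""
    names = sorted(path.name for path in Path(package_dir).glob("annot_*.xml"))
    return [
        '<annotation f.id="{}{}" loc="{}"/>'.format(prefix, number, name)
        for number, name in enumerate(names, start=1)
    ]


# Demonstration with two empty stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as package_dir:
    for name in ("annot_dirk_intro_role.xml", "annot_dirk_other.xml"):
        (Path(package_dir) / name).write_text("<graph/>", encoding="utf-8")
    lines = header_lines(package_dir)
```

The files are sorted by name so that the numbering stays stable between runs.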
You can invoke an extra annotation package by mentioning it in the statement where you initialize the processor.
The processor looks for the package, checks whether it has to be compiled, and if so, compiles it.
Then the data corresponding to testparticipants is loaded after the main source has been loaded.
Whatever task you perform, it will have access to the new annotations.
fabric.load('etcbc4', 'testparticipants', 'annox_workflow', {
    "primary": True,
    "xmlids": {
        "node": True,
        "edge": False,
    },
    "features": {
        "etcbc4": {
            "node": [
                "db.otype",
                "sft.label",
            ],
            "edge": [
            ],
        },
        "dirk": {
            "node": [
                "part.intro,role",
            ],
        }
    },
})
exec(fabric.localnames.format(var='fabric'))
0.00s LOADING API: please wait ...
0.00s DETAIL: COMPILING m: UP TO DATE
0.00s INFO: USING DATA COMPILED AT: 2014-07-14T16-45-08
0.00s BEGIN COMPILE a: testparticipants
0.00s DETAIL: load main: X. [node] ->
1.35s DETAIL: load main: X. [e] ->
3.51s DETAIL: load main: G.node_anchor_min
3.57s DETAIL: load main: G.node_anchor_max
3.62s DETAIL: load main: G.node_sort
3.67s DETAIL: load main: G.node_sort_inv
4.21s DETAIL: load main: G.edges_from
4.28s DETAIL: load main: G.edges_to
4.35s LOGFILE=/Users/dirk/laf-fabric-data/etcbc4/bin/A/testparticipants/__log__compile__.txt
4.35s PARSING ANNOTATION FILES
4.36s INFO: parsing annot_dirk_intro_role.xml
4.36s INFO: END PARSING
0 good regions and 0 faulty ones
0 linked nodes and 0 unlinked ones
0 good edges and 0 faulty ones
3 good annots and 0 faulty ones
4 good features and 0 faulty ones
3 distinct xml identifiers
4.36s MODELING RESULT FILES
4.36s INFO: CONNECTIVITY
4.55s WRITING RESULT FILES for a
4.56s DETAIL: write annox: F.dirk_part_intro [node]
4.56s DETAIL: write annox: F.dirk_part_role [node]
4.56s END COMPILE a: testparticipants
4.66s INFO: USING DATA COMPILED AT: 2014-07-15T13-59-19
4.66s DETAIL: keep main: P.node_anchor
4.66s DETAIL: keep main: P.node_anchor_items
4.66s DETAIL: keep main: G.node_anchor_min
4.66s DETAIL: keep main: G.node_anchor_max
4.67s DETAIL: keep main: P.node_events
4.67s DETAIL: keep main: P.node_events_items
4.67s DETAIL: keep main: P.node_events_k
4.67s DETAIL: keep main: P.node_events_n
4.67s DETAIL: keep main: G.node_sort
4.67s DETAIL: keep main: G.node_sort_inv
4.67s DETAIL: keep main: G.edges_from
4.67s DETAIL: keep main: G.edges_to
4.67s DETAIL: keep main: P.primary_data
4.67s DETAIL: keep main: X. [node] ->
4.67s DETAIL: keep main: X. [node] <-
4.67s DETAIL: keep main: F.etcbc4_db_otype [node]
4.67s DETAIL: clear main: F.etcbc4_db_monads [node]
4.67s DETAIL: clear main: F.etcbc4_db_oid [node]
4.67s DETAIL: clear main: F.etcbc4_ft_sp [node]
4.67s DETAIL: clear main: F.etcbc4_ft_typ [node]
4.67s DETAIL: clear main: F.etcbc4_sft_book [node]
4.67s DETAIL: clear main: F.etcbc4_sft_chapter [node]
4.67s DETAIL: clear main: F.etcbc4_sft_verse [node]
4.67s DETAIL: load main: F.dirk_part_intro [node]
4.68s DETAIL: load main: F.dirk_part_role [node]
4.68s DETAIL: load main: F.etcbc4_sft_label [node]
4.69s DETAIL: load annox: F.dirk_part_intro [node]
4.70s DETAIL: load annox: F.dirk_part_role [node]
4.70s DETAIL: load annox: F.etcbc4_db_otype [node]
4.70s DETAIL: load annox: F.etcbc4_sft_label [node]
4.70s INFO: DATA LOADED FROM SOURCE etcbc4 AND ANNOX testparticipants FOR TASK annox_workflow AT 2014-07-15T13-59-19
So let us check which objects have got annotations.
For every object we show its type, its XML id in the LAF source, its primary data and the two new feature values, if applicable.
msg("Looking for fresh annotations ...")
cur_verse = None
for node in NN():
    otype = F.otype.v(node)
    # Remember the most recent verse node, so that we can report its label.
    if otype == 'verse':
        cur_verse = node
        continue
    intro = F.dirk_part_intro.v(node)
    role = F.dirk_part_role.v(node)
    # Only report nodes that carry at least one of the new features.
    if intro is not None or role is not None:
        verse = F.label.v(cur_verse)
        text = " ".join([txt for (n, txt) in P.data(node)])
        xmlid = X.r(node)
        print("{:<12} {:<6} id={:<8} {:<17}{:<16} {:<20}".format(
            verse,
            otype,
            xmlid,
            "intro={:<10} ".format(intro) if intro is not None else '',
            "role={:<10} ".format(role) if role is not None else '',
            text,
        ))
msg("Done")
10s Looking for fresh annotations ...
15s Done
GEN 01,01 word id=n20 intro=aap אֵ֥ת
GEN 01,01 word id=n27 role=noot אֵ֥ת
GEN 01,02 word id=n47 intro=mies role=karel תֹ֨הוּ֙