algebraixlib
Data Algebra Can Model XML
This IPython notebook demonstrates that data algebra can be used to query XML. Here you will find a few simple examples of how to use algebraixlib
to create data algebra queries that are equivalent to XPath and XQuery queries.
We will use the XML file regions.xml, which is a transformation and aggregation of the SQL TPC-H (http://www.tpc.org/tpch/) region and nation tables into an XML document. A subset of the file follows:
from algebraixlib.io.xml import xml_to_str
from algebraixlib.util.miscellaneous import core
print(core(xml_to_str('TPC-H_Query5/regions.xml'), 985, 11))
<?xml version="1.0" ?> <regions> <region> <name>AFRICA</name> <regionkey>0</regionkey> <comment>lar deposits. blithely final packages cajole. regular waters are final requests. regular a...</comment> <nation> <name>ALGERIA</name> <nationkey>0</nationkey> <comment> haggle. carefully final deposits detect slyly agai</comment> </nation> <nation> <name>ETHIOPIA</name> <nationkey>5</nationkey> <comment>ven packages wake quickly. regu</comment> </nation> <nation> <name>KENYA</name> <nationkey>14</nationkey> <comment> pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t</comment> </nation> <nation> <name>MOROCCO</name> <nationkey>15</nationkey> <comment>rns. blithely bold courts among the closely regular packages use furiously bold platelets?</comment> </nation> ... </regions>
algebraixlib
supports importing XML data and transforming it into data algebra MathObjects. This is done using the XML module's import_xml
operation. The result of importing regions.xml
using import_xml
is a relation of relations.
Here's a subset of regions.xml
shown in data algebra representation. Note that, for simplicity's sake, the MathObjects used in this notebook do not maintain order or multiplicity. For this reason, the elements of the following MathObjects may appear in a different order than in the XML document, and duplicates are not preserved.
from algebraixlib.util.mathobjectprinter import mo_to_str
from algebraixlib.io.xml import import_xml
regions_document = import_xml('TPC-H_Query5/regions.xml', convert_numerics=True)
print('regions_document:\n' + core(mo_to_str(regions_document), 1810, 3))
regions_document: Set({ Couplet(left=Atom('regions'), right= Set({ Couplet(left=Atom('region'), right= Set({ Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(4)) Couplet(left=Atom('name'), right=Atom('EGYPT')) Couplet(left=Atom('comment'), right=Atom('y above the carefully unusu...) }) ) Couplet(left=Atom('name'), right=Atom('MIDDLE EAST')) Couplet(left=Atom('comment'), right=Atom('uickly special accounts cajole ca...) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(20)) Couplet(left=Atom('name'), right=Atom('SAUDI ARABIA')) Couplet(left=Atom('comment'), right=Atom('ts. silent requests haggle....) }) ) Couplet(left=Atom('regionkey'), right=Atom(4)) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('comment'), right=Atom('efully alongside of the sly...) Couplet(left=Atom('nationkey'), right=Atom(10)) Couplet(left=Atom('name'), right=Atom('IRAN')) }) ) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(13)) Couplet(left=Atom('comment'), right=Atom('ic deposits are blithely ab...) Couplet(left=Atom('name'), right=Atom('JORDAN')) }) ) Couplet(left=Atom('nation'), right= Set( ... })
Now that we have the XML document loaded, let's start with a very simple XPath statement to extract all of the regions' data:
/regions/region
This is translated using data algebra notation into the following:
$ \hspace{16pt} regions = \{right(R)\ :\ R \in regions\_document(regions) \text{ and } left(R) = region \} $
This expresses that regions
is the result of first retrieving from regions_document
the right component associated with the left component with the value regions
followed by retrieving all of the right components associated with the left component with the value region
.
Because XML requires a root element, properly formed XML documents are naturally a function at the top level. So we can refer to regions_document
as a Functional MathObject and we can use the function operator ()
which allows retrieving a right component associated with a left component but only on Functional MathObjects. Using this operator on regions_document
with the value regions
we can retrieve the relation containing all of the regions' data.
From this resulting relation we can extract all of its components into a set using the bracket operator []. The bracket operator, used with a relation or clan, retrieves the right components associated with a left component. Note the distinction: the function operator is used on a function to extract one right component, while the bracket operator is used on a relation or clan to retrieve a set of all of the matching right components. So using the bracket operator on the result of regions_document('regions')
, a relation, with the value region
extracts each of the separate regions' data into a set.
algebraixlib
implementation:
regions = regions_document('regions')['region']
print('regions:\n' + core(mo_to_str(regions), 2000, 8))
regions: Set({ Set({ Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(4)) Couplet(left=Atom('name'), right=Atom('EGYPT')) Couplet(left=Atom('comment'), right=Atom('y above the carefully unusual theodo...) }) ) Couplet(left=Atom('name'), right=Atom('MIDDLE EAST')) Couplet(left=Atom('comment'), right=Atom('uickly special accounts cajole carefully b...) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(20)) Couplet(left=Atom('name'), right=Atom('SAUDI ARABIA')) Couplet(left=Atom('comment'), right=Atom('ts. silent requests haggle. closely ...) }) ) Couplet(left=Atom('regionkey'), right=Atom(4)) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('comment'), right=Atom('efully alongside of the slyly final ...) Couplet(left=Atom('nationkey'), right=Atom(10)) Couplet(left=Atom('name'), right=Atom('IRAN')) }) ) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('nationkey'), right=Atom(13)) Couplet(left=Atom('comment'), right=Atom('ic deposits are blithely about the c...) Couplet(left=Atom('name'), right=Atom('JORDAN')) }) ) Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('comment'), right=Atom('nic deposits boost atop the quickly ...) Couplet(left=Atom('name'), right=Atom('IRAQ')) Couplet(left=Atom('nationkey'), right=Atom(11)) }) ) }) Set({ Couplet(left=Atom('nation'), right= Set({ Couplet(left=Atom('name'), right=Atom('BRAZIL')) Couplet(left=Atom('nationkey'), right=Atom(2)) Couplet(left=Atom('comment'), right=Atom('y alongside of the pending deposits....) }) ) ... }) })
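To make the difference between the two operators concrete, here is a minimal pure-Python sketch. It does not use algebraixlib: couplets are modelled as (left, right) tuples and relations as frozensets of couplets, and the helper names apply_function and bracket are hypothetical stand-ins for the () and [] operators described above.

```python
# Pure-Python model, NOT the algebraixlib API. Assumptions for illustration:
#   couplet  -> (left, right) tuple
#   relation -> frozenset of couplets

def apply_function(relation, left):
    """Model of the function operator (): the single right component for `left`."""
    rights = [rgt for (lft, rgt) in relation if lft == left]
    if len(rights) != 1:
        raise ValueError("relation is not functional at %r" % (left,))
    return rights[0]

def bracket(relation, left):
    """Model of the bracket operator []: the set of all right components for `left`."""
    return frozenset(rgt for (lft, rgt) in relation if lft == left)

africa = frozenset({("name", "AFRICA"), ("regionkey", 0)})
middle_east = frozenset({("name", "MIDDLE EAST"), ("regionkey", 4)})

# The root relation has exactly one 'regions' couplet (so it is a function);
# its right component holds several 'region' couplets (so it is not).
regions_relation = frozenset({("region", africa), ("region", middle_east)})
document = frozenset({("regions", regions_relation)})

regions = bracket(apply_function(document, "regions"), "region")
print(sorted(apply_function(region, "name") for region in regions))
# ['AFRICA', 'MIDDLE EAST']
```

In this model, calling apply_function(regions_relation, "region") raises an error because two couplets share the left component region, which mirrors why the bracket operator is needed at that level.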
Next we process an XPath statement to retrieve all of the region keys:
/regions/region/regionkey
$ \hspace{16pt} region\_keys = \{right(R)\ :\ R \in regions \text{ and } left(R) = regionkey \} $
This expresses that region_keys
is the result of retrieving all of the right components from the relations in regions
where regionkey
is a left component.
In the implementation we use the square bracket operator to retrieve all of the right components for the regionkey
left components from the relations in the regions
clan. This results in a set of atoms, each atom a region key, for all of the region keys.
algebraixlib
Implementation
region_keys = regions['regionkey']
print('region_keys: ' + mo_to_str(region_keys))
region_keys: Set({Atom(4), Atom(0), Atom(1), Atom(2), Atom(3)})
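As a rough illustration of what the bracket operator does when applied to a clan, the sketch below models it with plain Python frozensets. It is not the algebraixlib implementation: couplets are (left, right) tuples, relations are frozensets of couplets, a clan is a frozenset of relations, and the helper name clan_bracket is hypothetical.

```python
# Pure-Python model, NOT the algebraixlib API (clan_bracket is a hypothetical helper).

def clan_bracket(clan, left):
    """Collect the right component of every couplet, in every relation, whose left is `left`."""
    return frozenset(rgt for relation in clan for (lft, rgt) in relation if lft == left)

# Two region relations, using names and keys that appear in the outputs above.
regions = frozenset({
    frozenset({("name", "AFRICA"), ("regionkey", 0)}),
    frozenset({("name", "MIDDLE EAST"), ("regionkey", 4)}),
})

region_keys = clan_bracket(regions, "regionkey")
print(sorted(region_keys))  # [0, 4]
```

Because the result is a set, a key that occurred in several relations would appear only once, which matches the note above that order and multiplicity are not maintained.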
Next we process an XPath statement to retrieve all of the nation names:
/regions/region/nation/name
$ \hspace{16pt} region\_nations = \{right(cp)\ :\ R \in regions \text{ and } cp \in R \text{ and } left(cp) = nation \} $
This expresses that region_nations
is the result of retrieving all of the relations from regions
, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value nation
.
$ \hspace{16pt} nation\_names = \{right(cp)\ :\ R \in region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} $
This expresses that nation_names
is the result of retrieving all of the relations from region_nations
, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value name
.
In the implementation we use the square bracket operator twice: first to retrieve a clan of all of the right components where the left components of the couplets in the relations in regions have the value nation, then to retrieve from that clan a set of all of the right components where the left components of the couplets in the relations have the value name. This results in a set of atoms where each atom is a nation name.
algebraixlib
Implementation
nation_names = regions['nation']['name']
print('nation_names:\n' + mo_to_str(nation_names))
nation_names: Set({Atom('JAPAN'), Atom('EGYPT'), Atom('MOROCCO'), Atom('ETHIOPIA'), Atom('MOZAMBIQUE'), Atom('ARGENTINA'), Atom('INDONESIA'), Atom('SAUDI ARABIA'), Atom('KENYA'), Atom('IRAN'), Atom('JORDAN'), Atom('GERMANY'), Atom('ALGERIA'), Atom('CANADA'), Atom('FRANCE'), Atom('INDIA'), Atom('BRAZIL'), Atom('CHINA'), Atom('UNITED KINGDOM'), Atom('PERU'), Atom('UNITED STATES'), Atom('ROMANIA'), Atom('VIETNAM'), Atom('RUSSIA'), Atom('IRAQ')})
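For comparison, the XPath statements used so far can be evaluated directly with Python's standard library. The sketch below is only a cross-check, not part of the notebook's data algebra approach: it runs xml.etree.ElementTree over a small inline sample (an assumption, built from a few values shown in the excerpts above) rather than over the full regions.xml file.

```python
import xml.etree.ElementTree as ET

# Small inline sample (an assumption); the real notebook reads TPC-H_Query5/regions.xml.
SAMPLE = """<regions>
  <region><name>AFRICA</name><regionkey>0</regionkey>
    <nation><name>ALGERIA</name><nationkey>0</nationkey></nation>
    <nation><name>KENYA</name><nationkey>14</nationkey></nation>
  </region>
  <region><name>AMERICA</name><regionkey>1</regionkey>
    <nation><name>BRAZIL</name><nationkey>2</nationkey></nation>
  </region>
</regions>"""

root = ET.fromstring(SAMPLE)  # root is the <regions> element

# /regions/region/regionkey, written relative to the root element
region_keys = {int(e.text) for e in root.findall("./region/regionkey")}

# /regions/region/nation/name
nation_names = {e.text for e in root.findall("./region/nation/name")}

print(sorted(region_keys))   # [0, 1]
print(sorted(nation_names))  # ['ALGERIA', 'BRAZIL', 'KENYA']
```

Collecting the results into Python sets mirrors the set semantics of the data algebra results above.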
Let's step it up and retrieve all nation names for a particular region.
for $x in doc("regions.xml")/regions/region where $x/name='AMERICA' return $x/nation/name
$ \hspace{16pt} america\_region = regions \blacktriangleright \{\{name{\mapsto}\text{'AMERICA'}\}\} $
This expresses that america_region
is the result from using superstriction on the clan regions
with a clan that has a single relation with a single couplet with the left component 'name' and the right component 'AMERICA'.
$ \hspace{16pt} america\_region\_nations = \{right(cp)\ :\ R \in america\_region \text{ and } cp \in R \text{ and } left(cp) = nation \} $
This expresses that america_region_nations
is the result of retrieving all of the relations from america_region
, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value nation
.
$ \hspace{16pt} america\_nations\_names = \{right(cp)\ :\ R \in america\_region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} $
This expresses that america_nations_names
is the result of retrieving all of the relations from america_region_nations
, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value name
.
In our implementation we first extract the region data for all regions with the name 'AMERICA', which isolates the data used by the next operation. To do so we use the superstrict operation, which keeps only those relations in regions that are supersets of a relation in the given clan, in this case a clan with a single relation mapping 'name' to 'AMERICA'. From that result we use the square bracket operator twice: first to retrieve a clan of all of the right components where the left components of the couplets in the relations in america_region have the value nation, then to retrieve from that clan a set of all of the right components where the left components of the couplets in the relations have the value name. This results in a set of atoms, each atom a nation name, comprising all of the nation names from the region with the name 'AMERICA'.
algebraixlib
implementation:
from algebraixlib.algebras.clans import superstrict, from_dict
america_region_name = from_dict({'name': 'AMERICA'})
america_region = superstrict(regions, america_region_name)
america_nations_names = america_region['nation']['name']
print('america_nations_names:\n' + mo_to_str(america_nations_names))
america_nations_names: Set({Atom('CANADA'), Atom('ARGENTINA'), Atom('UNITED STATES'), Atom('PERU'), Atom('BRAZIL')})
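The filtering step can be sketched in plain Python to show what superstriction selects. This is an illustrative model under the same assumptions as before (tuples for couplets, frozensets for relations and clans), not the algebraixlib implementation; the functions superstrict and clan_bracket below are simplified stand-ins written for this example.

```python
# Pure-Python model, NOT the algebraixlib API.

def superstrict(clan, filter_clan):
    """Keep the relations of `clan` that are supersets of some relation in `filter_clan`."""
    return frozenset(rel for rel in clan if any(rel >= wanted for wanted in filter_clan))

def clan_bracket(clan, left):
    """Model of [] on a clan: all right components whose left is `left`."""
    return frozenset(rgt for rel in clan for (lft, rgt) in rel if lft == left)

brazil = frozenset({("name", "BRAZIL")})
canada = frozenset({("name", "CANADA")})
kenya = frozenset({("name", "KENYA")})

regions = frozenset({
    frozenset({("name", "AMERICA"), ("regionkey", 1), ("nation", brazil), ("nation", canada)}),
    frozenset({("name", "AFRICA"), ("regionkey", 0), ("nation", kenya)}),
})

# A clan with a single relation holding the single couplet name -> 'AMERICA'.
america_filter = frozenset({frozenset({("name", "AMERICA")})})

america_region = superstrict(regions, america_filter)
america_nation_names = clan_bracket(clan_bracket(america_region, "nation"), "name")
print(sorted(america_nation_names))  # ['BRAZIL', 'CANADA']
```

Only the top-level couplets of each region relation take part in the superset test, so a nation relation named 'AMERICA' nested inside a region would not cause a match.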
Now let's retrieve the nation names and their corresponding region key as pairs. This is more complicated because each region must be evaluated individually to determine the region key and nation names which are then aggregated.
for $x in doc("regions.xml")/regions/region/nation/name return <pair>{$x/../../regionkey}{$x}</pair>
Note: the following steps are performed once for each region
relation in the regions
clan.
$ \hspace{16pt} region\_key = project(\{region\}, 'regionkey') $
This expresses that region_key
is the result of using projection with the relation region
that has been turned into a clan (by embedding it into a set using curly braces '{}') and 'regionkey'. This provides the region key for this region.
$ \hspace{16pt} region\_nations = \{right(R)\ :\ R \in region \text{ and } left(R) = nation \} $
This expresses that region_nations
is the result of using the bracket operator to retrieve all of the right components from region
where nation
is a left component. This extracts the nation data out of the region data.
$ \hspace{16pt} region\_nation\_names = project(region\_nations, 'name') $
This expresses that region_nation_names
is the result of using projection with the clan region_nations
and 'name'. This produces a clan containing the nation names for this region.
$ \hspace{16pt} region\_key\_nation\_name\_pairs = region\_key \blacktriangledown region\_nation\_names $
This expresses that region_key_nation_name_pairs
is the result of using cross-union with region_key
and region_nation_names
. This results in a clan which contains each nation name and its region key.
$ \hspace{16pt} region\_key\_nation\_name\_pairs\_accumulator = region\_key\_nation\_name\_pairs\_accumulator \cup region\_key\_nation\_name\_pairs $
This expresses that region_key_nation_name_pairs_accumulator
is the result of its union with region_key_nation_name_pairs
. This produces a new clan that is the accumulation of all of the nation name and region key pairs.
In the implementation we iterate over each of the region
relations in the regions
clan. For each region
relation we insert it into a set in order to make it a clan and use projection with 'regionkey' to retrieve this region's regionkey
as a clan. Next we use the bracket operator with region
to retrieve a clan of all of the nations' data for this region. Then we reduce that data to a clan of nation name couplets by way of projection and use cross-union to pair the region key clan with the nation name clan. This pairs each nation name couplet with the region key couplet, forming a clan of this region's region key and nation name pairs. Finally, that clan is unioned into an accumulator defined outside the loop, which collects the pairs generated in every iteration.
algebraixlib
implementation:
from algebraixlib.mathobjects import Set
from algebraixlib.algebras.clans import project, cross_union
from algebraixlib.algebras.sets import union
region_key_nation_name_pairs_accumulator = Set()
for region in regions:
    region_key = project(Set(region), 'regionkey')
    region_nations = region['nation']
    region_nation_names = project(region_nations, 'name')
    region_key_nation_name_pairs = cross_union(region_key, region_nation_names)
    region_key_nation_name_pairs_accumulator = union(region_key_nation_name_pairs_accumulator, region_key_nation_name_pairs)
print('region key nation name pairs:\n' + core(mo_to_str(region_key_nation_name_pairs_accumulator), 600, 8))
region key nation name pairs: Set({ Set({ Couplet(left=Atom('regionkey'), right=Atom(0)) Couplet(left=Atom('name'), right=Atom('KENYA')) }) Set({ Couplet(left=Atom('name'), right=Atom('BRAZIL')) Couplet(left=Atom('regionkey'), right=Atom(1)) }) Set({ Couplet(left=Atom('name'), right=Atom('UNITED KINGDOM')) Couplet(left=Atom('regionkey'), right=Atom(3)) }) Set({ Couplet(left=Atom('name'), right=Atom('ARGENTINA')) Couplet(left=Atom('regionkey'), right=Atom(1)) }) Set({ Couplet(left=Atom('regionkey'), right=Atom(2)) Couplet(left=Atom('name' ... }) })
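The per-region steps can be traced on a tiny example with plain Python sets. This is a simplified model and not the algebraixlib implementation: project and cross_union below are hypothetical stand-ins that follow the descriptions above (projection keeps only couplets with the given left components; cross-union unions every relation of one clan with every relation of the other).

```python
# Pure-Python model, NOT the algebraixlib API.
# couplet = (left, right) tuple, relation = frozenset of couplets, clan = frozenset of relations.

def project(clan, *lefts):
    """In every relation, keep only the couplets whose left component is in `lefts`."""
    return frozenset(frozenset(c for c in rel if c[0] in lefts) for rel in clan)

def cross_union(clan1, clan2):
    """Union every relation of `clan1` with every relation of `clan2`."""
    return frozenset(r1 | r2 for r1 in clan1 for r2 in clan2)

algeria = frozenset({("name", "ALGERIA"), ("nationkey", 0)})
kenya = frozenset({("name", "KENYA"), ("nationkey", 14)})
brazil = frozenset({("name", "BRAZIL"), ("nationkey", 2)})

regions = frozenset({
    frozenset({("name", "AFRICA"), ("regionkey", 0), ("nation", algeria), ("nation", kenya)}),
    frozenset({("name", "AMERICA"), ("regionkey", 1), ("nation", brazil)}),
})

pairs_accumulator = frozenset()
for region in regions:
    # Wrap the relation in a set so it becomes a one-element clan.
    region_key = project(frozenset({region}), "regionkey")
    # All right components whose left is 'nation': a clan of nation relations.
    region_nations = frozenset(rgt for (lft, rgt) in region if lft == "nation")
    region_nation_names = project(region_nations, "name")
    pairs_accumulator |= cross_union(region_key, region_nation_names)

print(sorted((dict(rel)["name"], dict(rel)["regionkey"]) for rel in pairs_accumulator))
# [('ALGERIA', 0), ('BRAZIL', 1), ('KENYA', 0)]
```

Since region_key holds exactly one relation per region, the cross-union produces one pair per nation in that region.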
For our final query we retrieve the region name of a region which includes a particular nation name.
for $x in doc("regions.xml")/regions/region/nation[name="UNITED STATES"] return $x/../name
We can build upon our last example by reusing its resulting collection of region key and nation name pairs.
$ \hspace{16pt} us\_region\_key\_nation\_name\_pair = region\_key\_nation\_name\_pairs \blacktriangleright \{\{'name'{\mapsto}\text{'UNITED STATES'}\}\} $
This expresses that us_region_key_nation_name_pair
is the result of using superstriction with region_key_nation_name_pairs
and the clan with a single element with 'name' for the left component and 'UNITED STATES' for the right component. This has the effect of eliminating all of the region key and nation name pairs except the one with the nation name which is 'UNITED STATES'.
$ \hspace{16pt} us\_region\_key = project(us\_region\_key\_nation\_name\_pair, 'regionkey') $
This expresses that us_region_key
is the result of using projection with the clan us_region_key_nation_name_pair
and the diagonal with the element 'regionkey'. This produces a clan containing the region key associated with the nation named 'UNITED STATES'.
$ \hspace{16pt} us\_region = regions \blacktriangleright us\_region\_key $
This expresses that us_region
is the result of using superstriction with regions
and the clan us_region_key
. This produces a clan with only the region data for the region containing the nation named 'UNITED STATES'.
$ \hspace{16pt} us\_region\_name = project(us\_region, 'name') $
This expresses that us_region_name
is the result of using projection with the clan us_region
and the diagonal with element 'name'. This extracts the region name of the region containing the nation named 'UNITED STATES'.
In the implementation we start with a clan created using a set with one relation with one couplet having the left component 'name' and the right component 'UNITED STATES'. This is needed as a clan because it is used in superstriction, a clan operation, with the clan region_key_nation_name_pairs
. The superstriction is used to extract the region key and nation name which includes the nation named 'UNITED STATES'. From the resulting clan we project out the region keys and use them in a superstriction with regions
to extract the region corresponding to the region key. Lastly we project out the name for the region that includes the nation named 'UNITED STATES': 'AMERICA'.
algebraixlib
Implementation
from algebraixlib.mathobjects.couplet import Couplet
us_nation_name = from_dict({'name': 'UNITED STATES'})
us_region_key_nation_name_pair = superstrict(region_key_nation_name_pairs_accumulator, us_nation_name)
us_region_key = project(us_region_key_nation_name_pair, 'regionkey')
us_region = superstrict(regions, us_region_key)
us_region_name = project(us_region, 'name')
print('us_region_name:\n' + mo_to_str(us_region_name))
us_region_name: Set({ Set({ Couplet(left=Atom('name'), right=Atom('AMERICA')) }) })
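The chain of superstriction and projection in this last query can be followed step by step in a plain-Python model. As with any such sketch, this is an illustration under stated assumptions and not the algebraixlib implementation: superstrict and project are simplified stand-ins, and the sample clans contain only a few hand-picked relations.

```python
# Pure-Python model, NOT the algebraixlib API.

def superstrict(clan, filter_clan):
    """Keep the relations of `clan` that are supersets of some relation in `filter_clan`."""
    return frozenset(rel for rel in clan if any(rel >= wanted for wanted in filter_clan))

def project(clan, *lefts):
    """In every relation, keep only the couplets whose left component is in `lefts`."""
    return frozenset(frozenset(c for c in rel if c[0] in lefts) for rel in clan)

# A few region key / nation name pairs, as produced by the previous query.
pairs = frozenset({
    frozenset({("regionkey", 1), ("name", "UNITED STATES")}),
    frozenset({("regionkey", 1), ("name", "BRAZIL")}),
    frozenset({("regionkey", 0), ("name", "KENYA")}),
})

regions = frozenset({
    frozenset({("name", "AMERICA"), ("regionkey", 1),
               ("nation", frozenset({("name", "UNITED STATES")}))}),
    frozenset({("name", "AFRICA"), ("regionkey", 0),
               ("nation", frozenset({("name", "KENYA")}))}),
})

us_nation_name = frozenset({frozenset({("name", "UNITED STATES")})})
us_pair = superstrict(pairs, us_nation_name)      # the one pair naming UNITED STATES
us_region_key = project(us_pair, "regionkey")     # {{regionkey -> 1}}
us_region = superstrict(regions, us_region_key)   # the region relation with that key
us_region_name = project(us_region, "name")       # {{name -> 'AMERICA'}}

print(sorted(dict(rel)["name"] for rel in us_region_name))  # ['AMERICA']
```

The final projection only looks at the top-level couplets of the region relation, so the nested nation name 'UNITED STATES' is not returned alongside the region name.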
Copyright (c) 2022 Permission.io, Inc. (formerly known as Algebraix Data Corporation).
This file is part of algebraixlib
.
algebraixlib
is free software: you can redistribute it and/or modify it under the terms of version 3 of the GNU Lesser General Public License as published by the Free Software Foundation.
algebraixlib
is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with algebraixlib
. If not, see GNU licenses.