
`algebraixlib` Data Algebra Can Model XML¶

This IPython notebook demonstrates that data algebra can be used to query XML. Here you will find a few simple examples of how to use algebraixlib to create data algebra queries that are equivalent to XPath and XQuery queries.

We will use the XML file regions.xml, which is a transformation and aggregation of the SQL TPC-H (http://www.tpc.org/tpch/) region and nation tables into an XML document. A subset of the file follows:

In [1]:

from algebraixlib.io.xml import xml_to_str
from algebraixlib.util.miscellaneous import core
print(core(xml_to_str('TPC-H_Query5/regions.xml'), 985, 11))

<?xml version="1.0" ?>
<regions>
   <region>
      <name>AFRICA</name>
      <regionkey>0</regionkey>
      <comment>lar deposits. blithely final packages cajole. regular waters are final requests. regular a...</comment>
      <nation>
         <name>ALGERIA</name>
         <nationkey>0</nationkey>
         <comment> haggle. carefully final deposits detect slyly agai</comment>
      </nation>
      <nation>
         <name>ETHIOPIA</name>
         <nationkey>5</nationkey>
         <comment>ven packages wake quickly. regu</comment>
      </nation>
      <nation>
         <name>KENYA</name>
         <nationkey>14</nationkey>
         <comment> pending excuses haggle furiously deposits. pending, express pinto beans wake fluffily past t</comment>
      </nation>
      <nation>
         <name>MOROCCO</name>
         <nationkey>15</nationkey>
         <comment>rns. blithely bold courts among the closely regular packages use furiously bold platelets?</comment>
      </nation>
  
...
</regions>

XML Import¶

algebraixlib supports importing XML data and transforming it into data algebra MathObjects. This is done using the XML module's import_xml operation. The result of importing regions.xml using import_xml is a relation of relations: a relation whose right components are themselves relations, nested to match the XML element hierarchy.

Here's a subset of regions.xml shown in data algebra representation. Note that, for simplicity's sake, the MathObjects used in this notebook do not maintain order or multiplicity. For this reason, the elements of the following MathObjects may not appear in the same order as in the XML document, and duplicates are not preserved.

In [2]:

from algebraixlib.util.mathobjectprinter import mo_to_str
from algebraixlib.io.xml import import_xml

regions_document = import_xml('TPC-H_Query5/regions.xml', convert_numerics=True)
print('regions_document:\n' + core(mo_to_str(regions_document), 1810, 3))

regions_document:
Set({
   Couplet(left=Atom('regions'), right=
      Set({
         Couplet(left=Atom('region'), right=
            Set({
               Couplet(left=Atom('nation'), right=
                  Set({
                     Couplet(left=Atom('nationkey'), right=Atom(4))
                     Couplet(left=Atom('name'), right=Atom('EGYPT'))
                     Couplet(left=Atom('comment'), right=Atom('y above the carefully unusu...)
                  })
               )
               Couplet(left=Atom('name'), right=Atom('MIDDLE EAST'))
               Couplet(left=Atom('comment'), right=Atom('uickly special accounts cajole ca...)
               Couplet(left=Atom('nation'), right=
                  Set({
                     Couplet(left=Atom('nationkey'), right=Atom(20))
                     Couplet(left=Atom('name'), right=Atom('SAUDI ARABIA'))
                     Couplet(left=Atom('comment'), right=Atom('ts. silent requests haggle....)
                  })
               )
               Couplet(left=Atom('regionkey'), right=Atom(4))
               Couplet(left=Atom('nation'), right=
                  Set({
                     Couplet(left=Atom('comment'), right=Atom('efully alongside of the sly...)
                     Couplet(left=Atom('nationkey'), right=Atom(10))
                     Couplet(left=Atom('name'), right=Atom('IRAN'))
                  })
               )
               Couplet(left=Atom('nation'), right=
                  Set({
                     Couplet(left=Atom('nationkey'), right=Atom(13))
                     Couplet(left=Atom('comment'), right=Atom('ic deposits are blithely ab...)
                     Couplet(left=Atom('name'), right=Atom('JORDAN'))
                  })
               )
               Couplet(left=Atom('nation'), right=
                  Set(
...
})
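
The shape of this import result can be modelled in plain Python. The following is a minimal sketch, not the algebraixlib implementation: a couplet is modelled as a `(left, right)` tuple and a relation as a `frozenset` of such tuples. The helper name `element_to_couplet` and the small `SAMPLE` document are assumptions made for illustration, and the integer conversion only loosely imitates `convert_numerics=True`.

```python
import xml.etree.ElementTree as ET


def element_to_couplet(element):
    """Model an XML element as a (left, right) tuple.

    The left is the tag; the right is the element text for a leaf element,
    or a frozenset of child couplets for an element with children.
    """
    children = list(element)
    if not children:
        text = (element.text or '').strip()
        # Simplified stand-in for numeric conversion: digit-only text becomes int.
        right = int(text) if text.isdigit() else text
        return (element.tag, right)
    return (element.tag, frozenset(element_to_couplet(child) for child in children))


# Hypothetical, heavily shortened sample in the shape of regions.xml.
SAMPLE = """<regions>
  <region>
    <name>AFRICA</name>
    <regionkey>0</regionkey>
    <nation><name>ALGERIA</name><nationkey>0</nationkey></nation>
    <nation><name>KENYA</name><nationkey>14</nationkey></nation>
  </region>
</regions>"""

# The document is a relation with a single couplet for the root element.
document = frozenset({element_to_couplet(ET.fromstring(SAMPLE))})
```

Because sets are used throughout, this model also loses element order and multiplicity, just like the MathObjects printed above.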

XPath Basics¶

Regions¶

Now that we have the XML document loaded, let's start with a very simple XPath statement to extract all of the regions' data:

/regions/region

This is translated using data algebra notation into the following:

$ \hspace{16pt} regions = \{right(R)\ :\ R \in regions\_document(regions) \text{ and } left(R) = region \} $

This expresses that regions is obtained in two steps: first we retrieve from regions_document the right component associated with the left component regions, then we retrieve all of the right components associated with the left component region.

Because XML requires a single root element, a well-formed XML document is naturally a function at the top level. So we can treat regions_document as a functional MathObject and use the function operator (), which retrieves the right component associated with a left component and is only available on functional MathObjects. Using this operator on regions_document with the value regions, we retrieve the relation containing all of the regions' data.

From this resulting relation we can extract all of its components into a set using the bracket operator []. The bracket operator, used with a relation or clan, retrieves the right components associated with a left component. Note the distinction: the function operator is used on a function to extract a single right component, while the bracket operator is used on a relation or clan to retrieve a set of all of the matching right components. So using the bracket operator with the value region on the result of regions_document('regions'), a relation, extracts each of the separate regions' data into a set.

`algebraixlib` Implementation¶

In [3]:

regions = regions_document('regions')['region']
print('regions:\n' + core(mo_to_str(regions), 2000, 8))

regions:
Set({
   Set({
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('nationkey'), right=Atom(4))
            Couplet(left=Atom('name'), right=Atom('EGYPT'))
            Couplet(left=Atom('comment'), right=Atom('y above the carefully unusual theodo...)
         })
      )
      Couplet(left=Atom('name'), right=Atom('MIDDLE EAST'))
      Couplet(left=Atom('comment'), right=Atom('uickly special accounts cajole carefully b...)
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('nationkey'), right=Atom(20))
            Couplet(left=Atom('name'), right=Atom('SAUDI ARABIA'))
            Couplet(left=Atom('comment'), right=Atom('ts. silent requests haggle. closely ...)
         })
      )
      Couplet(left=Atom('regionkey'), right=Atom(4))
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('comment'), right=Atom('efully alongside of the slyly final ...)
            Couplet(left=Atom('nationkey'), right=Atom(10))
            Couplet(left=Atom('name'), right=Atom('IRAN'))
         })
      )
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('nationkey'), right=Atom(13))
            Couplet(left=Atom('comment'), right=Atom('ic deposits are blithely about the c...)
            Couplet(left=Atom('name'), right=Atom('JORDAN'))
         })
      )
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('comment'), right=Atom('nic deposits boost atop the quickly ...)
            Couplet(left=Atom('name'), right=Atom('IRAQ'))
            Couplet(left=Atom('nationkey'), right=Atom(11))
         })
      )
   })
   Set({
      Couplet(left=Atom('nation'), right=
         Set({
            Couplet(left=Atom('name'), right=Atom('BRAZIL'))
            Couplet(left=Atom('nationkey'), right=Atom(2))
            Couplet(left=Atom('comment'), right=Atom('y alongside of the pending deposits....)
         })
      )
     
...
  })
})
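
To make the difference between the function operator and the bracket operator concrete, here is a small plain-Python model. It is a sketch under stated assumptions, not algebraixlib code: couplets are `(left, right)` tuples, relations are `frozenset`s of them, and the helper names `apply_function` and `bracket` are hypothetical. Raising an error for a non-functional lookup is a choice of this sketch only.

```python
def apply_function(relation, left):
    """Model of the function operator (): the single right paired with `left`."""
    rights = {right for (candidate, right) in relation if candidate == left}
    if len(rights) != 1:
        # In this sketch a lookup is only valid when exactly one right exists.
        raise ValueError('relation is not functional at %r' % (left,))
    return next(iter(rights))


def bracket(relation, left):
    """Model of the bracket operator [] on a relation: all rights paired with `left`."""
    return frozenset(right for (candidate, right) in relation if candidate == left)


# Shortened sample data in the shape of the imported document.
africa = frozenset({('name', 'AFRICA'), ('regionkey', 0)})
america = frozenset({('name', 'AMERICA'), ('regionkey', 1)})
document = frozenset({
    ('regions', frozenset({('region', africa), ('region', america)})),
})

# /regions/region: one functional lookup, then one bracket lookup.
regions = bracket(apply_function(document, 'regions'), 'region')
```

The top level has exactly one `regions` couplet, so the functional lookup succeeds; the relation below it has one `region` couplet per region, so only the bracket lookup applies there.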

Region Keys¶

Next we process an XPath statement to retrieve all of the region keys.

XPath Statement¶

/regions/region/regionkey

Data Algebra Translation¶

$ \hspace{16pt} region\_keys = \{right(R)\ :\ R \in regions \text{ and } left(R) = regionkey \} $

This expresses that region_keys is the result of retrieving all of the right components from the relations in regions where regionkey is a left component.

In the implementation we use the square bracket operator to retrieve all of the right components for the regionkey left components from the relations in the regions clan. This results in a set of atoms, each atom a region key, for all of the region keys.

`algebraixlib` Implementation¶

In [4]:

region_keys = regions['regionkey']
print('region_keys: ' + mo_to_str(region_keys))

region_keys: Set({Atom(4), Atom(0), Atom(1), Atom(2), Atom(3)})

Nation Names¶

Next we process an XPath statement to retrieve all of the nation names.

XPath Statement¶

/regions/region/nation/name

Data Algebra Translation¶

$ \hspace{16pt} region\_nations = \{right(cp)\ :\ R \in regions \text{ and } cp \in R \text{ and } left(cp) = nation \} $

This expresses that region_nations is the result of retrieving all of the relations from regions, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value nation.

$ \hspace{16pt} nation\_names = \{right(cp)\ :\ R \in region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} $

This expresses that nation_names is the result of retrieving all of the relations from region_nations, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value name.

In the implementation we use the square bracket operator twice: first to retrieve a clan of all of the right components where the left components of the couplets in the relations in regions have the value nation, and then to retrieve from that clan a set of all of the right components where the left components of the couplets in the relations have the value name. This results in a set of atoms where each atom is a nation name.

`algebraixlib` Implementation¶

In [5]:

nation_names = regions['nation']['name']
print('nation_names:\n' + mo_to_str(nation_names))

nation_names:
Set({Atom('JAPAN'), Atom('EGYPT'), Atom('MOROCCO'), Atom('ETHIOPIA'), Atom('MOZAMBIQUE'), Atom('ARGENTINA'), Atom('INDONESIA'), Atom('SAUDI ARABIA'), Atom('KENYA'), Atom('IRAN'), Atom('JORDAN'), Atom('GERMANY'), Atom('ALGERIA'), Atom('CANADA'), Atom('FRANCE'), Atom('INDIA'), Atom('BRAZIL'), Atom('CHINA'), Atom('UNITED KINGDOM'), Atom('PERU'), Atom('UNITED STATES'), Atom('ROMANIA'), Atom('VIETNAM'), Atom('RUSSIA'), Atom('IRAQ')})
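
The two set-builder expressions above map directly onto Python comprehensions. The following sketch models a clan as a `frozenset` of relations and a relation as a `frozenset` of `(left, right)` tuples; the helper name `clan_bracket` and the shortened sample data are assumptions for illustration, not part of algebraixlib.

```python
def clan_bracket(clan, left):
    """Model of [] on a clan: every right whose couplet has the given left,
    taken from every relation in the clan."""
    return frozenset(
        right
        for relation in clan
        for (candidate, right) in relation
        if candidate == left
    )


# Shortened sample regions clan.
algeria = frozenset({('name', 'ALGERIA')})
kenya = frozenset({('name', 'KENYA')})
brazil = frozenset({('name', 'BRAZIL')})
africa = frozenset({
    ('name', 'AFRICA'), ('regionkey', 0), ('nation', algeria), ('nation', kenya),
})
america = frozenset({('name', 'AMERICA'), ('regionkey', 1), ('nation', brazil)})
regions = frozenset({africa, america})

# /regions/region/regionkey
region_keys = clan_bracket(regions, 'regionkey')

# /regions/region/nation/name: the first lookup yields a clan of nation
# relations, the second collects the names inside those relations.
nation_names = clan_bracket(clan_bracket(regions, 'nation'), 'name')
```

Region names such as AFRICA do not appear in `nation_names`, because the second lookup only inspects the nation relations returned by the first.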

XQuery Basics¶

All Region Related Nation Names¶

Let's step it up and retrieve all nation names for a particular region.

XQuery Expression¶

for $x in doc("regions.xml")/regions/region where $x/name='AMERICA' return $x/nation/name

Data Algebra Translation¶

$ \hspace{16pt} america\_region = regions \blacktriangleright \{\{name{\mapsto}\text{'AMERICA'}\}\} $

This expresses that america_region is the result of using superstriction on the clan regions with a clan that has a single relation with a single couplet with the left component 'name' and the right component 'AMERICA'.

$ \hspace{16pt} america\_region\_nations = \{right(cp)\ :\ R \in america\_region \text{ and } cp \in R \text{ and } left(cp) = nation \} $

This expresses that america_region_nations is the result of retrieving all of the relations from america_region, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value nation.

$ \hspace{16pt} america\_nations\_names = \{right(cp)\ :\ R \in america\_region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} $

This expresses that america_nations_names is the result of retrieving all of the relations from america_region_nations, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value name.

In our implementation we first extract the region data from all regions with the name 'AMERICA' so we can isolate the region data used for the next operation. To do so we use the superstrict operation, which keeps only those relations in regions that are a superset of some relation in a second clan, in this case a clan with a single relation mapping 'name' to 'AMERICA'. From that result we use the square bracket operator twice: first to retrieve a clan of all of the right components where the left components of the couplets in the relations in america_region have the value nation, and then to retrieve from that clan a set of all of the right components where the left components of the couplets in the relations have the value name. This results in a set of atoms, each atom a nation name, comprising all of the nation names from the region with the name 'AMERICA'.

`algebraixlib` Implementation¶

In [6]:

from algebraixlib.algebras.clans import superstrict, from_dict
america_region_name = from_dict({'name': 'AMERICA'})
america_region = superstrict(regions, america_region_name)
america_nations_names = america_region['nation']['name']
print('america_nations_names:\n' + mo_to_str(america_nations_names))

america_nations_names:
Set({Atom('CANADA'), Atom('ARGENTINA'), Atom('UNITED STATES'), Atom('PERU'), Atom('BRAZIL')})
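
Superstriction on clans can be read as a filter: keep every relation of the first clan that is a superset of at least one relation of the second clan. The sketch below models that reading in plain Python with `frozenset`s of `(left, right)` tuples. The functions here are illustrative stand-ins written for this example, not the library's `superstrict`, and the sample data is shortened.

```python
def superstrict(clan, filter_clan):
    """Keep each relation in `clan` that is a superset of some relation in `filter_clan`."""
    return frozenset(
        relation
        for relation in clan
        if any(wanted <= relation for wanted in filter_clan)
    )


def clan_bracket(clan, left):
    """All rights paired with `left` in any relation of the clan."""
    return frozenset(
        right
        for relation in clan
        for (candidate, right) in relation
        if candidate == left
    )


canada = frozenset({('name', 'CANADA')})
brazil = frozenset({('name', 'BRAZIL')})
kenya = frozenset({('name', 'KENYA')})
america = frozenset({
    ('name', 'AMERICA'), ('regionkey', 1), ('nation', canada), ('nation', brazil),
})
africa = frozenset({('name', 'AFRICA'), ('regionkey', 0), ('nation', kenya)})
regions = frozenset({america, africa})

# The "where $x/name='AMERICA'" clause as a one-relation, one-couplet clan.
america_region_name = frozenset({frozenset({('name', 'AMERICA')})})
america_region = superstrict(regions, america_region_name)
america_nations_names = clan_bracket(clan_bracket(america_region, 'nation'), 'name')
```

The subset test `wanted <= relation` plays the role of the XQuery `where` clause: a region passes when it contains the couplet mapping 'name' to 'AMERICA'.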

All Region Key and Nation Name Pairs¶

Now let's retrieve the nation names and their corresponding region key as pairs. This is more complicated because each region must be evaluated individually to determine the region key and nation names which are then aggregated.

XQuery Expression¶

for $x in doc("regions.xml")/regions/region/nation/name return <pair>{$x/../../regionkey}{$x}</pair>

Data Algebra Translation¶

Note: the following steps are performed once for each region relation in the regions clan.

$ \hspace{16pt} region\_key = project(\{region\}, 'regionkey') $

This expresses that region_key is the result of using projection with the relation region that has been turned into a clan (by embedding it into a set using curly braces '{}') and 'regionkey'. This provides the region key for this region.

$ \hspace{16pt} region\_nations = \{right(R)\ :\ R \in region \text{ and } left(R) = nation \} $

This expresses that region_nations is the result of using the bracket operator to retrieve all of the right components from region where nation is a left component. This extracts the nation data out of the region data.

$ \hspace{16pt} region\_nation\_names = project(region\_nations, 'name') $

This expresses that region_nation_names is the result of using projection with the clan region_nations and 'name'. This produces a clan containing the nation names for this region.

$ \hspace{16pt} region\_key\_nation\_name\_pairs = region\_key \blacktriangledown region\_nation\_names $

This expresses that region_key_nation_name_pairs is the result of using cross-union with region_key and region_nation_names. This results in a clan which contains each nation name and its region key.

$ \hspace{16pt} region\_key\_nation\_name\_pairs\_accumulator = region\_key\_nation\_name\_pairs\_accumulator \cup region\_key\_nation\_name\_pairs $

This expresses that region_key_nation_name_pairs_accumulator is the result of its union with region_key_nation_name_pairs. This produces a new clan that is the accumulation of all of the nation name and region key pairs.

In the implementation we iterate over the region relations in the regions clan. We insert each region relation into a set to make it a clan and use projection with 'regionkey' to retrieve this region's region key as a clan. Next we use the bracket operator with region to retrieve a clan of all of the nations' data for this region. Then we reduce that data to a clan of nation name couplets by way of projection and use cross-union to pair the region key clan with the nation name clan. This pairs each nation name couplet with the region key couplet, forming a clan of this region's region key and nation name pairs. Finally, this clan is unioned into a value external to the loop, which accumulates the region key and nation name pairs generated during each iteration.

`algebraixlib` Implementation¶

In [7]:

from algebraixlib.mathobjects import Set
from algebraixlib.algebras.clans import project, cross_union
from algebraixlib.algebras.sets import union
region_key_nation_name_pairs_accumulator = Set()
for region in regions:
    region_key = project(Set(region), 'regionkey')
    region_nations = region['nation']
    region_nation_names = project(region_nations, 'name')
    region_key_nation_name_pairs = cross_union(region_key, region_nation_names)
    region_key_nation_name_pairs_accumulator = union(region_key_nation_name_pairs_accumulator, region_key_nation_name_pairs)
    
print('region key nation name pairs:\n' + core(mo_to_str(region_key_nation_name_pairs_accumulator), 600, 8))

region key nation name pairs:
Set({
   Set({
      Couplet(left=Atom('regionkey'), right=Atom(0))
      Couplet(left=Atom('name'), right=Atom('KENYA'))
   })
   Set({
      Couplet(left=Atom('name'), right=Atom('BRAZIL'))
      Couplet(left=Atom('regionkey'), right=Atom(1))
   })
   Set({
      Couplet(left=Atom('name'), right=Atom('UNITED KINGDOM'))
      Couplet(left=Atom('regionkey'), right=Atom(3))
   })
   Set({
      Couplet(left=Atom('name'), right=Atom('ARGENTINA'))
      Couplet(left=Atom('regionkey'), right=Atom(1))
   })
   Set({
      Couplet(left=Atom('regionkey'), right=Atom(2))
      Couplet(left=Atom('name'
...
  })
})
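
Projection and cross-union, as they are used in the loop above, can be modelled with two short comprehensions. This is a plain-Python sketch under the assumption that projection keeps only the couplets with the listed left components and that cross-union forms the union of every pair of relations drawn from the two clans. The functions are illustrative stand-ins for this example rather than the library's own, and the sample data is shortened.

```python
def project(clan, *lefts):
    """Keep, in every relation of the clan, only the couplets whose left is listed."""
    return frozenset(
        frozenset(couplet for couplet in relation if couplet[0] in lefts)
        for relation in clan
    )


def cross_union(clan_a, clan_b):
    """Union every relation of clan_a with every relation of clan_b."""
    return frozenset(a | b for a in clan_a for b in clan_b)


canada = frozenset({('name', 'CANADA')})
brazil = frozenset({('name', 'BRAZIL')})
kenya = frozenset({('name', 'KENYA')})
america = frozenset({
    ('name', 'AMERICA'), ('regionkey', 1), ('nation', canada), ('nation', brazil),
})
africa = frozenset({('name', 'AFRICA'), ('regionkey', 0), ('nation', kenya)})
regions = frozenset({america, africa})

pairs_accumulator = frozenset()
for region in regions:
    # A one-relation clan holding just this region's key couplet.
    region_key = project(frozenset({region}), 'regionkey')
    # Bracket lookup on a single relation: all nation relations of this region.
    region_nations = frozenset(right for (left, right) in region if left == 'nation')
    region_nation_names = project(region_nations, 'name')
    # One key relation crossed with n name relations gives n pair relations.
    pairs_accumulator |= cross_union(region_key, region_nation_names)
```

Because `region_key` contains exactly one relation per iteration, the cross-union attaches that region's key to each of its nation names and never mixes keys across regions.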

All Regions With Nation Named "UNITED STATES"¶

For our final query we retrieve the region name of a region which includes a particular nation name.

XQuery Expression¶

for $x in doc("regions.xml")/regions/region/nation[name="UNITED STATES"] return $x/../name

Data Algebra Translation¶

We can build upon our last example by reusing its resulting collection of region key and nation name pairs.

$ \hspace{16pt} us\_region\_key\_nation\_name\_pair = region\_key\_nation\_name\_pairs\_accumulator \blacktriangleright \{\{'name'{\mapsto}\text{'UNITED STATES'}\}\} $

This expresses that us_region_key_nation_name_pair is the result of using superstriction with region_key_nation_name_pairs_accumulator and the clan containing a single relation with a single couplet, which has 'name' for the left component and 'UNITED STATES' for the right component. This has the effect of eliminating all of the region key and nation name pairs except the one whose nation name is 'UNITED STATES'.

$ \hspace{16pt} us\_region\_key = project(us\_region\_key\_nation\_name\_pair, 'regionkey') $

This expresses that us_region_key is the result of using projection with the clan us_region_key_nation_name_pair and the diagonal with the element 'regionkey'. This produces a clan containing the region key associated with the nation named 'UNITED STATES'.

$ \hspace{16pt} us\_region = regions \blacktriangleright us\_region\_key $

This expresses that us_region is the result of using superstriction with regions and the clan us_region_key. This produces a clan with only the region data for the region containing the nation named 'UNITED STATES'.

$ \hspace{16pt} us\_region\_name = project(us\_region, 'name') $

This expresses that us_region_name is the result of using projection with the clan us_region and the diagonal with element 'name'. This extracts the region name of the region containing the nation named 'UNITED STATES'.

In the implementation we start with a clan created from a set with one relation with one couplet having the left component 'name' and the right component 'UNITED STATES'. It must be a clan because it is used in superstriction, a clan operation, with the clan region_key_nation_name_pairs_accumulator. The superstriction extracts the region key and nation name pair for the nation named 'UNITED STATES'. From the resulting clan we project out the region key and use it in a superstriction with regions to extract the region corresponding to that region key. Lastly we project out the name of the region that includes the nation named 'UNITED STATES': 'AMERICA'.

`algebraixlib` Implementation¶

In [8]:

from algebraixlib.mathobjects.couplet import Couplet
us_nation_name = from_dict({'name': 'UNITED STATES'})
us_region_key_nation_name_pair = superstrict(region_key_nation_name_pairs_accumulator, us_nation_name)
us_region_key = project(us_region_key_nation_name_pair, 'regionkey')
us_region = superstrict(regions, us_region_key)
us_region_name = project(us_region, 'name')
print('us_region_name:\n' + mo_to_str(us_region_name))

us_region_name:
Set({
   Set({
      Couplet(left=Atom('name'), right=Atom('AMERICA'))
   })
})
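
The four steps of this query can be traced end to end on a small example. The sketch below is a plain-Python model, not algebraixlib code: relations are `frozenset`s of `(left, right)` tuples, `project` and `superstrict` are illustrative stand-ins written for this example, and the pairs clan is built directly with a comprehension instead of the loop from the previous section.

```python
def project(clan, *lefts):
    """Keep, in every relation of the clan, only the couplets whose left is listed."""
    return frozenset(
        frozenset(couplet for couplet in relation if couplet[0] in lefts)
        for relation in clan
    )


def superstrict(clan, filter_clan):
    """Keep each relation in `clan` that is a superset of some relation in `filter_clan`."""
    return frozenset(
        relation
        for relation in clan
        if any(wanted <= relation for wanted in filter_clan)
    )


usa = frozenset({('name', 'UNITED STATES')})
brazil = frozenset({('name', 'BRAZIL')})
kenya = frozenset({('name', 'KENYA')})
america = frozenset({
    ('name', 'AMERICA'), ('regionkey', 1), ('nation', usa), ('nation', brazil),
})
africa = frozenset({('name', 'AFRICA'), ('regionkey', 0), ('nation', kenya)})
regions = frozenset({america, africa})

# Region key / nation name pairs, one two-couplet relation per nation.
pairs = frozenset(
    frozenset({('regionkey', key), ('name', nation_name)})
    for region in regions
    for (left_key, key) in region if left_key == 'regionkey'
    for (left_nation, nation) in region if left_nation == 'nation'
    for (left_name, nation_name) in nation if left_name == 'name'
)

us_nation_name = frozenset({frozenset({('name', 'UNITED STATES')})})
us_pair = superstrict(pairs, us_nation_name)       # the pair for UNITED STATES
us_region_key = project(us_pair, 'regionkey')      # just its region key couplet
us_region = superstrict(regions, us_region_key)    # the region with that key
us_region_name = project(us_region, 'name')        # the region's own name couplet
```

The final projection returns the region's name and not a nation name, because the nation names live inside nested relations and are not couplets of the region relation itself.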

This file is part of algebraixlib.

algebraixlib is free software: you can redistribute it and/or modify it under the terms of version 3 of the GNU Lesser General Public License as published by the Free Software Foundation.

algebraixlib is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with algebraixlib. If not, see GNU licenses.
