#!/usr/bin/env python # coding: utf-8 # # `algebraixlib` Data Algebra Can Model XML # # This IPython notebook demonstrates that data algebra can be used to query XML. Here you will find a few simple examples of how to use `algebraixlib` to create data algebra queries that are equivalent to XPath and XQuery queries. # # We will use the XML file `regions.xml` which is a transformation and aggregation of SQL [TPC-H][] `region` and `nation` tables into an XML document. A subset of the file follows: # [TPC-H]: (TPC-H Benchmark Main Page) # In[1]: from algebraixlib.io.xml import xml_to_str from algebraixlib.util.miscellaneous import core print(core(xml_to_str('TPC-H_Query5/regions.xml'), 985, 11)) # ## XML Import # # `algebriaxlib` supports importing XML data and transforming it into data algebra [MathObject][]s. This is done using the XML module's `import_xml` operation. The result of importing `regions.xml` using `import_xml` is a [relation][] of relations. # # Here's a subset of `regions.xml` shown in data algebra representation. Note that, for simplicity sake, the MathObjects used in this notebook do not maintain order or multiplicity. For this reason elements of the following MathObjects may not be in the same order as the XML document or have duplicates. # # [MathObject]: # # [relation]: # In[2]: from algebraixlib.util.mathobjectprinter import mo_to_str from algebraixlib.io.xml import import_xml regions_document = import_xml('TPC-H_Query5/regions.xml', convert_numerics=True) print('regions_document:\n' + core(mo_to_str(regions_document), 1810, 3)) # ##XPath basics # ### Regions # # Now that we have the XML document loaded, let's start with a very simple XPath statement to extract all of the regions' data: # # ```shell # /regions/region # ``` # # This is translated using data algebra notation into the following: # # $ # \hspace{16pt} regions = \{right(R)\ :\ R \in regions\_document(regions) \text{ and } left(R) = region \} # $ # # This expresses that `regions` is the result of first retrieving from `regions_document` the [right component][] associated with the [left component][] with the value `regions` followed by retrieving all of the right components associated with the left component with the value `region`. # # Because XML requires a root element, properly formed XML documents are naturally a [function][] at the top level. So we can refer to `regions_document` as a [Functional][] MathObject and we can use the function operator `()` which allows retrieving a [right component][] associated with a [left component][] but only on Functional MathObjects. Using this operator on `regions_document` with the value `regions` we can retrieve the relation containing all of the regions' data. # # From this resulting relation we can extract all of it's components into a [set][] using the bracket operator `[]`. The bracket operator, used with a relation or [clan][], allows retrieving the right components associated with a left. Note the distinction here, the function operator is used on a function to extract one right component while the bracket operator is used on a relation or clan to retrieve a set of all of the right components. So using the bracket operator on the result of `regions_document('regions')`, a relation, with the value `region` extracts each of the separate regions' data into a set. # # ####`algebraixlib` implementation: # # [set]: # [clan]: # [relation]: # [function]: # [Functional]: # [right component]: # [left component]: # In[3]: regions = regions_document('regions')['region'] print('regions:\n' + core(mo_to_str(regions), 2000, 8)) # ### Region Keys # # Next we process an XPath Statement to retrieve all of the region keys. # # ####XPath Statement # # ```shell # /regions/region/regionkey # ``` # # ####Data Algebra Translation # # $ # \hspace{16pt} region\_keys = \{right(R)\ :\ R \in regions \text{ and } left(R) = regionkey \} # $ # # This expresses that `region_keys` is the result of retrieving all of the right components from the relations in `regions` where `regionkey` is a left component. # # In the implementation we use the square bracket operator to retrieve all of the right components for the `regionkey` left components from the relations in the `regions` clan. This results in a [set][] of [atom][]s, each atom a region key, for all of the region keys. # # ####`algebraixlib` Implementation # # [set]: # [atom]: # In[4]: region_keys = regions['regionkey'] print('region_keys: ' + mo_to_str(region_keys)) # ### Nation Names # # Next we process an XPath Statement to retrieve all of the nation names. # # ####XPath Statement # # ```shell # /regions/region/nation/name # ``` # # ####Data Algebra Translation # # $ # \hspace{16pt} region\_nations = \{right(cp)\ :\ R \in regions \text{ and } cp \in R \text{ and } left(cp) = nation \} # $ # # This expresses that `region_nations` is the result of retrieving all of the relations from `regions`, retrieving all of the couplets from the relations and then retrieving from the [couplet][]s all of the right components where the left component has the value `nation`. # # $ # \hspace{16pt} nation\_names = \{right(cp)\ :\ R \in region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} # $ # # This expresses that `nation_names` is the result of retrieving all of the relations from `region_nations`, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value `name`. # # In the implementation we use the square bracket operator twice. First to retrieve a clan of all of the right components where the left components of the couplets in the relations in `regions` have the value `nation`. Then from that clan retrieve a set of all of the right components where the left components of the couplets in the relations have the value `name`. This results in a set of atoms where each atom is a nation name. # # ####`algebraixlib` Implementation # [couplet]: # In[5]: nation_names = regions['nation']['name'] print('nation_names:\n' + mo_to_str(nation_names)) # ## XQuery Basics # ### All Region Related Nation Names # # Let's step it up and retrieve all nation names for a particular region. # # ####XQuery Expression # # ```shell # for $x in doc("regions.xml")/regions/region where $x/name='AMERICA' return $x/nation/name # ``` # # ####Data Algebra Translation # # $ # \hspace{16pt} america\_region = regions \blacktriangleright \{\{name{\mapsto}\text{'AMERICA'}\}\} # $ # # This expresses that `america_region` is the result from using [superstriction][] on the clan `regions` with a clan that has a single relation with a single couplet with the left component 'name' and the right component 'AMERICA'. # # $ # \hspace{16pt} america\_region\_nations = \{right(cp)\ :\ R \in america\_region \text{ and } cp \in R \text{ and } left(cp) = nation \} # $ # # This expresses that `america_region_nations` is the result of retrieving all of the relations from `america_region`, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value `nation`. # # $ # \hspace{16pt} america\_nations\_names = \{right(cp)\ :\ R \in america\_region\_nations \text{ and } cp \in R \text{ and } left(cp) = name \} # $ # # This expresses that `america_nation_names` is the result of retrieving all of the relations from `america_region_nations`, retrieving all of the couplets from the relations and then retrieving from the couplets all of the right components where the left component has the value `name`. # # In our implementation we first extract the region data from all regions with the name 'AMERICA' so we can isolate the region data used for the next operation. To do so we use the `superstrict` operation to retrieve only those relations that correspond to a clan superset, in this case a clan with a single relation mapping 'name' to 'AMERICA'. From that result we use the square bracket operator twice. First to retrieve a clan of all of the right components where the left components of the couplets in the relations in `america_region` have the value `nation`. Then from that clan retrieve a set of all of the right components where the left components of the couplets in the relations have the value `name`. This results in a set of atoms, each atom a nation name, comprising all of the nation names from the region with the name 'AMERICA' # # ####`algebraixlib` implementation: # # [superstriction]: # In[6]: from algebraixlib.algebras.clans import superstrict, from_dict america_region_name = from_dict({'name': 'AMERICA'}) america_region = superstrict(regions, america_region_name) america_nations_names = america_region['nation']['name'] print('america_nations_names:\n' + mo_to_str(america_nations_names)) # ### All Region Key and Nation Name Pairs # # Now let's retrieve the nation names and their corresponding region key as pairs. This is more complicated because each region must be evaluated individually to determine the region key and nation names which are then aggregated. # # ####XQuery Expression # # ```shell # for $x in doc("regions.xml")/regions/region/nation/name return {$x/../../regionkey}{$x} # ``` # # ####Data algebra translation: # # # **Note: the following are completed once for each `region` relation in the `regions` clan.** # # $ # \hspace{16pt} region\_key = project(\{region\}, 'regionkey'\} # $ # # This expresses that `region_key` is the result of using projection with the relation `region` that has been turned into a clan (by embedding it into a set using curly braces '{}') and 'regionkey'. This provides the region key for this region. # # $ # \hspace{16pt} region\_nations = \{right(R)\ :\ R \in region \text{ and } left(R) = nation \} # $ # # This expresses that `region_nations` is the result of using the bracket operator to retrieve all of the right components from `region` where `nation` is a left component. This extracts the nation data out of the region data. # # $ # \hspace{16pt} region\_nation\_names = project(region\_nations, 'name'\} # $ # # This expresses that `region_nation_names` is the result of using projection with the clan `region_nations` and 'name'. This produces a clan containing the nation names for this region. # # # $ # \hspace{16pt} region\_key\_nation\_name\_pairs = region\_key \blacktriangledown region\_nation\_names # $ # # This expresses that `region_key_nation_name_pairs` is the result of using [cross-union][] with `region_key` and `region_nation_names`. This results in a clan which contains each nation name and its region key. # # $ # \hspace{16pt} region\_key\_nation\_name\_pairs\_accumulator = region\_key\_nation\_name\_pairs\_accumulator \cup region\_key\_nation\_name\_pairs # $ # # This expresses that `region_key_nation_name_pairs_accumulator` is the result of its union with `region_key_nation_name_pairs`. This produces a new clan that is the accumulation of all of the nation name and region key pairs. # # In the implementation we iterate each of the `region` relations in the `regions` clan. For each `region` relation we insert it into a set in order to make it a clan and use projection with 'regionkey' to retrieve this region's `regionkey` as a clan. Next we use the bracket operator with `region` to retrieve a clan of all of the nations' data for this region. Then we reduce that data to a clan of nation name couplets by way of projection and use cross-union to pair the region key clan with the nation name clan. This pairs each nation name couplet with the region key couplet to form a clan with this region's region key and nation name as pairs. Finally the region key and nation name pairs clan is set unioned into a value external to the loop that accumulates all of the region key and nation name pairs' clans generated during each iteration. # # ####`algebraixlib` implementation: # # [diagonal]: # [cross-union]: # [set-union]: # [clans-project]: # In[7]: from algebraixlib.mathobjects import Set from algebraixlib.algebras.clans import project, cross_union from algebraixlib.algebras.sets import union region_key_nation_name_pairs_accumulator = Set() for region in regions: region_key = project(Set(region), 'regionkey') region_nations = region['nation'] region_nation_names = project(region_nations, 'name') region_key_nation_name_pairs = cross_union(region_key, region_nation_names) region_key_nation_name_pairs_accumulator = union(region_key_nation_name_pairs_accumulator, region_key_nation_name_pairs) print('region key nation name pairs:\n' + core(mo_to_str(region_key_nation_name_pairs_accumulator), 600, 8)) # ## All Regions With Nation Named "UNITED STATES" # # For our final query we retrieve the region name of a region which includes a particular nation name. # # ####XQuery Expression # # ```shell # for $x in doc("regions.xml")/regions/region/nation[name="UNITED STATES"] return $x/../name # ``` # # ####Data Algebra Translation # # We can build upon our last example by reusing its resulting collection of region key and nation name pairs. # # $ # \hspace{16pt} us\_region\_key\_nation\_name\_pair = region\_key\_nation\_name\_pairs \blacktriangleright \{\{'name'{\mapsto}\text{'UNITED STATES}\}\} # $ # # This expresses that `us_region_key_nation_name_pair` is the result of using [superstriction][] with `region_key_nation_name_pairs` and the clan with a single element with 'name' for the left component and 'UNITED STATES' for the right component. This has the effect of eliminating all of the region key and nation name pairs except the one with the nation name which is 'UNITED STATES'. # # $ # \hspace{16pt} us\_region\_key = project(us\_region\_key\_nation\_name\_pair, 'regionkey') # $ # # This expresses that `us_region_key` is the result of using projection with the clan `us_region_key_nation_name_pair` and the diagonal with the element 'regionkey'. This produces a clan containing the region key associated with the nation named 'UNITED STATES'. # # $ # \hspace{16pt} us\_region = regions \blacktriangleright us\_region\_key # $ # # This expresses that `us_region` is the result of using superstriction with `regions` and the clan `us_region_key`. This produces a clan with only the region data for the region containing the nation named 'UNITED STATES'. # # $ # \hspace{16pt} us\_region\_name = project(us\_region, 'name') # $ # # This expresses that `us_region_name` is the result of using projection with the clan `us_region` and the diagonal with element 'name'. This extracts the region name of the region containing the nation named 'UNITED STATES'. # # In the implementation we start with a clan created using a set with one relation with one couplet having the left component 'name' and the right component 'UNITED STATES'. This is needed as a clan because it is used in superstriction, a clan operation, with the clan `region_key_nation_name_pairs`. The superstriction is used to extract the region key and nation name which includes the nation named 'UNITED STATES'. From the resulting clan we project out the region keys and use them in a superstriction with `regions` to extract the region corresponding to the region key. Lastly we project out the name for the region that inclues the nation named 'UNITED STATES': 'AMERICA'. # # ####`algebraixlib` Implementation # # [`superstriction`]: # # In[8]: from algebraixlib.mathobjects.couplet import Couplet us_nation_name = from_dict({'name': 'UNITED STATES'}) us_region_key_nation_name_pair = superstrict(region_key_nation_name_pairs_accumulator, us_nation_name) us_region_key = project(us_region_key_nation_name_pair, 'regionkey') us_region = superstrict(regions, us_region_key) us_region_name = project(us_region, 'name') print('us_region_name:\n' + mo_to_str(us_region_name)) # ---- # © Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022. # # This file is part of [`algebraixlib`][] . # # [`algebraixlib`][] is free software: you can redistribute it and/or modify it under the terms of [version 3 of the GNU Lesser General Public License][] as published by the [Free Software Foundation][]. # # [`algebraixlib`][] is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. # # You should have received a copy of the GNU Lesser General Public License along with [`algebraixlib`][]. If not, see [GNU licenses][]. # # [`algebraixlib`]: (A Python library for data algebra) # [Version 3 of the GNU Lesser General Public License]: # [Free Software Foundation]: # [GNU licenses]: