This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified TPC-H query.
The tutorials assume basic familiarity with our library; we suggest working through our Hello_World tutorial first. They also assume basic knowledge of relational data, RDF and XML.
In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence:
We converted the tables region
and nation
into a single XML file regions.xml
. In this tutorial, we present our representation for hierarchical data (like [XML][] or [JSON][]), how to import it from XML and how to execute a simple data extraction against this data on the example of a simple XQuery request that is embedded as pseudo-subquery in our modified query 5:
(
-- This is a pseudo-subquery in XQuery. It extracts a list of nations
-- for the desired region, and for every nation it returns an XML 'row'
-- with the columns 'nationkey' and 'nationname' (renamed from 'name').
-- This forms an XML pseudo-'table' named 'nations' that is then joined
-- with 'customer'.
for $x in doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation
return <nation>{$x/nationkey}<nationname>{data($x/name)}</nationname></nation>
) AS nations```
This query creates a table-like XML structure with the columns `nationkey` and `nationname` (represented by the elements with the same name) in rows (represented by the elements with name `nation`).
[XML]: <http://www.w3.org/TR/REC-xml/> (Extensible Markup Language (XML) 1.0 (Fifth Edition))
[JSON]: <http://json.org/> (Introducing JSON)
We represent hierarchical data (like XML or JSON documents) as nested relations, where the right parts of the couplets in a relation are themselves relations.
An XML element is represented as a couplet, where the left part is the tag name and the right part is the tag content. If the content is only character data, it is represented as an atom in the right part. If the content contains child elements (or the tag contains attributes), it is represented as a set of couplets in the right part, where each such couplet represents a child element, an attribute or the character data.
This representation has a few limitations:
$
. When an element contains character data and doesn't have a child element or an attribute, the character data is represented as the right part of the couplet that represents the element. This results in different representations of character data, depending on other data in the element.Neither of these issues is a problem for our example. All these issues could be resolved, at the cost of a higher complexity of the representation.
regions
XML File¶The tables region
and nation
contain data of a type that is often represented as hierarchical data in an XML document. The world is divided into regions (with a key, a name and a descriptive comment), each of which contains nations (again with a key, a name and a descriptive comment). In the relational (tabular) data, the nations are associated with their region through a regionkey
column in the nation
table. In the hierarchical data, this is not necessary; children are already associated with their parents through the structure of the document.
We created an XML document regions.xml
that contains the data of these two tables, in a very straightforward way. The root node is <regions>
, it contains <region>
elements that contain the region data from the region
table and the associated nation data from the nation
table as <nation>
child elements.
First we import this document. This is the section doc("regions.xml")
in the XQuery statement.
algebraixlib.io.xml
provides utilities for processing XML data.convert_numerics
argument of the xml.import_xml
function tells it to convert character data that looks like a number into a numeric Python type. (Otherwise, it would be a string.)print_var
is a utility that prints our MathObject
s in a somewhat readable text format. The max_length=800
argument tells it to stop creating output at 800 characters; the append='\n'
argument tells it to append a line feed to the output (required for use in notebooks).import algebraixlib.io.xml as xml
from algebraixlib.util.miscellaneous import print_var
regions = xml.import_xml('regions.xml', convert_numerics=True)
print_var('regions', max_length=800, append='\n')
regions(len 1) = {('regions'->{('region'->{('comment'->'ges. thinly even pinto beans ca'), ('name'->'ASIA'), ('nation'->{('comment'->'c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos'), ('name'->'CHINA'), ('nationkey'->18)}), ('nation'->{('comment'->'hely enticingly express accounts. even, final'), ('name'->'VIETNAM'), ('nationkey'->21)}), ('nation'->{('comment'->'ously. final, express gifts cajole a'), ('name'->'JAPAN'), ('nationkey'->12)}), ('nation'->{('comment'->'slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull'), ('name'->'INDONESIA'), ('nationkey'->9)}), ('nation'->{('comment'->'ss excuses cajole slyly across the packages. deposits print aroun'), ('name'->'INDIA'), ('nationkey'->8)}), ('regionk...
From this data we extract a clan where every relation in the clan represents a region. This is the section doc("regions.xml")/regions/region
in the XQuery statement.
For this, we use two syntactic shortcuts we added to our MathObject
hierarchy: the (left)
and [left]
notations.
(left)
is only defined if the MathObject
is a functional relation. If it is, it returns the right part associated with left
.[left]
is defined if the MathObject
is a relation or a clan. In either case, it returns a set of all right parts associated with left
.Simplified, our XML content looks like this:
<regions>
<region>
...
</region>
<region>
...
</region>
</regions>
We translate this into the following:
$$ Regions = \{regions{\mapsto}\{region{\mapsto}\{...\}, region{\mapsto}\{...\}, ...\}\} $$With this data in the object regions
, the expression regions('regions')
returns the right part of the couplet with a left part regions
, which is a set that contains a number of couplets with left part region
. Applying to this the []
operator (regions('regions')['region']
) returns a set of all the right parts of the couplets with left region
. Since each of these right parts is a relation (a set of couplets) with the region data, the set of these relations is a clan, where each relation represents the data of a single region.
Or, in mathematical notation:
$$ Regions_{Clan} = \{right(R)\ :\ R \in Regions(regions) \text{ and } left(R) = region \} $$regions_clan = regions('regions')['region']
print_var('regions_clan', max_length=800, append='\n')
regions_clan(len 5) = {{('comment'->'ges. thinly even pinto beans ca'), ('name'->'ASIA'), ('nation'->{('comment'->'c dependencies. furiously express notornis sleep slyly regular accounts. ideas sleep. depos'), ('name'->'CHINA'), ('nationkey'->18)}), ('nation'->{('comment'->'hely enticingly express accounts. even, final'), ('name'->'VIETNAM'), ('nationkey'->21)}), ('nation'->{('comment'->'ously. final, express gifts cajole a'), ('name'->'JAPAN'), ('nationkey'->12)}), ('nation'->{('comment'->'slyly express asymptotes. regular deposits haggle slyly. carefully ironic hockey players sleep blithely. carefull'), ('name'->'INDONESIA'), ('nationkey'->9)}), ('nation'->{('comment'->'ss excuses cajole slyly across the packages. deposits print aroun'), ('name'->'INDIA'), ('nationkey'->8)}), ('regionkey'->2)}, {('comment'->'...
We are now back in 'clan land' and can use the same techniques we used in 2-Tables and more specifically 3-Graphs - The supplier Graph. To extract the data of a region with a given name, we use superstriction:
$$ Region_{Target} = Regions_{Clan} \blacktriangleright \{\{name{\mapsto}\text{'MIDDLE EAST'}\}\} $$This implements the section doc("regions.xml")/regions/region[name="MIDDLE EAST"]
of the XQuery statement.
algebraixlib.algebras.clans
contains the functions that are related to our algebra of clans.clans.superstrict
executes the clan superstriction ($\blacktriangleright$).iprint_latex
is a utility that prints our MathObject
s in LaTeX format in IPython notebooks. The short=True
argument tells it to create abbreviated output.import algebraixlib.algebras.clans as clans
from algebraixlib.util.latexprinter import iprint_latex
target_region = clans.superstrict(regions_clan, clans.from_dict({'name': 'MIDDLE EAST'}))
iprint_latex('target_region', short=True)
Again using the [left]
notation, we extract a set of all nation relations (a clan) from this.
In set notation:
$$ Nations = \{right(T)\ :\ T \in Region_{Target} \text{ and } left(T) = nation \} $$With this, we're up to doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation
of the XQuery statement.
nations_clan = target_region['nation']
iprint_latex('nations_clan', short=True)
We are only interested in the columns nationkey
and name
(not in comment
), and we want to rename name
to nationname
. All of this can be accomplished by a composition:
This then implements the full XQuery:
for $x in doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation
return <nation>{$x/nationkey}<nationname>{data($x/name)}</nationname></nation>
It also provides the output of the XQuery as a clan (which was the intention of the XQuery statement).
clans.compose
executes the clan composition ($\circ$).nations = clans.compose(nations_clan, clans.from_dict({'nationkey': 'nationkey', 'nationname': 'name'}))
iprint_latex('nations', short=True)
© Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022.
This file is part of algebraixlib
.
algebraixlib
is free software: you can redistribute it and/or modify it under the terms of version 3 of the GNU Lesser General Public License as published by the Free Software Foundation.
algebraixlib
is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with algebraixlib
. If not, see GNU licenses.