#!/usr/bin/env python # coding: utf-8 # # TPC-H Query 5 – Hierarchies # # This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified [TPC-H][] query. # # The tutorials assume basic familiarity with our library; we suggest working through our [Hello_World][] tutorial first. They also assume basic knowledge of relational data, RDF and XML. # # In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence: # # - [1-Introduction][]: Introduces this series of tutorials, TPC-H, the TPC-H query 5 and our modifications to it. # - [2-Tables][]: Introduces our representation of tabular data, on the example of CSV. # - [3-Graphs][]: Introduces our representation of RDF-style tabular graph data, on the example of Turtle. # - **[4-Hierarchies][] (this tutorial)**: Introduces our representation of hierarchical data, on the example of XML. # - [5-Query][]: Brings it all together and explains the whole query. # # [TPC-H]: (TPC-H Benchmark Main Page) # [Hello_World]: <../Hello_World.ipynb> (IPython Notebook: Hello World) # [1-Introduction]: <1-Introduction.ipynb> (IPython Notebook: TPC-H Query 5 - Introduction) # [2-Tables]: <2-Tables.ipynb> (IPython Notebook: TPC-H Query 5 - Tables) # [3-Graphs]: <3-Graphs.ipynb> (IPython Notebook: TPC-H Query 5 - Graphs) # [4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies) # [5-Query]: <5-Query.ipynb> (IPython Notebook: TPC-H Query 5 - Query) # ## XML Data In the Query # # We converted the tables `region` and `nation` into a single XML file `regions.xml`. In this tutorial, we present our representation for hierarchical data (like [XML][] or [JSON][]), how to import it from XML and how to execute a simple data extraction against this data on the example of a simple XQuery request that is embedded as pseudo-subquery in our modified query 5: # # ``` sql # ( # -- This is a pseudo-subquery in XQuery. It extracts a list of nations # -- for the desired region, and for every nation it returns an XML 'row' # -- with the columns 'nationkey' and 'nationname' (renamed from 'name'). # -- This forms an XML pseudo-'table' named 'nations' that is then joined # -- with 'customer'. # for $x in doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation # return {$x/nationkey}{data($x/name)} # ) AS nations``` # # This query creates a table-like XML structure with the columns `nationkey` and `nationname` (represented by the elements with the same name) in rows (represented by the elements with name `nation`). # # [XML]: (Extensible Markup Language (XML) 1.0 (Fifth Edition)) # [JSON]: (Introducing JSON) # ## Representation of Hierarchical Data # # We represent hierarchical data (like [XML][] or [JSON][] documents) as nested relations, where the right parts of the couplets in a relation are themselves relations. # # An XML element is represented as a couplet, where the left part is the tag name and the right part is the tag content. If the content is only character data, it is represented as an atom in the right part. If the content contains child elements (or the tag contains attributes), it is represented as a set of couplets in the right part, where each such couplet represents a child element, an attribute or the character data. # # This representation has a few limitations: # # - We do not distinguish between child elements and attributes of a given element. Both are represented in the same way. # - We do not maintain the order of the child elements. The order of elements in the document is lost in the representation. # - We do not allow multiple child elements with matching tag names and contents. # - When an element contains character data and has a child element or an attribute, the character data is represented as if it were a child element with tag name `$`. When an element contains character data and doesn't have a child element or an attribute, the character data is represented as the right part of the couplet that represents the element. This results in different representations of character data, depending on other data in the element. # # Neither of these issues is a problem for our example. All these issues could be resolved, at the cost of a higher complexity of the representation. # # [XML]: (Extensible Markup Language (XML) 1.0 (Fifth Edition)) # [JSON]: (Introducing JSON) # # The `regions` XML File # # The tables `region` and `nation` contain data of a type that is often represented as hierarchical data in an XML document. The world is divided into regions (with a key, a name and a descriptive comment), each of which contains nations (again with a key, a name and a descriptive comment). In the relational (tabular) data, the nations are associated with their region through a `regionkey` column in the `nation` table. In the hierarchical data, this is not necessary; children are already associated with their parents through the structure of the document. # # We created an XML document `regions.xml` that contains the data of these two tables, in a very straightforward way. The root node is ``, it contains `` elements that contain the region data from the `region` table and the associated nation data from the `nation` table as `` child elements. # # First we import this document. This is the section `doc("regions.xml")` in the XQuery statement. # # - [`algebraixlib.io.xml`][] provides utilities for processing XML data. # - The `convert_numerics` argument of the [`xml.import_xml`][] function tells it to convert character data that looks like a number into a numeric Python type. (Otherwise, it would be a string.) # - [`print_var`][] is a utility that prints our [`MathObject`][]s in a somewhat readable text format. The `max_length=800` argument tells it to stop creating output at 800 characters; the `append='\n'` argument tells it to append a line feed to the output (required for use in notebooks). # # [`algebraixlib.io.xml`]: (module algebraixlib.io.xml) # [`xml.import_xml`]: (function algebraixlib.io.xml.import_xml) # [`print_var`]: (function algebraixlib.util.miscellaneous.print_var) # [`MathObject`]: (class algebraixlib.mathobjects.mathobject.MathObject) # In[1]: import algebraixlib.io.xml as xml from algebraixlib.util.miscellaneous import print_var regions = xml.import_xml('regions.xml', convert_numerics=True) print_var('regions', max_length=800, append='\n') # From this data we extract a clan where every relation in the clan represents a region. This is the section `doc("regions.xml")/regions/region` in the XQuery statement. # # For this, we use two syntactic shortcuts we added to our `MathObject` hierarchy: the `(left)` and `[left]` notations. # # - `(left)` is only defined if the `MathObject` is a functional relation. If it is, it returns the right part associated with `left`. # - `[left]` is defined if the `MathObject` is a relation or a clan. In either case, it returns a set of all right parts associated with `left`. # # Simplified, our XML content looks like this: # # # # ... # # # ... # # # # We translate this into the following: # # $$ # Regions = \{regions{\mapsto}\{region{\mapsto}\{...\}, region{\mapsto}\{...\}, ...\}\} # $$ # # With this data in the object `regions`, the expression `regions('regions')` returns the right part of the couplet with a left part `regions`, which is a set that contains a number of couplets with left part `region`. Applying to this the `[]` operator (`regions('regions')['region']`) returns a set of all the right parts of the couplets with left `region`. Since each of these right parts is a relation (a set of couplets) with the region data, the set of these relations is a clan, where each relation represents the data of a single region. # # Or, in mathematical notation: # # $$ # Regions_{Clan} = \{right(R)\ :\ R \in Regions(regions) \text{ and } left(R) = region \} # $$ # In[2]: regions_clan = regions('regions')['region'] print_var('regions_clan', max_length=800, append='\n') # We are now back in 'clan land' and can use the same techniques we used in [2-Tables][] and more specifically [3-Graphs - The supplier Graph][]. To extract the data of a region with a given name, we use superstriction: # # $$ # Region_{Target} = Regions_{Clan} \blacktriangleright \{\{name{\mapsto}\text{'MIDDLE EAST'}\}\} # $$ # # This implements the section `doc("regions.xml")/regions/region[name="MIDDLE EAST"]` of the XQuery statement. # # - [`algebraixlib.algebras.clans`][] contains the functions that are related to our algebra of clans. # - [`clans.superstrict`][] executes the clan superstriction ($\blacktriangleright$). # - [`iprint_latex`][] is a utility that prints our [`MathObject`][]s in LaTeX format in IPython notebooks. The `short=True` argument tells it to create abbreviated output. # # [2-Tables]: <2-Tables.ipynb> (IPython Notebook: TPC-H Query 5 - Tables) # [3-Graphs - The supplier Graph]: <3-Graphs.ipynb#The-supplier-Graph> (IPython Notebook: TPC-H Query 5 - Graphs - The supplier Graph) # [`algebraixlib.algebras.clans`]: (module algebraixlib.algebras.clans) # [`clans.superstrict`]: (function algebraixlib.algebras.clans.superstrict) # [`iprint_latex`]: (function algebraixlib.util.latexprinter.iprint_latex) # [`MathObject`]: (class algebraixlib.mathobjects.mathobject.MathObject) # In[3]: import algebraixlib.algebras.clans as clans from algebraixlib.util.latexprinter import iprint_latex target_region = clans.superstrict(regions_clan, clans.from_dict({'name': 'MIDDLE EAST'})) iprint_latex('target_region', short=True) # Again using the `[left]` notation, we extract a set of all nation relations (a clan) from this. # # In set notation: # # $$ # Nations = \{right(T)\ :\ T \in Region_{Target} \text{ and } left(T) = nation \} # $$ # # With this, we're up to `doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation` of the XQuery statement. # In[4]: nations_clan = target_region['nation'] iprint_latex('nations_clan', short=True) # We are only interested in the columns `nationkey` and `name` (not in `comment`), and we want to rename `name` to `nationname`. All of this can be accomplished by a composition: # # $$ # Nations_{Proj} = Nations \circ \{\{nationkey{\mapsto}nationkey, nationname{\mapsto}name\}\} # $$ # # This then implements the full XQuery: # # for $x in doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation # return {$x/nationkey}{data($x/name)} # # It also provides the output of the XQuery as a clan (which was the intention of the XQuery statement). # # - [`clans.compose`][] executes the clan composition ($\circ$). # # [`clans.compose`]: (function algebraixlib.algebras.clans.compose) # In[5]: nations = clans.compose(nations_clan, clans.from_dict({'nationkey': 'nationkey', 'nationname': 'name'})) iprint_latex('nations', short=True) # # Last Step # # Finish this series of tutorials with [5-Query][]; it brings it all together and explains the whole query. # # [5-Query]: <5-Query.ipynb> (IPython Notebook: TPC-H Query 5 - Query) # ---- # © Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022. # # This file is part of [`algebraixlib`][] . # # [`algebraixlib`][] is free software: you can redistribute it and/or modify it under the terms of [version 3 of the GNU Lesser General Public License][] as published by the [Free Software Foundation][]. # # [`algebraixlib`][] is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. # # You should have received a copy of the GNU Lesser General Public License along with [`algebraixlib`][]. If not, see [GNU licenses][]. # # [`algebraixlib`]: (A Python library for data algebra) # [Version 3 of the GNU Lesser General Public License]: # [Free Software Foundation]: # [GNU licenses]: