This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified TPC-H query.
The tutorials assume basic familiarity with our library; we suggest working through our Hello_World tutorial first. They also assume basic knowledge of relational data, RDF and XML.
In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence:
In our modified query 5, the tables customer
, orders
and lineitem
are provided as CSV data, the original tables region
and nation
are provided as XML data and the original table supplier
is provided as RDF graph. In this tutorial, we present our representation of tabular data, how to import it from the CSV format (see also RFC 4180) and how to execute simple operations on it like project
and restrict
.
As examples, we use two subqueries from the modified query 5, one slightly modified. We treat them here as stand-alone queries, so that it is easier to understand what they do.
customers_joined_projected
Subquery¶Our first example query loads the table customer
and selects the columns custkey
and nationkey
for all customers in the nations given by the table nations
. In the full query, the result of this subquery is then referenced as customers_joined_projected
.
(
SELECT
custkey, nationkey, nationname
FROM
customer
JOIN
nations
ON
customer.nationkey = nations.nationkey
) AS customers_joined_projected
This example is a simplified version of the following subquery in our modified query 5. To keep this example simple, we replaced the pseudo-subquery (in XQuery) that provides the result that is named nations
with a hardcoded table nations
. The translation of the XQuery subquery is shown later, in 4-Hierarchies.
(
SELECT
custkey, nationkey, nationname
FROM
customer
JOIN (
-- This is a pseudo-subquery in XQuery. It extracts a list of nations
-- for the desired region, and for every nation it returns an XML 'row'
-- with the columns 'nationkey' and 'nationname' (renamed from 'name').
-- This forms an XML pseudo-'table' named 'nations' that is then joined
-- with 'customer'.
for $x in doc("regions.xml")/regions/region[name="MIDDLE EAST"]/nation
return <nation>{$x/nationkey}<nationname>{data($x/name)}</nationname></nation>
) AS nations
ON
customer.nationkey = nations.nationkey
) AS customers_joined_projected
orders_restricted_projected
Subquery¶Our second example query loads the table orders
and selects the columns orderkey
and custkey
for all orders in the given date range. In the full query, the result of this subquery is then referenced as orders_restricted_projected
.
(
SELECT
orderkey, custkey
FROM
orders
WHERE
startdate <= orders.orderdate and orders.orderdate < enddate
) AS orders_restricted_projected
This example is taken directly from the corresponding subquery in our modified query 5.
We represent tables, rows and columns as constructs built of sets and couplets.
Mathematical concepts:
Data model:
A table can be modeled as a clan. In detail:
All three constructs (couplets/cells, relations/rows, clans/tables) are associated with a number of operations. See also our Hello_World introduction.
customer
Table¶In the first example query, we import the CSV table customer
, select the customers in a given set of nations nationkeys
and project the useful columns custkey
, nationkey
and nationname
. The associated SQL query looks like this:
SELECT
custkey, nationkey, nationname
FROM
customer
JOIN
nations
ON
customer.nationkey = nations.nationkey
Below follows example code that shows how to use our API to import the table customer
from a CSV file into our MathObject
representation.
algebraixlib.io.csv
provides utilities for importing and exporting CSV data.iprint_latex
is a utility that prints our MathObject
s in LaTeX format in IPython notebooks. The short=True
argument tells it to create abbreviated output. It shortenes long strings, indicated by the appended ellipses '...'. It also only prints the first few elements of sets and replaces the remainder with an ellipsis, followed by a number in parentheses that indicates how many objects have been ommitted.customer_types
defines the types for columns that are not strings. (It does not select columns to be imported; all columns not listed here are imported as strings.) It is passed as argument to import_csv
, which then processes the indicated CSV columns accordingly.import algebraixlib.io.csv as csv
from algebraixlib.util.latexprinter import iprint_latex
customer_types = {'custkey': int, 'nationkey': int, 'acctbal': float}
customers = csv.import_csv('customer.csv', customer_types)
iprint_latex('customers', short=True)
Next we create a mockup of the result of the pseudo-subquery in XQuery (which is explained in 4-Hierarchies). nations
is a clan (table) with the lefts (columns) nationkey
and nationname
and for our mockup we add three rows/nations.
A clan is a Set
of Set
s of Couplet
s (a set of relations).
algebraixlib.mathobjects
provides the classes that represent our data.from algebraixlib.mathobjects import Set, Couplet
nations = Set(Set(Couplet('nationkey', 4), Couplet('nationname', 'EGYPT')),
Set(Couplet('nationkey', 10), Couplet('nationname', 'IRAN')),
Set(Couplet('nationkey', 11), Couplet('nationname', 'IRAQ')))
iprint_latex('nations')
We now join the two tables. They have only one column with matching names, which is the column on which they are to be joined. (While this is a lucky coincidence, the principle of the operation wouldn't be affected if we had to rename a column or two.)
Mathematically, we represent such a join (where the column names align just right) as functional union:
$$ \begin{align*} \text{Functional union}:\ &R \underset{f}{\cup} Q &&:=\begin{cases} R \cup Q & \text{ if } R \cup Q \text{ is functional } \\ \text{undefined} & \text{ otherwise } \end{cases} & \text{ for } R, Q \in P(M \times M)\\ \text{Functional cross-union}:\ &\mathbb{C} \underset{f}{\blacktriangledown} \mathbb{D} &&:= \{R \underset{f}{\cup} Q\ : R \in \mathbb{C} \text{ and } Q \in \mathbb{D}\} & \text{ for } \mathbb{C}, \mathbb{D} \in P^2(M \times M) \end{align*} $$With this we get:
$$ CustomersNations = Customer \underset{f}{\blacktriangledown} Nations $$Since only a part of the relations in the customers
clan have a couplet with a left part of nationkey
and a right that matches a right in the nations
clan`s nationkey
lefts, the number of relations goes down from 150 (in the customers
clan) to 19 (in the result of the join).
algebraixlib.algebras.clans
contains the functions that are related to our algebra of clans.clans.cross_functional_union
executes the functional cross-union ($\underset{f}{\blacktriangledown}$).import algebraixlib.algebras.clans as clans
customers_nations = clans.cross_functional_union(customers, nations)
iprint_latex('customers_nations', short=True)
We then compose the result to project the desired columns.
When composing a clan with a clan diagonal – a clan with a single relation that contains only couplets where left and right are the same – we remove all couplets with left parts not present in the clan diagonal from all relations, effectively projecting the columns present in the clan diagonal.
$$ CustomersNations_{Proj} = CustomersNations \circ \{\{custkey{\mapsto}custkey, nationkey{\mapsto}nationkey, nationname{\mapsto}nationname\}\} $$Composition on couplets, relations and clans is defined as follows:
$$ \begin{align*} \text{Couplet composition}:\ &c{\mapsto}d \circ a{\mapsto}b &&:= \begin{cases} a{\mapsto}d & \text{if } b = c \\ \text{undefined} & \text{otherwise} \end{cases} &\text{for } a, b, c, d \in M \\ \text{Relation composition}:\ &Rel_2 \circ Rel_1 &&:= \{cp_2 \circ cp_1\ :\ cp_1 \in Rel_1,\ cp_2 \in Rel_2\} &\text{for } Rel_1,\ Rel_2 \in P(M \times M) \\ \text{Clan composition}:\ &Clan_2 \circ Clan_1 &&:= \{Rel_2 \circ Rel_1\ :\ Rel_1 \in Clan_1,\ Rel_2 \in Clan_2\} &\text{ for } Clan_1,\ Clan_2 \in P^2(M \times M) \\ \end{align*} $$The number of relations in the clan (19) doesn't change, but the number of couplets in each relation is reduced to the ones that have the desired column names as lefts.
clans.compose
executes the clan composition ($\circ$).customers_nations_composed = clans.compose(customers_nations,
Set(Set(Couplet('custkey', 'custkey'),
Couplet('nationkey', 'nationkey'),
Couplet('nationname', 'nationname'))))
iprint_latex('customers_nations_composed', short=True)
This operation (a composition of a clan with a clan diagonal) is so common that we have a wrapper function for it in the clans
algebra:
clans.project
accepts a clan as argument, followed by one or more atoms. It creates a clan diagonal from these atoms and composes the clan with it, effectively projecting the columns passed in as arguments.customers_nations_projected = clans.project(customers_nations, 'custkey', 'nationkey', 'nationname')
iprint_latex('customers_nations_projected', short=True)
orders
Table¶In the second example query of this notebook, we import the CSV table orders
, select the orders in a given date range (from startdate
to enddate
) and project the useful columns orderkey
and custkey
. The associated SQL query looks like this:
SELECT
orderkey, custkey
FROM
orders
WHERE
startdate <= orders.orderdate and orders.orderdate < enddate
First, we again import the table orders
from a CSV file.
order_types
again defines the types for columns that need special handling. All columns not listed here are imported as strings.read_date
shows how custom functions can be used instead of standard types to perform special conversions, on the example of the column orderdate
. For each row, the function receives the CSV column value as string and returns what is needed (a datetime
in this case).print('The display of this data may take a while. 15 s is not unusual... \n', flush=True)
from datetime import datetime
def read_date(date_str: str) -> datetime:
return datetime.strptime(date_str, '%Y-%m-%d').date()
order_types = {
'orderkey': int, 'custkey': int, 'orderdate': read_date,
'totalprice': float, 'shippriority': int
}
orders = csv.import_csv('orders.csv', order_types)
iprint_latex('orders', short=True)
The display of this data may take a while. 15 s is not unusual...
Then we restrict the clan orders
using the function select_dates
. It returns True
if the right part associated with the left orderdate
in the relation rel
falls between startdate
(inclusive) and enddate
(exclusive).
As used here, our function sets.restrict
implements a set builder of this form:
The restriction reduces the number of relations in the clan from TODO to 16 (for the given date range).
import algebraixlib.algebras.sets as sets
startdate = datetime(1994, 1, 1).date()
enddate = datetime(1994, 2, 1).date()
def select_dates(rel) -> bool:
orderdate = rel('orderdate').value
return (startdate <= orderdate) and (orderdate < enddate)
orders_restricted = sets.restrict(orders, select_dates)
iprint_latex('orders_restricted', short=True)
Finally we again project the desired columns orderkey
and custkey
, using the function clans.project
:
This again doesn't change the number of relations in the clan (16), but the number of couplets in each relation is reduced to the ones that have the desired column name as left.
orders_restricted_projected = clans.project(orders_restricted, 'orderkey', 'custkey')
iprint_latex('orders_restricted_projected', short=True)
© Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022.
This file is part of algebraixlib
.
algebraixlib
is free software: you can redistribute it and/or modify it under the terms of version 3 of the GNU Lesser General Public License as published by the Free Software Foundation.
algebraixlib
is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with algebraixlib
. If not, see GNU licenses.