This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified TPC-H query.
The tutorials assume basic familiarity with our library; we suggest working through our Hello_World tutorial first. They also assume basic knowledge of relational data, RDF and XML.
In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence:
We converted the table supplier
from the original data into an RDF graph supplier
. In this tutorial, we present our clan representation of RDF graphs, how to import them from N-Triples and Turtle formats and how to execute simple SPARQL matching (in the form of basic graph patterns) on the example of the SPARQL 'subquery' that is embedded in our modified query 5:
(
# This is a pseudo-subquery in SPARQL. It extracts a table with the
# columns 'suppkey' and 'nationkey' from the RDF graph 'supplier'
# and injects them into the outer query as table 'supplier'.
SELECT
?suppkey, ?nationkey
FROM
supplier
WHERE {
?supplier <tpch:suppkey> ?suppkey .
?supplier <tpch:nationkey> ?nationkey .
}
) AS supplier_solutions
This query creates a table with the columns suppkey
and nationkey
that associates supplier keys with the nation keys of the suppliers' nation.
The common RDF graph serialization formats provide the graph data in tabular form, with three columns that represent subject, predicate and object. This applies most clearly to the N-Triples format; Turtle and other formats provide syntactic shortcuts that allow for example repetition of subject over predicate/object lists or subject and predicate over object lists, but this doesn't invalidate the principle.
With this, our internal representation is not different from the one we use for tables; such a graph is represented as a regular clan with the lefts s
, p
and o
(representing subject, predicate and object, respectively). The description of our internal representation of tables applies here as well. What makes RDF graphs different from 'normal' tables are the operations that are used for the common questions.
Our Turtle import function is not a full-featured RDF implementation. The most important limitations are:
15
are converted into integer atoms, plain strings like "str"
are converted into string atoms, and IRIs like <scheme:path>
are converted into rdflib.URIRef
instances. There is no support for the IRI-based RDF type system.supplier
Graph¶We used a simple, straightforward method to convert the original supplier
table into an RDF graph: subjects in the graph represent rows and are IRIs derived from the suppkey
column value (the primary key), predicates represent columns and are IRIs derived from the column name, and the objects are the cell values for a given subject (row) and predicate (column).
First we import the RDF graph supplier
from a Turtle file supplier.ttl
.
algebraixlib.io.rdf
provides utilities for processing RDF-style graph data.iprint_latex
is a utility that prints our MathObject
s in LaTeX format in IPython notebooks. The short=True
argument tells it to create abbreviated output.import algebraixlib.io.rdf as rdf
from algebraixlib.util.latexprinter import iprint_latex
suppliers = rdf.import_graph('supplier.ttl')
iprint_latex('suppliers', short=True)
The main selection and combination mechanism in SPARQL are sequences of basic graph patterns (BGPs) like "?supplier <tpch:suppkey> ?suppkey
". Such a BGP is a sequence of two operations:
<tpch:suppkey>
in the couplet with left part p
.s
, p
or o
lefts and are renamed to a variable name. In our first BGP, we need both the s
and the o
lefts and we name them supplier
resp. suppkey
.A superstriction is defined on sets (relations) and sets of sets (clans) as follows:
$$ \begin{align*} \text{Set superstriction}:\ &S \vartriangleright T &&:= \begin{cases} S &\text{if } S \supset T \\ \text{undefined} &\text{otherwise} \end{cases} &\text{for } S, T \in P(M)\\ \text{Set of sets superstriction}:\ &\mathbb{S} \blacktriangleright \mathbb{T} &&:= \{X \vartriangleright Y\ : X \in \mathbb{S} \text{ and } Y \in \mathbb{T}\} &\text{for } \mathbb{S}, \mathbb{T} \in P^2(M) \end{align*} $$The clan (set of sets) superstriction restricts the members of one clan to the ones that are a superset of a member of the other clan.
Composition essentially selects certain couplets and changes their lefts or rights. In our case, we want to select and rename the lefts s
and o
and rename them to supplier
and suppkey
, respectively.
With this, the execution of a BGP results in a clan, where the variable names in the BGP are the lefts in the clan. (We strip the leading question mark from the variable names; this is merely an artifact of the SPARQL syntax.)
In data algebra, we get this:
$$ \begin{align*} BGP_{suppkey,matches} &= Supplier \blacktriangleright \{\{p{\mapsto}\text{<tpch:suppkey>}\}\} \\ BGP_{suppkey} &= BGP_{suppkey,matches} \circ \{\{supplier{\mapsto}s, suppkey{\mapsto}o\}\} \end{align*} $$rdflib
, rdflib.URIRef
: IRIs in our RDF graphs are represented as atoms with a value type of rdflib.URIRef
, so the IRI <tpch:suppkey>
becomes rdflib.URIRef('tpch:suppkey')
in the code.algebraixlib.algebras.clans
contains the functions that are related to our algebra of clans.clans.superstrict
executes the clan superstriction ($\blacktriangleright$).clans.compose
executes the clan composition ($\circ$).clans.from_dict
creates a clan with a single relation. The relation represents the data in the dict
argument, converting the keys to lefts and the values to rights.import rdflib
import algebraixlib.algebras.clans as clans
bgp_suppkey_matches = clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:suppkey')}))
bgp_suppkey = clans.compose(bgp_suppkey_matches, clans.from_dict({'supplier': 's', 'suppkey': 'o'}))
iprint_latex('bgp_suppkey', short=True)
For converting the second BGP "?supplier <tpch:nationkey> ?nationkey
", we use the same technique, only this time we remove the intermediate result ($BGP_{suppkey,matches}$ in the equation above):
In the code we did something similar, removing the intermediate variable (bgp_suppkey_matches
in the code above):
bgp_nationkey = clans.compose(clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:nationkey')})),
clans.from_dict({'supplier': 's', 'nationkey': 'o'}))
iprint_latex('bgp_nationkey', short=True)
The last step of processing a sequence of BGPs is to join the clans they generated. A solution is defined as a set where every variable has exactly one value, no matter in how many BGPs it appears. Mathematically, we represent this as functional cross-union (similar to an inner join in SQL).
The implicit join of all BGPs to create a set of solutions is represented by a functional cross-union of the individual BGP result clans. It is again a clan, with a column for every variable present in any of the BGPs.
Typically, there are variables in a query that are only needed for the join and that are not part of the query result (?supplier
in our example). To get rid of these, we project the result just as we did in the customer
Table example.
In data algebra:
$$ Supplier_{Solutions} = (BGP_{suppkey} \underset{f}{\blacktriangledown} BGP_{nationkey}) \circ \{\{nationkey{\mapsto}nationkey, suppkey{\mapsto}suppkey\}\} $$clans.project
executes the composition of a clan with a clan diagonal.clans.cross_functional_union
executes the functional cross-union of two clans.supplier_solutions = clans.project(
clans.cross_functional_union(bgp_suppkey, bgp_nationkey), 'nationkey', 'suppkey')
iprint_latex('supplier_solutions', short=True)
Continue with 4-Hierarchies; it introduces our representation of hierarchical data, on the example of XML.
© Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022.
This file is part of algebraixlib
.
algebraixlib
is free software: you can redistribute it and/or modify it under the terms of version 3 of the GNU Lesser General Public License as published by the Free Software Foundation.
algebraixlib
is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along with algebraixlib
. If not, see GNU licenses.