#!/usr/bin/env python
# coding: utf-8
# # TPC-H Query 5 – Graphs as Clans
#
# This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified [TPC-H][] query.
#
# The tutorials assume basic familiarity with our library; we suggest working through our [Hello_World][] tutorial first. They also assume basic knowledge of relational data, RDF and XML.
#
# In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence:
#
# - [1-Introduction][]: Introduces this series of tutorials, TPC-H, the TPC-H query 5 and our modifications to it.
# - [2-Tables][]: Introduces our representation of tabular data, on the example of CSV.
# - **[3-Graphs][] (this tutorial)**: Introduces our representation of RDF-style tabular graph data, on the example of Turtle.
# - [4-Hierarchies][]: Introduces our representation of hierarchical data, on the example of XML.
# - [5-Query][]: Brings it all together and explains the whole query.
#
# [TPC-H]: (TPC-H Benchmark Main Page)
# [Hello_World]: <../Hello_World.ipynb> (IPython Notebook: Hello World)
# [1-Introduction]: <1-Introduction.ipynb> (IPython Notebook: TPC-H Query 5 - Introduction)
# [2-Tables]: <2-Tables.ipynb> (IPython Notebook: TPC-H Query 5 - Tables)
# [3-Graphs]: <3-Graphs.ipynb> (IPython Notebook: TPC-H Query 5 - Graphs)
# [4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies)
# [5-Query]: <5-Query.ipynb> (IPython Notebook: TPC-H Query 5 - Query)
# ## Graph Data In the Query
#
# We converted the table `supplier` from the original data into an RDF graph `supplier`. In this tutorial, we present our clan representation of RDF graphs, how to import them from [N-Triples][] and [Turtle][] formats and how to execute simple SPARQL matching (in the form of basic graph patterns) on the example of the SPARQL 'subquery' that is embedded in our modified query 5:
#
# ``` sparql
# (
# # This is a pseudo-subquery in SPARQL. It extracts a table with the
# # columns 'suppkey' and 'nationkey' from the RDF graph 'supplier'
# # and injects them into the outer query as table 'supplier'.
# SELECT
# ?suppkey, ?nationkey
# FROM
# supplier
# WHERE {
# ?supplier ?suppkey .
# ?supplier ?nationkey .
# }
# ) AS supplier_solutions
# ```
#
# This query creates a table with the columns `suppkey` and `nationkey` that associates supplier keys with the nation keys of the suppliers' nation.
#
# [N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph)
# [Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language)
# ## Representation of RDF Graphs as Clans
#
# The common RDF graph serialization formats provide the graph data in tabular form, with three columns that represent subject, predicate and object. This applies most clearly to the [N-Triples][] format; [Turtle][] and other formats provide syntactic shortcuts that allow for example repetition of subject over predicate/object lists or subject and predicate over object lists, but this doesn't invalidate the principle.
#
# With this, our internal representation is not different from the one we use for tables; such a graph is represented as a [regular clan][] with the lefts `s`, `p` and `o` (representing subject, predicate and object, respectively). The description of our internal [representation of tables][] applies here as well. What makes RDF graphs different from 'normal' tables are the operations that are used for the common questions.
#
# Our Turtle import function is not a full-featured RDF implementation. The most important limitations are:
#
# - Escape sequences are not fully supported.
# - Unicode may or may not work.
# - No support for blank nodes.
# - No support for predicate-object lists and object lists.
# - Type support: Plain integers like `15` are converted into integer atoms, plain strings like `"str"` are converted into string atoms, and IRIs like `` are converted into `rdflib.URIRef` instances. There is no support for the IRI-based RDF type system.
#
#
# [N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph)
# [Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language)
# [regular clan]: ((left-)regular clan)
# [Representation of Tables]: <2-Tables.ipynb#Representation-of-Tables> (IPython Notebook: TPC-H Query 5 - Tables - Representation of Tables)
# # The `supplier` Graph
#
# We used a simple, straightforward method to convert the original `supplier` table into an RDF graph: subjects in the graph represent rows and are IRIs derived from the `suppkey` column value (the primary key), predicates represent columns and are IRIs derived from the column name, and the objects are the cell values for a given subject (row) and predicate (column).
#
# First we import the RDF graph `supplier` from a Turtle file `supplier.ttl`.
#
# - [`algebraixlib.io.rdf`][] provides utilities for processing RDF-style graph data.
# - [`iprint_latex`][] is a utility that prints our [`MathObject`][]s in LaTeX format in IPython notebooks. The `short=True` argument tells it to create abbreviated output.
#
# [`algebraixlib.io.rdf`]: (module algebraixlib.io.rdf)
# [`iprint_latex`]: (function algebraixlib.util.latexprinter.iprint_latex)
# [`MathObject`]: (class algebraixlib.mathobjects.mathobject.MathObject)
# In[1]:
import algebraixlib.io.rdf as rdf
from algebraixlib.util.latexprinter import iprint_latex
suppliers = rdf.import_graph('supplier.ttl')
iprint_latex('suppliers', short=True)
# The main selection and combination mechanism in SPARQL are sequences of basic graph patterns (BGPs) like "`?supplier ?suppkey`". Such a BGP is a sequence of two operations:
#
# - First a superstriction that selects the relations that match the given value(s). In our first BGP we match the value `` in the couplet with left part `p`.
# - This is followed by a composition that selects and renames the lefts of interest. The lefts of interest are the ones associated with SPARQL variables; they are one or more of the `s`, `p` or `o` lefts and are renamed to a variable name. In our first BGP, we need both the `s` and the `o` lefts and we name them `supplier` resp. `suppkey`.
#
# A superstriction is defined on sets (relations) and sets of sets (clans) as follows:
#
# $$
# \begin{align*}
# \text{Set superstriction}:\
# &S \vartriangleright T
# &&:=
# \begin{cases}
# S &\text{if } S \supset T \\
# \text{undefined} &\text{otherwise}
# \end{cases}
# &\text{for } S, T \in P(M)\\
# \text{Set of sets superstriction}:\
# &\mathbb{S} \blacktriangleright \mathbb{T}
# &&:=
# \{X \vartriangleright Y\ : X \in \mathbb{S} \text{ and } Y \in \mathbb{T}\}
# &\text{for } \mathbb{S}, \mathbb{T} \in P^2(M)
# \end{align*}
# $$
#
# The clan (set of sets) superstriction restricts the members of one clan to the ones that are a superset of a member of the other clan.
#
# Composition essentially selects certain couplets and changes their lefts or rights. In our case, we want to select and rename the lefts `s` and `o` and rename them to `supplier` and `suppkey`, respectively.
#
# With this, the execution of a BGP results in a clan, where the variable names in the BGP are the lefts in the clan. (We strip the leading question mark from the variable names; this is merely an artifact of the SPARQL syntax.)
#
# In data algebra, we get this:
#
# $$
# \begin{align*}
# BGP_{suppkey,matches} &= Supplier \blacktriangleright \{\{p{\mapsto}\text{}\}\} \\
# BGP_{suppkey} &= BGP_{suppkey,matches} \circ \{\{supplier{\mapsto}s, suppkey{\mapsto}o\}\}
# \end{align*}
# $$
#
# - [`rdflib`][], [`rdflib.URIRef`][]: IRIs in our RDF graphs are represented as atoms with a value type of `rdflib.URIRef`, so the IRI `` becomes `rdflib.URIRef('tpch:suppkey')` in the code.
# - [`algebraixlib.algebras.clans`][] contains the functions that are related to our algebra of clans.
# - [`clans.superstrict`][] executes the clan superstriction ($\blacktriangleright$).
# - [`clans.compose`][] executes the clan composition ($\circ$).
# - [`clans.from_dict`][] creates a clan with a single relation. The relation represents the data in the `dict` argument, converting the keys to lefts and the values to rights.
#
# [`rdflib`]: (A Python library for working with RDF.)
# [`rdflib.URIRef`]: (RDF URI Reference)
# [`algebraixlib.algebras.clans`]: (module algebraixlib.algebras.clans)
# [`clans.superstrict`]: (function algebraixlib.algebras.clans.superstrict)
# [`clans.compose`]: (function algebraixlib.algebras.clans.compose)
# [`clans.from_dict`]: (function algebraixlib.algebras.clans.from_dict)
# In[2]:
import rdflib
import algebraixlib.algebras.clans as clans
bgp_suppkey_matches = clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:suppkey')}))
bgp_suppkey = clans.compose(bgp_suppkey_matches, clans.from_dict({'supplier': 's', 'suppkey': 'o'}))
iprint_latex('bgp_suppkey', short=True)
# For converting the second BGP "`?supplier ?nationkey`", we use the same technique, only this time we remove the intermediate result ($BGP_{suppkey,matches}$ in the equation above):
#
# $$
# BGP_{nationkey} = (Supplier \blacktriangleright \{\{p{\mapsto}\text{}\}\})
# \circ \{\{supplier{\mapsto}s, nationkey{\mapsto}o\}\}
# $$
#
# In the code we did something similar, removing the intermediate variable (`bgp_suppkey_matches` in the code above):
# In[3]:
bgp_nationkey = clans.compose(clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:nationkey')})),
clans.from_dict({'supplier': 's', 'nationkey': 'o'}))
iprint_latex('bgp_nationkey', short=True)
# The last step of processing a sequence of BGPs is to join the clans they generated. A solution is defined as a set where every variable has exactly one value, no matter in how many BGPs it appears. Mathematically, we represent this as functional cross-union (similar to an inner join in SQL).
#
# The implicit join of all BGPs to create a set of solutions is represented by a functional cross-union of the individual BGP result clans. It is again a clan, with a column for every variable present in any of the BGPs.
#
# Typically, there are variables in a query that are only needed for the join and that are not part of the query result (`?supplier` in our example). To get rid of these, we project the result just as we did in the [`customer` Table][] example.
#
# In data algebra:
#
# $$
# Supplier_{Solutions} = (BGP_{suppkey} \underset{f}{\blacktriangledown} BGP_{nationkey}) \circ \{\{nationkey{\mapsto}nationkey, suppkey{\mapsto}suppkey\}\}
# $$
#
# - [`clans.project`][] executes the composition of a clan with a clan diagonal.
# - [`clans.cross_functional_union`][] executes the functional cross-union of two clans.
#
# [`customer` Table]: <2-Tables.ipynb#The-customer-Table> (IPython Notebook: TPC-H Query 5 - Tables - The `customer` Table)
# [`clans.project`]: (function algebraixlib.algebras.clans.project)
# [`clans.cross_functional_union`]: (function algebraixlib.algebras.clans.cross_functional_union)
# In[4]:
supplier_solutions = clans.project(
clans.cross_functional_union(bgp_suppkey, bgp_nationkey), 'nationkey', 'suppkey')
iprint_latex('supplier_solutions', short=True)
# # Next Step
#
# Continue with [4-Hierarchies][]; it introduces our representation of hierarchical data, on the example of XML.
#
# [4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies)
# ----
# © Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022.
#
# This file is part of [`algebraixlib`][] .
#
# [`algebraixlib`][] is free software: you can redistribute it and/or modify it under the terms of [version 3 of the GNU Lesser General Public License][] as published by the [Free Software Foundation][].
#
# [`algebraixlib`][] is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License along with [`algebraixlib`][]. If not, see [GNU licenses][].
#
# [`algebraixlib`]: (A Python library for data algebra)
# [Version 3 of the GNU Lesser General Public License]:
# [Free Software Foundation]:
# [GNU licenses]: