#!/usr/bin/env python # coding: utf-8 # # TPC-H Query 5 – Graphs as Clans # # This IPython notebook is part of a series of tutorials that introduce how data algebra facilitates querying data from multiple sources and in different structures, on the example of a modified [TPC-H][] query. # # The tutorials assume basic familiarity with our library; we suggest working through our [Hello_World][] tutorial first. They also assume basic knowledge of relational data, RDF and XML. # # In some cases, later parts of the tutorial assume knowledge of concepts introduced in earlier parts, so it is best to work through them in the listed sequence: # # - [1-Introduction][]: Introduces this series of tutorials, TPC-H, the TPC-H query 5 and our modifications to it. # - [2-Tables][]: Introduces our representation of tabular data, on the example of CSV. # - **[3-Graphs][] (this tutorial)**: Introduces our representation of RDF-style tabular graph data, on the example of Turtle. # - [4-Hierarchies][]: Introduces our representation of hierarchical data, on the example of XML. # - [5-Query][]: Brings it all together and explains the whole query. # # [TPC-H]: (TPC-H Benchmark Main Page) # [Hello_World]: <../Hello_World.ipynb> (IPython Notebook: Hello World) # [1-Introduction]: <1-Introduction.ipynb> (IPython Notebook: TPC-H Query 5 - Introduction) # [2-Tables]: <2-Tables.ipynb> (IPython Notebook: TPC-H Query 5 - Tables) # [3-Graphs]: <3-Graphs.ipynb> (IPython Notebook: TPC-H Query 5 - Graphs) # [4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies) # [5-Query]: <5-Query.ipynb> (IPython Notebook: TPC-H Query 5 - Query) # ## Graph Data In the Query # # We converted the table `supplier` from the original data into an RDF graph `supplier`. In this tutorial, we present our clan representation of RDF graphs, how to import them from [N-Triples][] and [Turtle][] formats and how to execute simple SPARQL matching (in the form of basic graph patterns) on the example of the SPARQL 'subquery' that is embedded in our modified query 5: # # ``` sparql # ( # # This is a pseudo-subquery in SPARQL. It extracts a table with the # # columns 'suppkey' and 'nationkey' from the RDF graph 'supplier' # # and injects them into the outer query as table 'supplier'. # SELECT # ?suppkey, ?nationkey # FROM # supplier # WHERE { # ?supplier ?suppkey . # ?supplier ?nationkey . # } # ) AS supplier_solutions # ``` # # This query creates a table with the columns `suppkey` and `nationkey` that associates supplier keys with the nation keys of the suppliers' nation. # # [N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph) # [Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language) # ## Representation of RDF Graphs as Clans # # The common RDF graph serialization formats provide the graph data in tabular form, with three columns that represent subject, predicate and object. This applies most clearly to the [N-Triples][] format; [Turtle][] and other formats provide syntactic shortcuts that allow for example repetition of subject over predicate/object lists or subject and predicate over object lists, but this doesn't invalidate the principle. # # With this, our internal representation is not different from the one we use for tables; such a graph is represented as a [regular clan][] with the lefts `s`, `p` and `o` (representing subject, predicate and object, respectively). The description of our internal [representation of tables][] applies here as well. What makes RDF graphs different from 'normal' tables are the operations that are used for the common questions. # # Our Turtle import function is not a full-featured RDF implementation. The most important limitations are: # # - Escape sequences are not fully supported. # - Unicode may or may not work. # - No support for blank nodes. # - No support for predicate-object lists and object lists. # - Type support: Plain integers like `15` are converted into integer atoms, plain strings like `"str"` are converted into string atoms, and IRIs like `` are converted into `rdflib.URIRef` instances. There is no support for the IRI-based RDF type system. # # # [N-Triples]: (RDF 1.1 N-Triples - A line-based syntax for an RDF graph) # [Turtle]: (RDF 1.1 Turtle - Terse RDF Triple Language) # [regular clan]: ((left-)regular clan) # [Representation of Tables]: <2-Tables.ipynb#Representation-of-Tables> (IPython Notebook: TPC-H Query 5 - Tables - Representation of Tables) # # The `supplier` Graph # # We used a simple, straightforward method to convert the original `supplier` table into an RDF graph: subjects in the graph represent rows and are IRIs derived from the `suppkey` column value (the primary key), predicates represent columns and are IRIs derived from the column name, and the objects are the cell values for a given subject (row) and predicate (column). # # First we import the RDF graph `supplier` from a Turtle file `supplier.ttl`. # # - [`algebraixlib.io.rdf`][] provides utilities for processing RDF-style graph data. # - [`iprint_latex`][] is a utility that prints our [`MathObject`][]s in LaTeX format in IPython notebooks. The `short=True` argument tells it to create abbreviated output. # # [`algebraixlib.io.rdf`]: (module algebraixlib.io.rdf) # [`iprint_latex`]: (function algebraixlib.util.latexprinter.iprint_latex) # [`MathObject`]: (class algebraixlib.mathobjects.mathobject.MathObject) # In[1]: import algebraixlib.io.rdf as rdf from algebraixlib.util.latexprinter import iprint_latex suppliers = rdf.import_graph('supplier.ttl') iprint_latex('suppliers', short=True) # The main selection and combination mechanism in SPARQL are sequences of basic graph patterns (BGPs) like "`?supplier ?suppkey`". Such a BGP is a sequence of two operations: # # - First a superstriction that selects the relations that match the given value(s). In our first BGP we match the value `` in the couplet with left part `p`. # - This is followed by a composition that selects and renames the lefts of interest. The lefts of interest are the ones associated with SPARQL variables; they are one or more of the `s`, `p` or `o` lefts and are renamed to a variable name. In our first BGP, we need both the `s` and the `o` lefts and we name them `supplier` resp. `suppkey`. # # A superstriction is defined on sets (relations) and sets of sets (clans) as follows: # # $$ # \begin{align*} # \text{Set superstriction}:\ # &S \vartriangleright T # &&:= # \begin{cases} # S &\text{if } S \supset T \\ # \text{undefined} &\text{otherwise} # \end{cases} # &\text{for } S, T \in P(M)\\ # \text{Set of sets superstriction}:\ # &\mathbb{S} \blacktriangleright \mathbb{T} # &&:= # \{X \vartriangleright Y\ : X \in \mathbb{S} \text{ and } Y \in \mathbb{T}\} # &\text{for } \mathbb{S}, \mathbb{T} \in P^2(M) # \end{align*} # $$ # # The clan (set of sets) superstriction restricts the members of one clan to the ones that are a superset of a member of the other clan. # # Composition essentially selects certain couplets and changes their lefts or rights. In our case, we want to select and rename the lefts `s` and `o` and rename them to `supplier` and `suppkey`, respectively. # # With this, the execution of a BGP results in a clan, where the variable names in the BGP are the lefts in the clan. (We strip the leading question mark from the variable names; this is merely an artifact of the SPARQL syntax.) # # In data algebra, we get this: # # $$ # \begin{align*} # BGP_{suppkey,matches} &= Supplier \blacktriangleright \{\{p{\mapsto}\text{}\}\} \\ # BGP_{suppkey} &= BGP_{suppkey,matches} \circ \{\{supplier{\mapsto}s, suppkey{\mapsto}o\}\} # \end{align*} # $$ # # - [`rdflib`][], [`rdflib.URIRef`][]: IRIs in our RDF graphs are represented as atoms with a value type of `rdflib.URIRef`, so the IRI `` becomes `rdflib.URIRef('tpch:suppkey')` in the code. # - [`algebraixlib.algebras.clans`][] contains the functions that are related to our algebra of clans. # - [`clans.superstrict`][] executes the clan superstriction ($\blacktriangleright$). # - [`clans.compose`][] executes the clan composition ($\circ$). # - [`clans.from_dict`][] creates a clan with a single relation. The relation represents the data in the `dict` argument, converting the keys to lefts and the values to rights. # # [`rdflib`]: (A Python library for working with RDF.) # [`rdflib.URIRef`]: (RDF URI Reference) # [`algebraixlib.algebras.clans`]: (module algebraixlib.algebras.clans) # [`clans.superstrict`]: (function algebraixlib.algebras.clans.superstrict) # [`clans.compose`]: (function algebraixlib.algebras.clans.compose) # [`clans.from_dict`]: (function algebraixlib.algebras.clans.from_dict) # In[2]: import rdflib import algebraixlib.algebras.clans as clans bgp_suppkey_matches = clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:suppkey')})) bgp_suppkey = clans.compose(bgp_suppkey_matches, clans.from_dict({'supplier': 's', 'suppkey': 'o'})) iprint_latex('bgp_suppkey', short=True) # For converting the second BGP "`?supplier ?nationkey`", we use the same technique, only this time we remove the intermediate result ($BGP_{suppkey,matches}$ in the equation above): # # $$ # BGP_{nationkey} = (Supplier \blacktriangleright \{\{p{\mapsto}\text{}\}\}) # \circ \{\{supplier{\mapsto}s, nationkey{\mapsto}o\}\} # $$ # # In the code we did something similar, removing the intermediate variable (`bgp_suppkey_matches` in the code above): # In[3]: bgp_nationkey = clans.compose(clans.superstrict(suppliers, clans.from_dict({'p': rdflib.URIRef('tpch:nationkey')})), clans.from_dict({'supplier': 's', 'nationkey': 'o'})) iprint_latex('bgp_nationkey', short=True) # The last step of processing a sequence of BGPs is to join the clans they generated. A solution is defined as a set where every variable has exactly one value, no matter in how many BGPs it appears. Mathematically, we represent this as functional cross-union (similar to an inner join in SQL). # # The implicit join of all BGPs to create a set of solutions is represented by a functional cross-union of the individual BGP result clans. It is again a clan, with a column for every variable present in any of the BGPs. # # Typically, there are variables in a query that are only needed for the join and that are not part of the query result (`?supplier` in our example). To get rid of these, we project the result just as we did in the [`customer` Table][] example. # # In data algebra: # # $$ # Supplier_{Solutions} = (BGP_{suppkey} \underset{f}{\blacktriangledown} BGP_{nationkey}) \circ \{\{nationkey{\mapsto}nationkey, suppkey{\mapsto}suppkey\}\} # $$ # # - [`clans.project`][] executes the composition of a clan with a clan diagonal. # - [`clans.cross_functional_union`][] executes the functional cross-union of two clans. # # [`customer` Table]: <2-Tables.ipynb#The-customer-Table> (IPython Notebook: TPC-H Query 5 - Tables - The `customer` Table) # [`clans.project`]: (function algebraixlib.algebras.clans.project) # [`clans.cross_functional_union`]: (function algebraixlib.algebras.clans.cross_functional_union) # In[4]: supplier_solutions = clans.project( clans.cross_functional_union(bgp_suppkey, bgp_nationkey), 'nationkey', 'suppkey') iprint_latex('supplier_solutions', short=True) # # Next Step # # Continue with [4-Hierarchies][]; it introduces our representation of hierarchical data, on the example of XML. # # [4-Hierarchies]: <4-Hierarchies.ipynb> (IPython Notebook: TPC-H Query 5 - Hierarchies) # ---- # © Copyright Permission.io, Inc. (formerly known as Algebraix Data Corporation), Copyright (c) 2022. # # This file is part of [`algebraixlib`][] . # # [`algebraixlib`][] is free software: you can redistribute it and/or modify it under the terms of [version 3 of the GNU Lesser General Public License][] as published by the [Free Software Foundation][]. # # [`algebraixlib`][] is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. # # You should have received a copy of the GNU Lesser General Public License along with [`algebraixlib`][]. If not, see [GNU licenses][]. # # [`algebraixlib`]: (A Python library for data algebra) # [Version 3 of the GNU Lesser General Public License]: # [Free Software Foundation]: # [GNU licenses]: