How to leverage the entire PyData Stack¶

A quick poll ...¶

Who uses pandas?¶

Who uses numpy?¶

Blaze - A Quick Tour¶

Blaze provides a lightweight interface on top of pre-existing computational infrastructure. This notebook gives a quick overview of how Blaze interacts with a variety of data types.

In [37]:

%reload_ext autotime

from blaze import Data, by, compute

Blaze wraps pre-existing data¶

Blaze interacts with normal Python objects. Operations on Blaze Data objects create expression trees.

These expressions deliver an intuitive numpy/pandas-like feel.

Lists¶

Starting small, Blaze interacts happily with collections of data.

It uses Pandas for pretty notebook printing.

In [38]:

x = Data([1, 2, 3, 4, 5])
x

Out[38]:

	None
0	1
1	2
2	3
3	4
4	5

time: 7 ms

In [5]:

x[x > 2] * 10

Out[5]:

	None
0	30
1	40
2	50

time: 18.5 ms

In [42]:

x.dshape

Out[42]:

dshape("5 * int64")

time: 1.49 ms

Or Tabular, Pandas-like datasets¶

Slightly more exciting, Blaze operates on tabular data

In [9]:

L = [[1, 'Alice',   100],
     [2, 'Bob',    -200],
     [3, 'Charlie', 300],
     [4, 'Dennis',  400],
     [5, 'Edith',  -500]]

time: 1.75 ms

In [10]:

x = Data(L, fields=['id', 'name', 'amount'])

time: 1.93 ms

In [43]:

x.amount.mean()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-f4afcda15db4> in <module>()
----> 1 x.amount.mean()

/Users/pcloud/Documents/code/py/blaze/blaze/expr/expressions.py in __getattr__(self, key)
    169             pass
    170         try:
--> 171             result = object.__getattribute__(self, key)
    172         except AttributeError:
    173             fields = dict(zip(map(valid_identifier, self.fields),

AttributeError: 'InteractiveSymbol' object has no attribute 'amount'

time: 194 ms

In [12]:

x.dshape

Out[12]:

dshape("5 * {id: int64, name: string, amount: int64}")

time: 1.84 ms

Here's `x` again¶

In [13]:

Out[13]:

	id	name	amount
0	1	Alice	100
1	2	Bob	-200
2	3	Charlie	300
3	4	Dennis	400
4	5	Edith	-500

time: 9.45 ms

In [14]:

deadbeats = x[x.amount < 0].name
deadbeats

Out[14]:

	name
0	Bob
1	Edith

time: 12.1 ms

Or it can even just drive pandas¶

Blaze doesn't do work, it just tells other systems to do work.

In the previous example, Blaze told Python which for-loops to write. In this example, it calls the right functions in Pandas.

The user experience is mostly identical, only performance differs.

In [15]:

from pandas import DataFrame

df = DataFrame([[1, 'Alice',   100],                         
                [2, 'Bob',    -200],
                [3, 'Charlie', 300],
                [4, 'Denis',   400],
                [5, 'Edith',  -500]], columns=['id', 'name', 'amount'])

time: 2.52 ms

In [16]:

df

Out[16]:

	id	name	amount
0	1	Alice	100
1	2	Bob	-200
2	3	Charlie	300
3	4	Denis	400
4	5	Edith	-500

time: 4.79 ms

In [17]:

x = Data(df)
x

Out[17]:

	id	name	amount
0	1	Alice	100
1	2	Bob	-200
2	3	Charlie	300
3	4	Denis	400
4	5	Edith	-500

time: 10.8 ms

In [18]:

deadbeats = x[x.amount < 0].name
deadbeats

Out[18]:

	name
1	Bob
4	Edith

time: 19.6 ms

Outputs are Blaze expressions¶

In [19]:

type(deadbeats)

Out[19]:

blaze.expr.expressions.Field

time: 1.52 ms

`compute` turns Blaze expressions into something concrete¶

In [20]:

compute(deadbeats)

Out[20]:

1      Bob
4    Edith
Name: name, dtype: object

time: 4.96 ms

In [21]:

type(compute(deadbeats))

Out[21]:

pandas.core.series.Series

time: 3.33 ms

Blaze also works with other data types like SQLAlchemy `Table`s¶

Blaze extends beyond just Python and Pandas (that's the main motivation.)

Here it drives SQLAlchemy.

In [22]:

from sqlalchemy import Table, Column, MetaData, Integer, String, create_engine

tab = Table('bank', MetaData(),
            Column('id', Integer),
            Column('name', String),
            Column('amount', Integer))

time: 1.97 ms

In [23]:

x = Data(tab)
x.dshape

Out[23]:

dshape("var * {id: ?int32, name: ?string, amount: ?int32}")

time: 2.62 ms

Just like computations on pandas objects produce pandas objects, computations on SQLAlchemy tables produce SQLAlchemy Select statements.

In [24]:

deadbeats = x[x.amount < 0].name
compute(deadbeats)

Out[24]:

<sqlalchemy.sql.selectable.Select at 0x11767b1d0; Select object>

time: 7.63 ms

In [25]:

print(compute(deadbeats))  # SQLAlchemy generates SQL

SELECT bank.name 
FROM bank 
WHERE bank.amount < :amount_1
time: 3.31 ms

Let's connect to a real database¶

When we drive a SQLAlchemy table connected to a database we get actual computation.

In [26]:

engine = create_engine('sqlite:///../blaze/blaze/examples/data/iris.db')

time: 10 ms

In [28]:

x = Data(engine)
x

Out[28]:

Data: Engine(sqlite:///../blaze/blaze/examples/data/iris.db)
DataShape: {
iris: var * {
sepal_length: ?float64,
sepal_width: ?float64,
petal_length: ?float64,
petal_width: ?float64,
species: ?string
...

time: 8.96 ms

In [29]:

x.fields

Out[29]:

['iris']

time: 1.2 ms

In [30]:

x.iris.sepal_length.mean()

Out[30]:

5.843333333333335

time: 10.6 ms

In [31]:

by(
    x.iris.species,
    shortest=x.iris.sepal_length.min(),
    longest=x.iris.sepal_length.max()
)

Out[31]:

	species	longest	shortest
0	Iris-setosa	5.8	4.3
1	Iris-versicolor	7.0	4.9
2	Iris-virginica	7.9	4.9

time: 51 ms

In [32]:

print(compute(_))

SELECT iris.species, max(iris.sepal_length) AS longest, min(iris.sepal_length) AS shortest 
FROM iris GROUP BY iris.species
time: 8.3 ms

Use URI strings to ease access¶

Often just figuring out how to produce the relevant Python object can be a challenge.

Blaze supports many formats of URI strings

In [33]:

x = Data('sqlite:///../blaze/blaze/examples/data/iris.db::iris')

time: 7.4 ms

In [34]:

Out[34]:

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa
5	5.4	3.9	1.7	0.4	Iris-setosa
6	4.6	3.4	1.4	0.3	Iris-setosa
7	5.0	3.4	1.5	0.2	Iris-setosa
8	4.4	2.9	1.4	0.2	Iris-setosa
9	4.9	3.1	1.5	0.1	Iris-setosa
10	5.4	3.7	1.5	0.2	Iris-setosa

time: 16.7 ms