Blaze provides a lightweight interface on top of pre-existing computational infrastructure. This notebook gives a quick overview of how Blaze interacts with a variety of data types.
%reload_ext autotime
from blaze import Data, by, compute
Blaze interacts with normal Python objects. Operations on Blaze Data
objects create expression trees.
These expressions deliver an intuitive numpy/pandas-like feel.
Starting small, Blaze interacts happily with collections of data.
It uses Pandas for pretty notebook printing.
x = Data([1, 2, 3, 4, 5])
x
| | None |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
time: 7 ms
x[x > 2] * 10
| | None |
---|---|
0 | 30 |
1 | 40 |
2 | 50 |
time: 18.5 ms
x.dshape
dshape("5 * int64")
time: 1.49 ms
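To make the idea of an expression tree concrete, here is a toy sketch in plain Python. It is not Blaze's implementation: the class names (Expr, Symbol, BinOp, Filter) and the evaluate function are hypothetical and exist only to show how an expression like x[x > 2] * 10 can be recorded first and computed later.

```python
import operator


class Expr:
    """Base class: operators build new tree nodes instead of computing."""

    def __gt__(self, other):
        return BinOp(operator.gt, self, other)

    def __mul__(self, other):
        return BinOp(operator.mul, self, other)

    def __getitem__(self, predicate):
        return Filter(self, predicate)


class Symbol(Expr):
    """Leaf node standing in for a named data source."""

    def __init__(self, name):
        self.name = name


class BinOp(Expr):
    """Elementwise operation between an expression and a scalar."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right


class Filter(Expr):
    """Keep the elements of child where predicate is true."""

    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate


def evaluate(expr, data):
    """Walk the tree and compute the result with plain Python lists."""
    if isinstance(expr, Symbol):
        return list(data[expr.name])
    if isinstance(expr, BinOp):
        return [expr.op(v, expr.right) for v in evaluate(expr.left, data)]
    if isinstance(expr, Filter):
        values = evaluate(expr.child, data)
        mask = evaluate(expr.predicate, data)
        return [v for v, keep in zip(values, mask) if keep]
    raise TypeError(f"unknown node: {expr!r}")


x = Symbol("x")
expr = x[x > 2] * 10  # builds a tree; nothing is computed yet
result = evaluate(expr, {"x": [1, 2, 3, 4, 5]})
print(result)
```

Separating the description of the work (the tree) from the code that performs it (evaluate) is what lets a single expression be handed to different backends.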
Slightly more exciting: Blaze also operates on tabular data.
L = [[1, 'Alice', 100],
     [2, 'Bob', -200],
     [3, 'Charlie', 300],
     [4, 'Dennis', 400],
     [5, 'Edith', -500]]
time: 1.75 ms
x = Data(L, fields=['id', 'name', 'amount'])
time: 1.93 ms
x.amount.mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-f4afcda15db4> in <module>()
----> 1 x.amount.mean()

/Users/pcloud/Documents/code/py/blaze/blaze/expr/expressions.py in __getattr__(self, key)
    169             pass
    170         try:
--> 171             result = object.__getattribute__(self, key)
    172         except AttributeError:
    173             fields = dict(zip(map(valid_identifier, self.fields),

AttributeError: 'InteractiveSymbol' object has no attribute 'amount'
time: 194 ms
x.dshape
dshape("5 * {id: int64, name: string, amount: int64}")
time: 1.84 ms
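The dshape string reads as "number of rows * record type". As a rough illustration of that notation only, the sketch below builds a similar string from a list of rows. The describe helper and its type-name mapping are hypothetical; this is not how Blaze or datashape discovers types.

```python
def describe(rows, fields):
    """Build a datashape-like string from the first row's Python types."""
    # Assumed mapping from Python types to the names used in the notation.
    type_names = {int: "int64", float: "float64", str: "string"}
    columns = ", ".join(
        f"{name}: {type_names[type(value)]}"
        for name, value in zip(fields, rows[0])
    )
    return f"{len(rows)} * {{{columns}}}"


rows = [[1, 'Alice', 100],
        [2, 'Bob', -200],
        [3, 'Charlie', 300],
        [4, 'Dennis', 400],
        [5, 'Edith', -500]]
shape = describe(rows, ['id', 'name', 'amount'])
print(shape)
```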
x again
x
| | id | name | amount |
---|---|---|---|
0 | 1 | Alice | 100 |
1 | 2 | Bob | -200 |
2 | 3 | Charlie | 300 |
3 | 4 | Dennis | 400 |
4 | 5 | Edith | -500 |
time: 9.45 ms
deadbeats = x[x.amount < 0].name
deadbeats
| | name |
---|---|
0 | Bob |
1 | Edith |
time: 12.1 ms
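For comparison, the same selection can be written by hand in plain Python. This is a minimal sketch of the kind of loop a pure-Python backend would need to run, not code that Blaze generates; the names rows, fields, and records are illustrative.

```python
rows = [[1, 'Alice', 100],
        [2, 'Bob', -200],
        [3, 'Charlie', 300],
        [4, 'Dennis', 400],
        [5, 'Edith', -500]]
fields = ['id', 'name', 'amount']

# Pair each value with its field name so columns can be addressed by name.
records = [dict(zip(fields, row)) for row in rows]

# Equivalent of x[x.amount < 0].name: filter rows, then project one column.
deadbeats = [r['name'] for r in records if r['amount'] < 0]
print(deadbeats)
```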
Blaze doesn't do work itself; it tells other systems to do the work.
In the previous example, Blaze told Python which for-loops to write. In this example, it calls the right functions in Pandas.
The user experience is mostly identical; only performance differs.
from pandas import DataFrame
df = DataFrame([[1, 'Alice', 100],
                [2, 'Bob', -200],
                [3, 'Charlie', 300],
                [4, 'Denis', 400],
                [5, 'Edith', -500]], columns=['id', 'name', 'amount'])
time: 2.52 ms
df
| | id | name | amount |
---|---|---|---|
0 | 1 | Alice | 100 |
1 | 2 | Bob | -200 |
2 | 3 | Charlie | 300 |
3 | 4 | Denis | 400 |
4 | 5 | Edith | -500 |
time: 4.79 ms
x = Data(df)
x
| | id | name | amount |
---|---|---|---|
0 | 1 | Alice | 100 |
1 | 2 | Bob | -200 |
2 | 3 | Charlie | 300 |
3 | 4 | Denis | 400 |
4 | 5 | Edith | -500 |
time: 10.8 ms
deadbeats = x[x.amount < 0].name
deadbeats
| | name |
---|---|
1 | Bob |
4 | Edith |
time: 19.6 ms
type(deadbeats)
blaze.expr.expressions.Field
time: 1.52 ms
compute turns Blaze expressions into something concrete
compute(deadbeats)
1      Bob
4    Edith
Name: name, dtype: object
time: 4.96 ms
type(compute(deadbeats))
pandas.core.series.Series
time: 3.33 ms
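One way to picture why the result type follows the data is dispatch on the type of the underlying object. The sketch below uses functools.singledispatch from the standard library; negative_names and SqlTable are hypothetical names, and this is only an analogy for the behavior shown above, not Blaze's actual mechanism.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class SqlTable:
    """Hypothetical stand-in for a table that lives in a database."""
    name: str


@singledispatch
def negative_names(data):
    """Names of rows with a negative amount, in the backend's own terms."""
    raise NotImplementedError(f"no backend for {type(data).__name__}")


@negative_names.register(list)
def _(rows):
    # In-memory records: do the work immediately with a comprehension.
    return [r["name"] for r in rows if r["amount"] < 0]


@negative_names.register(SqlTable)
def _(table):
    # Database table: return a query for the database to run later.
    t = table.name
    return f"SELECT {t}.name FROM {t} WHERE {t}.amount < 0"


records = [{"name": "Bob", "amount": -200}, {"name": "Alice", "amount": 100}]
in_memory = negative_names(records)
as_sql = negative_names(SqlTable("bank"))
print(in_memory)
print(as_sql)
```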
Tables
Blaze extends beyond just Python and Pandas (that's the main motivation).
Here it drives SQLAlchemy.
from sqlalchemy import Table, Column, MetaData, Integer, String, create_engine
tab = Table('bank', MetaData(),
            Column('id', Integer),
            Column('name', String),
            Column('amount', Integer))
time: 1.97 ms
x = Data(tab)
x.dshape
dshape("var * {id: ?int32, name: ?string, amount: ?int32}")
time: 2.62 ms
Just as computations on pandas objects produce pandas objects, computations on SQLAlchemy tables produce SQLAlchemy Select statements.
deadbeats = x[x.amount < 0].name
compute(deadbeats)
<sqlalchemy.sql.selectable.Select at 0x11767b1d0; Select object>
time: 7.63 ms
print(compute(deadbeats)) # SQLAlchemy generates SQL
SELECT bank.name
FROM bank
WHERE bank.amount < :amount_1
time: 3.31 ms
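The :amount_1 in the generated SQL is a bound parameter whose value is supplied when the query runs. As a hedged illustration of what a database does with such a statement, the sketch below runs the same text against an in-memory SQLite table using only the standard sqlite3 module. The table contents mirror the earlier example, and binding the value 0 is an assumption based on the x.amount < 0 comparison.

```python
import sqlite3

# Throwaway in-memory database holding the same five rows as before.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (id INTEGER, name TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO bank VALUES (?, ?, ?)",
    [(1, "Alice", 100), (2, "Bob", -200), (3, "Charlie", 300),
     (4, "Dennis", 400), (5, "Edith", -500)],
)

# Same statement text as above; the named parameter is filled in at run time.
query = "SELECT bank.name FROM bank WHERE bank.amount < :amount_1"
names = [row[0] for row in conn.execute(query, {"amount_1": 0})]
conn.close()
print(names)
```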
When we drive a SQLAlchemy table that is connected to a database, we get actual computation.
engine = create_engine('sqlite:///../blaze/blaze/examples/data/iris.db')
time: 10 ms
x = Data(engine)
x
time: 8.96 ms
x.fields
['iris']
time: 1.2 ms
x.iris.sepal_length.mean()
time: 10.6 ms
by(
    x.iris.species,
    shortest=x.iris.sepal_length.min(),
    longest=x.iris.sepal_length.max()
)
| | species | longest | shortest |
---|---|---|---|
0 | Iris-setosa | 5.8 | 4.3 |
1 | Iris-versicolor | 7.0 | 4.9 |
2 | Iris-virginica | 7.9 | 4.9 |
time: 51 ms
print(compute(_))
SELECT iris.species, max(iris.sepal_length) AS longest, min(iris.sepal_length) AS shortest
FROM iris
GROUP BY iris.species
time: 8.3 ms
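The grouped reduction can also be spelled out in plain Python to show what is being computed: split the rows by key, then reduce each group. This is a minimal sketch; the handful of (species, sepal_length) pairs are illustrative sample values, not the full iris table, so the numbers differ from the output above.

```python
from collections import defaultdict

# Illustrative sample rows only; the real table has many more.
samples = [
    ("Iris-setosa", 5.1),
    ("Iris-setosa", 4.9),
    ("Iris-versicolor", 7.0),
    ("Iris-versicolor", 6.4),
    ("Iris-virginica", 6.3),
]

# Split: collect the sepal lengths for each species.
groups = defaultdict(list)
for species, sepal_length in samples:
    groups[species].append(sepal_length)

# Apply and combine: one min/max pair per group, like GROUP BY with min/max.
summary = {
    species: {"shortest": min(lengths), "longest": max(lengths)}
    for species, lengths in groups.items()
}
print(summary)
```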
Often, just figuring out how to produce the relevant Python object can be a challenge.
Blaze accepts URI strings in many formats. In the one below, the part after :: names a table within the database.
x = Data('sqlite:///../blaze/blaze/examples/data/iris.db::iris')
time: 7.4 ms
x
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
time: 16.7 ms
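The URI used above has two parts separated by a double colon: a database location and a table name. The sketch below only illustrates that layout with a hypothetical split_uri helper; it is not Blaze's own URI parser, which handles many more formats.

```python
def split_uri(uri):
    """Split 'database-uri::table' into (database, table).

    The table part is None when the URI has no '::' separator.
    """
    database, separator, table = uri.partition("::")
    return database, (table if separator else None)


database, table = split_uri(
    'sqlite:///../blaze/blaze/examples/data/iris.db::iris'
)
print(database)
print(table)
```

Without the ::iris suffix, the same string would refer to the whole database rather than one table, which matches the earlier cell where x.fields listed the tables it contains.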