conda install -c blaze blaze
For basic tabular queries, Blaze uses the same syntax as Pandas.
from blaze import Data, by, join, transform
bank = Data([[1, 'Alice', 100],
             [2, 'Bob', -200],
             [3, 'Charlie', 300],
             [4, 'Dennis', 400],
             [5, 'Edith', -500]], columns=['id', 'name', 'amount'])
bank.amount
| | amount |
---|---|
0 | 100 |
1 | -200 |
2 | 300 |
3 | 400 |
4 | -500 |
bank.amount / 100
| | amount |
---|---|
0 | 1 |
1 | -2 |
2 | 3 |
3 | 4 |
4 | -5 |
(bank.amount / 100).mean()
0.2
bank[['name', 'amount']].sort('amount')
We select subsets of data by indexing one expression with another, here a boolean condition on a column:
bank[bank.amount < 0]
| | id | name | amount |
---|---|---|---|
0 | 2 | Bob | -200 |
1 | 5 | Edith | -500 |
We can combine these sorts of operations with each other:
bank[bank.amount < 0].amount / 100
| | amount |
---|---|
0 | -2 |
1 | -5 |
bank[bank.amount < 0].name
| | name |
---|---|
0 | Bob |
1 | Edith |
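To make the semantics of these selections concrete, here is a plain-Python sketch of what the expressions compute. It is only an illustration over a hand-written list of dicts (the `rows` variable is an assumption of this sketch), not how Blaze evaluates anything.

```python
# Plain-Python sketch of the selection expressions; not Blaze's implementation.
rows = [
    {"id": 1, "name": "Alice", "amount": 100},
    {"id": 2, "name": "Bob", "amount": -200},
    {"id": 3, "name": "Charlie", "amount": 300},
    {"id": 4, "name": "Dennis", "amount": 400},
    {"id": 5, "name": "Edith", "amount": -500},
]

# bank[bank.amount < 0]: keep only the rows where the predicate holds
negative = [row for row in rows if row["amount"] < 0]

# bank[bank.amount < 0].name: project one column from the filtered rows
negative_names = [row["name"] for row in negative]

# bank[bank.amount < 0].amount / 100: elementwise arithmetic on that column
negative_hundreds = [row["amount"] / 100 for row in negative]

print(negative_names)     # ['Bob', 'Edith']
print(negative_hundreds)  # [-2.0, -5.0]
```

The filter runs first and the projection or arithmetic is applied to whatever rows survive, which is why the two kinds of operation compose freely.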
Write expressions to answer the following questions:
# What are the IDs of everyone with a positive amount?
# What is the name of the person with amount 400?
# What is the difference between the minimum and maximum amounts?
First, we need a more interesting dataset. We open the standard iris dataset, a table of 150 measurements of flowers in the iris genus. We find this dataset in a CSV file in the data/ directory.
iris = Data('data/iris.csv')
iris
| | sepal_length | sepal_width | petal_length | petal_width | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
The blaze.Data function operates on all of the file types that we saw in the previous sections on into. Blaze expressions use functions like discover to obtain a datashape, a description of the column names and types, which helps them interact with you.
iris.dshape
dshape("""var * {
  sepal_length: float64,
  sepal_width: float64,
  petal_length: float64,
  petal_width: float64,
  species: string
  }""")
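A datashape like the one above reads as "a variable number of rows (var), each a record with these named, typed fields". The toy function below illustrates that idea in plain Python. The name `sketch_dshape` and the `TYPE_NAMES` mapping are hypothetical and exist only for this sketch; the real discover function is considerably more thorough.

```python
# Hypothetical helper: a toy stand-in for the idea behind discover.
# It maps a few Python types to datashape-style type names.
TYPE_NAMES = {float: "float64", int: "int64", str: "string"}


def sketch_dshape(rows):
    """Describe a list of uniform row dicts in datashape-like notation.

    Simplification: only the first row is inspected.
    """
    first = rows[0]
    fields = ", ".join(
        f"{name}: {TYPE_NAMES[type(value)]}" for name, value in first.items()
    )
    # "var" stands for a number of rows that is not fixed in advance
    return "var * { " + fields + " }"


rows = [
    {"sepal_length": 5.1, "species": "Iris-setosa"},
    {"sepal_length": 4.9, "species": "Iris-setosa"},
]
print(sketch_dshape(rows))  # var * { sepal_length: float64, species: string }
```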
Now some more queries. Distinct finds unique entries:
iris.species.distinct()
| | species |
---|---|
0 | Iris-setosa |
1 | Iris-versicolor |
2 | Iris-virginica |
Or use nunique to count the number of distinct entries:
iris.species.nunique()
3
iris.sepal_length.nunique()
35
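As a rough plain-Python picture of these two operations: distinct keeps one copy of each value, and nunique counts how many such values there are. The `species` list below is a small hand-written sample, not the contents of the CSV file, and the first-seen ordering is a choice made for this sketch rather than something Blaze promises.

```python
# A small hand-written sample column (not the real iris file)
species = [
    "Iris-setosa",
    "Iris-versicolor",
    "Iris-setosa",
    "Iris-virginica",
    "Iris-versicolor",
]

# distinct: one copy of each value, kept here in first-seen order
distinct = list(dict.fromkeys(species))

# nunique: the number of distinct values
nunique = len(set(species))

print(distinct)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
print(nunique)   # 3
```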
Transform adds new columns computed from existing ones:
transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,
                petal_ratio=iris.petal_length / iris.petal_width)
| | sepal_length | sepal_width | petal_length | petal_width | species | sepal_ratio | petal_ratio |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 1.457143 | 7.000000 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 1.633333 | 7.000000 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 1.468750 | 6.500000 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 1.483871 | 7.500000 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 1.388889 | 7.000000 |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa | 1.384615 | 4.250000 |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa | 1.352941 | 4.666667 |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa | 1.470588 | 7.500000 |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa | 1.517241 | 7.000000 |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa | 1.580645 | 15.000000 |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa | ... | ... |
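Conceptually, transform returns a new table in which every row keeps its existing columns and gains the derived ones. A minimal plain-Python sketch of that behaviour follows; the helper name `add_ratios` is hypothetical, and the two rows are copied from the first two measurements shown earlier.

```python
# The first two measurements, written out by hand as dicts
rows = [
    {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2},
    {"sepal_length": 4.9, "sepal_width": 3.0, "petal_length": 1.4, "petal_width": 0.2},
]


def add_ratios(row):
    """Return a copy of the row with two derived columns added."""
    new_row = dict(row)  # copy, so the input row is left untouched
    new_row["sepal_ratio"] = row["sepal_length"] / row["sepal_width"]
    new_row["petal_ratio"] = row["petal_length"] / row["petal_width"]
    return new_row


transformed = [add_ratios(row) for row in rows]

print(round(transformed[0]["sepal_ratio"], 6))  # 1.457143
print(round(transformed[0]["petal_ratio"], 6))  # 7.0
```

The sketch copies each row instead of mutating it, mirroring the fact that the transform expression leaves iris itself unchanged.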
by
Split-apply-combine queries, also known as group-by queries, split the table into groups and then apply a reduction to each group. We express these queries in Blaze with the by operator:
by(column-on-which-to-split, result_name=reduction_on_group())
How many measurements do we have per species?
by(iris.species, count=iris.species.count())
| | species | count |
---|---|---|
0 | Iris-setosa | 50 |
1 | Iris-versicolor | 50 |
2 | Iris-virginica | 50 |
How many measurements do we have per species and what is the longest petal length per species?
by(iris.species, count=iris.species.count(),
                 longest_petal=iris.petal_length.max())
| | species | count | longest_petal |
---|---|---|---|
0 | Iris-setosa | 50 | 1.9 |
1 | Iris-versicolor | 50 | 5.1 |
2 | Iris-virginica | 50 | 6.9 |
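The split, apply, and combine steps behind a by query can be sketched in plain Python as follows. This is an illustration of the idea only, not Blaze's implementation, and `measurements` is a small hand-written sample, so the counts differ from the 50 per species in the full dataset.

```python
from collections import defaultdict

# (species, petal_length) pairs: a small hand-written sample, not the full file
measurements = [
    ("Iris-setosa", 1.4),
    ("Iris-setosa", 1.9),
    ("Iris-versicolor", 4.7),
    ("Iris-versicolor", 5.1),
    ("Iris-virginica", 6.9),
]

# Split: collect the petal lengths belonging to each species
groups = defaultdict(list)
for species, petal_length in measurements:
    groups[species].append(petal_length)

# Apply and combine: run each reduction per group, gather results in one table
result = {
    species: {"count": len(lengths), "longest_petal": max(lengths)}
    for species, lengths in groups.items()
}

for species, stats in result.items():
    print(species, stats["count"], stats["longest_petal"])
# Iris-setosa 2 1.9
# Iris-versicolor 2 5.1
# Iris-virginica 1 6.9
```

Each keyword argument passed to by corresponds to one entry in the per-group dict here: a result name paired with a reduction over that group's values.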
Write queries to answer the following questions:
# What are the longest and shortest sepal_lengths per species?
# What is the difference between the longest and shortest sepal length per species?
So far, everything we've seen is similar to solving problems in Pandas:
import pandas as pd
df = pd.read_csv('data/iris.csv')
df.groupby(df.species).sepal_length.min()
species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64
In fact, for small CSV files like this, Blaze uses Pandas, so one might consider just using Pandas directly.
Blaze becomes more useful when we interact with data stored in other systems, such as the SQL databases covered in the next section.