
Getting Started with Blaze

The full tutorial is available at http://github.com/ContinuumIO/blaze-tutorial
Install the software with conda install -c blaze blaze

1. Basic Queries

For basic tabular queries, Blaze shares the same syntax as Pandas.

In [1]:

from blaze import Data, by, join, transform

In [2]:

bank = Data([[1, 'Alice',   100],
             [2, 'Bob',    -200],
             [3, 'Charlie', 300],
             [4, 'Dennis',  400],
             [5, 'Edith',  -500]], columns=['id', 'name', 'amount'])

Arithmetic and Reductions

In [3]:

bank.amount

Out[3]:

In [4]:

bank.amount / 100

Out[4]:

In [5]:

(bank.amount / 100).mean()

Out[5]:

0.2

Multiple columns and sorting

In [ ]:

bank[['name', 'amount']].sort('amount')
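
Pandas users should note one difference in spelling: recent Pandas releases dropped DataFrame.sort in favour of sort_values. Below is a minimal sketch of the Pandas counterpart, assuming Pandas is installed. The name bank_df is illustrative and simply holds the same table rebuilt as a DataFrame.

```python
import pandas as pd

# Rebuild the bank table from above as a plain Pandas DataFrame
# (bank_df is an illustrative name, not part of the tutorial).
bank_df = pd.DataFrame(
    [[1, 'Alice', 100],
     [2, 'Bob', -200],
     [3, 'Charlie', 300],
     [4, 'Dennis', 400],
     [5, 'Edith', -500]],
    columns=['id', 'name', 'amount'])

# Keep two columns, then order the rows by ascending amount.
ordered = bank_df[['name', 'amount']].sort_values('amount')
print(ordered)
```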

Selections

We select a subset of rows by indexing one expression with another, boolean-valued expression:

In [6]:

bank[bank.amount < 0]

Out[6]:

   id   name  amount
0   2    Bob    -200
1   5  Edith    -500

Combining Operations

These operations compose: we can select rows and then apply arithmetic to a column or pick a single column from the result.

In [7]:

bank[bank.amount < 0].amount / 100

Out[7]:

   amount
0      -2
1      -5

In [8]:

bank[bank.amount < 0].name

Out[8]:

    name
0    Bob
1  Edith

Exercises

Write expressions to answer the following questions:

In [9]:

# What are the IDs of everyone with a positive amount?

In [10]:

# What is the name of the person with amount 400?

In [11]:

# What is the difference between the minimum and maximum amounts?

2. More complex queries

First, we need a more interesting dataset. We open the standard iris dataset, a table of 150 measurements of flowers in the iris genus. The dataset is stored as a CSV file in the data/ directory.

In [12]:

iris = Data('data/iris.csv')
iris

Out[12]:

    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa

The blaze.Data function accepts all of the file types that we saw in the previous sections on into. Blaze expressions use functions like discover to infer a datashape, which records the column names and types of the underlying data.

In [13]:

iris.dshape

Out[13]:

dshape("""var * {
  sepal_length: float64,
  sepal_width: float64,
  petal_length: float64,
  petal_width: float64,
  species: string
  }""")

Distinct

Now some more queries. Distinct finds the unique entries of a column:

In [14]:

iris.species.distinct()

Out[14]:

           species
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica
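
For comparison, here is a minimal Pandas sketch of the same idea, assuming Pandas is installed. drop_duplicates keeps the first occurrence of each value, and the number of rows left over is the distinct count. The six-row sample is made up for illustration and is not the real iris data.

```python
import pandas as pd

# Hypothetical six-row sample; the real iris table has 150 rows.
species = pd.Series(
    ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
     'Iris-virginica', 'Iris-versicolor', 'Iris-virginica'],
    name='species')

# Distinct: keep only the first occurrence of each value.
unique_species = species.drop_duplicates()
print(unique_species.tolist())

# Counting the distinct values gives the same number as Pandas' nunique.
n_unique = len(unique_species)
print(n_unique, species.nunique())
```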

We can also count the number of distinct entries with nunique:

In [15]:

iris.species.nunique()

Out[15]:

3

In [16]:

iris.sepal_length.nunique()

Out[16]:

Transform

Transform adds new columns computed from existing ones:

In [17]:

transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width,
                petal_ratio=iris.petal_length / iris.petal_width)

Out[17]:

    sepal_length  sepal_width  petal_length  petal_width      species  \
0            5.1          3.5           1.4          0.2  Iris-setosa   
1            4.9          3.0           1.4          0.2  Iris-setosa   
2            4.7          3.2           1.3          0.2  Iris-setosa   
3            4.6          3.1           1.5          0.2  Iris-setosa   
4            5.0          3.6           1.4          0.2  Iris-setosa   
5            5.4          3.9           1.7          0.4  Iris-setosa   
6            4.6          3.4           1.4          0.3  Iris-setosa   
7            5.0          3.4           1.5          0.2  Iris-setosa   
8            4.4          2.9           1.4          0.2  Iris-setosa   
9            4.9          3.1           1.5          0.1  Iris-setosa   
10           5.4          3.7           1.5          0.2  Iris-setosa   

    sepal_ratio  petal_ratio  
0      1.457143     7.000000  
1      1.633333     7.000000  
2      1.468750     6.500000  
3      1.483871     7.500000  
4      1.388889     7.000000  
5      1.384615     4.250000  
6      1.352941     4.666667  
7      1.470588     7.500000  
8      1.517241     7.000000  
9      1.580645    15.000000  
...
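
The closest Pandas counterpart is probably DataFrame.assign, which likewise returns a new table and leaves the original untouched. The sketch below assumes Pandas is installed and uses only the first three iris rows shown above; iris_df and with_ratios are illustrative names.

```python
import pandas as pd

# First three rows of the iris table shown above.
iris_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 4.7],
    'sepal_width':  [3.5, 3.0, 3.2],
    'petal_length': [1.4, 1.4, 1.3],
    'petal_width':  [0.2, 0.2, 0.2],
})

# assign returns a copy with the extra columns; iris_df itself is unchanged.
with_ratios = iris_df.assign(
    sepal_ratio=iris_df.sepal_length / iris_df.sepal_width,
    petal_ratio=iris_df.petal_length / iris_df.petal_width)

print(with_ratios)
```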

Split-apply-combine -- `by`

Split-apply-combine queries, also known as group-by queries, split the table into groups and then apply a reduction to each group. We express these queries in Blaze with the by operator:

by(column-on-which-to-split, result_name=reduction_on_group())

How many measurements do we have per species?

In [18]:

by(iris.species, count=iris.species.count())

Out[18]:

           species  count
0      Iris-setosa     50
1  Iris-versicolor     50
2   Iris-virginica     50

How many measurements do we have per species and what is the longest petal length per species?

In [19]:

by(iris.species, count=iris.species.count(), 
                 longest_petal=iris.petal_length.max())

Out[19]:

           species  count  longest_petal
0      Iris-setosa     50            1.9
1  Iris-versicolor     50            5.1
2   Iris-virginica     50            6.9
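
To make the split, apply and combine phases concrete, here is a minimal pure-Python sketch of what such a query computes. It illustrates the idea only and says nothing about how Blaze implements by internally. The rows are made up and are not the real iris data.

```python
from collections import defaultdict

# Hypothetical (species, petal_length) rows, not the real iris data.
rows = [
    ('Iris-setosa', 1.4),
    ('Iris-setosa', 1.9),
    ('Iris-versicolor', 4.7),
    ('Iris-versicolor', 5.1),
    ('Iris-virginica', 6.9),
]

# Split: bucket the petal lengths by species.
groups = defaultdict(list)
for species, petal_length in rows:
    groups[species].append(petal_length)

# Apply and combine: reduce each bucket, then collect one result per group.
summary = {
    species: {'count': len(lengths), 'longest_petal': max(lengths)}
    for species, lengths in groups.items()
}

for species, stats in summary.items():
    print(species, stats)
```

Each keyword argument passed to by plays the role of one entry in the per-group dictionary here: a named reduction evaluated once per group.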

Exercise

Write queries to answer the following questions:

In [20]:

# What are the longest and shortest sepal_lengths per species?

In [21]:

# What is the difference between the longest and shortest sepal length per species?

This is similar to how we solve these problems in Pandas

So far, every query we've written looks much like its Pandas equivalent:

In [22]:

import pandas as pd
df = pd.read_csv('data/iris.csv')
df.groupby(df.species).sepal_length.min()

Out[22]:

species
Iris-setosa        4.3
Iris-versicolor    4.9
Iris-virginica     4.9
Name: sepal_length, dtype: float64

In fact, for small CSV files like this, Blaze uses Pandas, so one might consider just using Pandas directly.

Blaze becomes more useful when we interact with data stored in other systems, such as the SQL databases covered in the next section.