Where we store data and how we interact with it are unfortunately related.
For example, we often access SQL databases using SQL query strings
import sqlalchemy
engine = sqlalchemy.create_engine('sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db')
engine
Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
conn = engine.connect()
list(conn.execute('''SELECT petal_length, petal_width, sepal_length, sepal_width, species
FROM iris
LIMIT 10'''))
[(1.4, 0.2, 5.1, 3.5, u'Iris-setosa'), (1.4, 0.2, 4.9, 3.0, u'Iris-setosa'), (1.3, 0.2, 4.7, 3.2, u'Iris-setosa'), (1.5, 0.2, 4.6, 3.1, u'Iris-setosa'), (1.4, 0.2, 5.0, 3.6, u'Iris-setosa'), (1.7, 0.4, 5.4, 3.9, u'Iris-setosa'), (1.4, 0.3, 4.6, 3.4, u'Iris-setosa'), (1.5, 0.2, 5.0, 3.4, u'Iris-setosa'), (1.4, 0.2, 4.4, 2.9, u'Iris-setosa'), (1.5, 0.1, 4.9, 3.1, u'Iris-setosa')]
list(conn.execute('''SELECT avg(petal_length), max(petal_width), species
FROM iris
GROUP BY species'''))
[(1.4620000000000002, 0.6, u'Iris-setosa'), (4.26, 1.8, u'Iris-versicolor'), (5.552, 2.5, u'Iris-virginica')]
Many Python users prefer the interactive Pandas DataFrame. We can use Pandas on this data by copying the entire table (actually quite small in this case) into memory and manipulate it directly with Pandas syntax and algorithms.
import pandas as pd
df = pd.read_sql('SELECT * FROM iris', engine)
df
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
11 | 4.8 | 3.4 | 1.6 | 0.2 | Iris-setosa |
12 | 4.8 | 3.0 | 1.4 | 0.1 | Iris-setosa |
13 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
17 | 5.1 | 3.5 | 1.4 | 0.3 | Iris-setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
19 | 5.1 | 3.8 | 1.5 | 0.3 | Iris-setosa |
20 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |
21 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
23 | 5.1 | 3.3 | 1.7 | 0.5 | Iris-setosa |
24 | 4.8 | 3.4 | 1.9 | 0.2 | Iris-setosa |
25 | 5.0 | 3.0 | 1.6 | 0.2 | Iris-setosa |
26 | 5.0 | 3.4 | 1.6 | 0.4 | Iris-setosa |
27 | 5.2 | 3.5 | 1.5 | 0.2 | Iris-setosa |
28 | 5.2 | 3.4 | 1.4 | 0.2 | Iris-setosa |
29 | 4.7 | 3.2 | 1.6 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
120 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
121 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
123 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
124 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
125 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
126 | 6.2 | 2.8 | 4.8 | 1.8 | Iris-virginica |
127 | 6.1 | 3.0 | 4.9 | 1.8 | Iris-virginica |
128 | 6.4 | 2.8 | 5.6 | 2.1 | Iris-virginica |
129 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
130 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
132 | 6.4 | 2.8 | 5.6 | 2.2 | Iris-virginica |
133 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
134 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
136 | 6.3 | 3.4 | 5.6 | 2.4 | Iris-virginica |
137 | 6.4 | 3.1 | 5.5 | 1.8 | Iris-virginica |
138 | 6.0 | 3.0 | 4.8 | 1.8 | Iris-virginica |
139 | 6.9 | 3.1 | 5.4 | 2.1 | Iris-virginica |
140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 5 columns
Pandas provides both a great user experience and fast in-memory algorithms. When those algorithms become obsolete (e.g. when datasets grow large) then we're forced to throw away the great user experience and switch back to using SQL.
Here we connect to Hive, a database backed by Hadoop MapReduce.
engine = sqlalchemy.create_engine('hive://hdfs@54.91.57.226:10000/default')
conn = engine.connect()
list(conn.execute('''SELECT *
FROM iris
LIMIT 10''')) # Imagine that this was big
[(5.099999904632568, 3.5, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'), (4.900000095367432, 3.0, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'), (4.699999809265137, 3.200000047683716, 1.2999999523162842, 0.20000000298023224, u'Iris-setosa'), (4.599999904632568, 3.0999999046325684, 1.5, 0.20000000298023224, u'Iris-setosa'), (5.0, 3.5999999046325684, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'), (5.400000095367432, 3.9000000953674316, 1.7000000476837158, 0.4000000059604645, u'Iris-setosa'), (4.599999904632568, 3.4000000953674316, 1.399999976158142, 0.30000001192092896, u'Iris-setosa'), (5.0, 3.4000000953674316, 1.5, 0.20000000298023224, u'Iris-setosa'), (4.400000095367432, 2.9000000953674316, 1.399999976158142, 0.20000000298023224, u'Iris-setosa'), (4.900000095367432, 3.0999999046325684, 1.5, 0.10000000149011612, u'Iris-setosa')]
We get the best of both worlds
from blaze import Data, by
d = Data('hive://hdfs@54.91.57.226:10000/default')
d.iris
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
by(d.iris.species, largest=d.iris.sepal_length.max(),
smallest=d.iris.sepal_length.min())
species | largest | smallest | |
---|---|---|---|
0 | Iris-setosa | 5.8 | 4.3 |
1 | Iris-versicolor | 7.0 | 4.9 |
2 | Iris-virginica | 7.9 | 4.9 |
Here we use the internal API to show the translated Blaze query that Blaze sends to the Hive database.
from blaze import compute
query = by(d.iris.species, largest=d.iris.sepal_length.max(),
smallest=d.iris.sepal_length.min())
print compute(query)
SELECT iris.species, max(iris.sepal_length) AS largest, min(iris.sepal_length) AS smallest FROM iris GROUP BY iris.species