<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">
In this tutorial we'll learn how to use Blaze to discover, migrate, and query data living in other databases. Generally this tutorial will have the following format:

- `odo`: Move data to database
- `blaze`: Query data in database

This tutorial uses many different libraries that are all available with the Anaconda Distribution. Once you have Anaconda installed, please run these commands from a terminal:
$ conda install -y blaze
$ conda install -y bokeh
$ conda install -y odo
nbviewer: http://nbviewer.ipython.org/github/ContinuumIO/pydata-apps/blob/master/Section-1_blaze.ipynb
github: https://github.com/ContinuumIO/pydata-apps
NumPy and Pandas provide accessible, interactive, analytic queries; this is valuable.
import pandas as pd
df = pd.read_csv('iris.csv')
df.head()
| | Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
df.groupby(df.Species).PetalLength.mean() # Average petal length per species
Species
setosa        1.462
versicolor    4.260
virginica     5.552
Name: PetalLength, dtype: float64
But as data grows and systems become more complex, moving and querying data become more difficult. Python already has excellent tools for data that fits in memory, but we want to hook up to data that is inconvenient: too large for memory, or living in an external system.
From now on, we're going to assume we're in one of those situations: our data no longer fits comfortably in memory, or it lives in a database.
When in-memory arrays/dataframes cease to be an option, we turn to databases. These live outside of the Python process and so might be less convenient. The open source Python ecosystem includes libraries to interact with these databases and with foreign data in general.
Examples:

- `sqlalchemy`
- `pyhive`
- `impyla`
- `redshift-sqlalchemy`
- `pymongo`
- `happybase`
- `pyspark`
- `paramiko`
- `pywebhdfs`
- `boto`
Today we're going to use some of these indirectly with `odo` (formerly `into`) and Blaze. We'll try to point out these libraries as we automate them so that, if you'd like, you can use them independently.
<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">
## odo (formerly `into`)

Odo migrates data between formats and locations.
Before we can use a database we need to move data into it. The `odo` project provides a single consistent interface to move data between formats and between locations.
We'll start with local data and eventually move out to remote data.
Odo moves data into a target from a source:
>>> odo(source, target)
The target and source can be either a Python object or a string URI. The following are all valid calls to `odo`:
>>> odo('iris.csv', pd.DataFrame) # Load CSV file into new DataFrame
>>> odo(my_df, 'iris.json') # Write DataFrame into JSON file
>>> odo('iris.csv', 'iris.json') # Migrate data from CSV to JSON
Use `odo` to load the `iris.csv` file into a Python `list`, a `np.ndarray`, and a `pd.DataFrame`.
from odo import odo
import numpy as np
import pandas as pd
odo("iris.csv", pd.DataFrame)
| | Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
9 | 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
10 | 11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
11 | 12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
12 | 13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
14 | 15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
15 | 16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
16 | 17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
17 | 18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
18 | 19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
19 | 20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
20 | 21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
21 | 22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
23 | 24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
24 | 25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
25 | 26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
26 | 27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
27 | 28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
28 | 29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
29 | 30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
... | ... | ... | ... | ... | ... | ... |
120 | 121 | 6.9 | 3.2 | 5.7 | 2.3 | virginica |
121 | 122 | 5.6 | 2.8 | 4.9 | 2.0 | virginica |
122 | 123 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
123 | 124 | 6.3 | 2.7 | 4.9 | 1.8 | virginica |
124 | 125 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
125 | 126 | 7.2 | 3.2 | 6.0 | 1.8 | virginica |
126 | 127 | 6.2 | 2.8 | 4.8 | 1.8 | virginica |
127 | 128 | 6.1 | 3.0 | 4.9 | 1.8 | virginica |
128 | 129 | 6.4 | 2.8 | 5.6 | 2.1 | virginica |
129 | 130 | 7.2 | 3.0 | 5.8 | 1.6 | virginica |
130 | 131 | 7.4 | 2.8 | 6.1 | 1.9 | virginica |
131 | 132 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
132 | 133 | 6.4 | 2.8 | 5.6 | 2.2 | virginica |
133 | 134 | 6.3 | 2.8 | 5.1 | 1.5 | virginica |
134 | 135 | 6.1 | 2.6 | 5.6 | 1.4 | virginica |
135 | 136 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
136 | 137 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
137 | 138 | 6.4 | 3.1 | 5.5 | 1.8 | virginica |
138 | 139 | 6.0 | 3.0 | 4.8 | 1.8 | virginica |
139 | 140 | 6.9 | 3.1 | 5.4 | 2.1 | virginica |
140 | 141 | 6.7 | 3.1 | 5.6 | 2.4 | virginica |
141 | 142 | 6.9 | 3.1 | 5.1 | 2.3 | virginica |
142 | 143 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
143 | 144 | 6.8 | 3.2 | 5.9 | 2.3 | virginica |
144 | 145 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 6 columns
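The exercise also asks for a Python `list` and a `np.ndarray`. Since `odo` dispatches on the target type, the calls look exactly the same; a minimal sketch, reusing the imports above:

odo("iris.csv", list)        # list of row tuples
odo("iris.csv", np.ndarray)  # NumPy record array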
Odo refers to foreign data either with a Python object, like a `sqlalchemy.Table` object for a SQL table, or with a string URI, like `postgresql://hostname::tablename`.
URIs often take the following form:
protocol://path-to-resource::path-within-resource
Where `path-to-resource` might point to a file, a database hostname, etc., while `path-within-resource` might refer to a datapath or table name. Note the two main separators:

- `://` separates the protocol on the left (`sqlite`, `mongodb`, `ssh`, `hdfs`, `hive`, ...)
- `::` separates the path within the database on the right (e.g. a table name)

Here are some example URIs:
- `myfile.json`
- `myfiles.*.csv`
- `postgresql://hostname::tablename`
- `mongodb://hostname/db::collection`
- `ssh://user@host:/path/to/myfile.csv`
- `hdfs://user@host:/path/to/*.csv`
Migrate your CSV file into a table named `iris` in a new SQLite database at `sqlite:///my.db`. Remember to use the `::` separator to separate your database name from your table name.
odo("iris.csv", "sqlite:///my.db::iris")
Table('iris', MetaData(bind=Engine(sqlite:///my.db)), Column('Unnamed: 0', BigInteger(), table=<iris>, nullable=False), Column('SepalLength', Float(precision=53), table=<iris>), Column('SepalWidth', Float(precision=53), table=<iris>), Column('PetalLength', Float(precision=53), table=<iris>), Column('PetalWidth', Float(precision=53), table=<iris>), Column('Species', Text(), table=<iris>), schema=None)
What kind of object did you receive as output? Call `type` on your result.
type(_)
sqlalchemy.sql.schema.Table
Odo is a network of fast pairwise conversions between pairs of formats. When we migrate between two formats we traverse a path of pairwise conversions.
We visualize that network below:
Each node represents a data format. Each directed edge represents a function to transform data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. Red nodes support larger-than-memory data.
A single call to `into` may traverse several intermediate formats, calling on several conversion functions. For example, when we migrate a CSV file to a Mongo database we might take the following route:

- `DataFrame` (`pandas.read_csv`)
- `np.recarray` (`DataFrame.to_records`)
- `Iterator` (`np.ndarray.tolist`)
- Mongo (`pymongo.Collection.insert`)

Alternatively we could write a special function that uses MongoDB's native CSV loader and shortcut this entire process with a direct edge `CSV -> Mongo`.
These functions are chosen because they are fast, often far faster than converting through a central serialization format.
This picture is actually from an older version of `odo`, when the graph was still small enough to visualize pleasantly. See the odo docs for a more up-to-date version.
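To make the path-finding idea concrete, here is a toy sketch of the same pattern: a directed graph of pairwise converters and a breadth-first search for a conversion route. This is an illustration of the concept only, not odo's actual implementation:

from collections import deque

# Toy graph: each directed edge stands in for one pairwise conversion function.
edges = {
    'CSV': ['DataFrame'],
    'DataFrame': ['np.recarray', 'CSV'],
    'np.recarray': ['Iterator'],
    'Iterator': ['Mongo'],
}

def conversion_path(source, target):
    # Breadth-first search finds a shortest chain of pairwise conversions.
    queue, seen = deque([[source]]), {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

print(conversion_path('CSV', 'Mongo'))
# ['CSV', 'DataFrame', 'np.recarray', 'Iterator', 'Mongo']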
We can interact with remote data in three locations: `ssh`, `hdfs`, and `s3`. For most of these we'll wait until we've seen Blaze; briefly, we'll use S3.

For now, we quickly grab a file from Amazon's S3.
This example depends on `boto` to interact with S3.

conda install boto
odo('s3://nyqpug/tips.csv', pd.DataFrame)
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
9 | 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |
10 | 10.27 | 1.71 | Male | No | Sun | Dinner | 2 |
11 | 35.26 | 5.00 | Female | No | Sun | Dinner | 4 |
12 | 15.42 | 1.57 | Male | No | Sun | Dinner | 2 |
13 | 18.43 | 3.00 | Male | No | Sun | Dinner | 4 |
14 | 14.83 | 3.02 | Female | No | Sun | Dinner | 2 |
15 | 21.58 | 3.92 | Male | No | Sun | Dinner | 2 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
17 | 16.29 | 3.71 | Male | No | Sun | Dinner | 3 |
18 | 16.97 | 3.50 | Female | No | Sun | Dinner | 3 |
19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 |
20 | 17.92 | 4.08 | Male | No | Sat | Dinner | 2 |
21 | 20.29 | 2.75 | Female | No | Sat | Dinner | 2 |
22 | 15.77 | 2.23 | Female | No | Sat | Dinner | 2 |
23 | 39.42 | 7.58 | Male | No | Sat | Dinner | 4 |
24 | 19.82 | 3.18 | Male | No | Sat | Dinner | 2 |
25 | 17.81 | 2.34 | Male | No | Sat | Dinner | 4 |
26 | 13.37 | 2.00 | Male | No | Sat | Dinner | 2 |
27 | 12.69 | 2.00 | Male | No | Sat | Dinner | 2 |
28 | 21.70 | 4.30 | Male | No | Sat | Dinner | 2 |
29 | 19.65 | 3.00 | Female | No | Sat | Dinner | 2 |
... | ... | ... | ... | ... | ... | ... | ... |
214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 |
215 | 12.90 | 1.10 | Female | Yes | Sat | Dinner | 2 |
216 | 28.15 | 3.00 | Male | Yes | Sat | Dinner | 5 |
217 | 11.59 | 1.50 | Male | Yes | Sat | Dinner | 2 |
218 | 7.74 | 1.44 | Male | Yes | Sat | Dinner | 2 |
219 | 30.14 | 3.09 | Female | Yes | Sat | Dinner | 4 |
220 | 12.16 | 2.20 | Male | Yes | Fri | Lunch | 2 |
221 | 13.42 | 3.48 | Female | Yes | Fri | Lunch | 2 |
222 | 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
223 | 15.98 | 3.00 | Female | No | Fri | Lunch | 3 |
224 | 13.42 | 1.58 | Male | Yes | Fri | Lunch | 2 |
225 | 16.27 | 2.50 | Female | Yes | Fri | Lunch | 2 |
226 | 10.09 | 2.00 | Female | Yes | Fri | Lunch | 2 |
227 | 20.45 | 3.00 | Male | No | Sat | Dinner | 4 |
228 | 13.28 | 2.72 | Male | No | Sat | Dinner | 2 |
229 | 22.12 | 2.88 | Female | Yes | Sat | Dinner | 2 |
230 | 24.01 | 2.00 | Male | Yes | Sat | Dinner | 4 |
231 | 15.69 | 3.00 | Male | Yes | Sat | Dinner | 3 |
232 | 11.61 | 3.39 | Male | No | Sat | Dinner | 2 |
233 | 10.77 | 1.47 | Male | No | Sat | Dinner | 2 |
234 | 15.53 | 3.00 | Male | Yes | Sat | Dinner | 2 |
235 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
236 | 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 |
237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
<img src="images/continuum_analytics_logo.png" alt="Continuum Logo" align="right" width="30%">
## Blaze

Blaze translates a subset of NumPy/Pandas syntax into database queries. It hides away the database.
On simple datasets, like CSV files, Blaze acts like Pandas with slightly different syntax. In this case Blaze is just using Pandas.
import pandas as pd
df = pd.read_csv('iris.csv')
df.head(5)
| | Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
df.Species.unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)
df.Species.drop_duplicates()
0          setosa
50     versicolor
100     virginica
Name: Species, dtype: object
import blaze as bz
d = bz.Data('iris.csv')
d.head(5)
| | Unnamed: 0 | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|---|
0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
d.Species.distinct()
| | Species |
---|---|
0 | setosa |
1 | versicolor |
2 | virginica |
Blaze does different things under the hood for different kinds of data.
We'll play with SQL a lot during this tutorial. Blaze translates your query to SQLAlchemy. SQLAlchemy then translates to the SQL dialect of your database, and your database then executes that query intelligently.
This translation process lets analysts interact with a familiar interface while leveraging a potentially powerful database.
To keep things local we'll use SQLite, but this works with any database with a SQLAlchemy dialect. Examples in this section use the iris dataset. Exercises use the Lahman Baseball statistics database, year 2013.
If you have not downloaded this dataset you can do so here: https://github.com/jknecht/baseball-archive-sqlite/raw/master/lahman2013.sqlite
!ls
LICENSE Section_1_blaze.ipynb embedded_apps iris.csv numba_tutorial README.md Section_1_blaze_solutions.ipynb environment.yml lahman2013.sqlite Section_0_Introduction.ipynb Section_2_bokeh.ipynb images my.db
Let's dive into Blaze syntax. For simple queries it looks and feels similar to Pandas.
db = bz.Data('sqlite:///my.db')
#db.iris
#db.iris.head()
db.iris.Species.distinct()
| | Species |
---|---|
0 | Species |
1 | setosa |
2 | versicolor |
3 | virginica |
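For comparison, here is roughly what the hand-written SQLAlchemy for that distinct query might look like. This is a sketch of the kind of expression Blaze builds for us, using SQLAlchemy 1.x-style reflection; the boilerplate is our own and the exact expression Blaze emits may differ:

import sqlalchemy as sa

engine = sa.create_engine('sqlite:///my.db')
metadata = sa.MetaData()
iris = sa.Table('iris', metadata, autoload=True, autoload_with=engine)  # reflect the schema

# SELECT DISTINCT iris."Species" FROM iris
stmt = sa.select([iris.c.Species]).distinct()
print(engine.execute(stmt).fetchall())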
db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]
| | Species | SepalLength |
---|---|---|
0 | versicolor | 7.0 |
1 | versicolor | 6.4 |
2 | versicolor | 6.9 |
3 | versicolor | 5.5 |
4 | versicolor | 6.5 |
5 | versicolor | 5.7 |
6 | versicolor | 6.3 |
7 | versicolor | 4.9 |
8 | versicolor | 6.6 |
9 | versicolor | 5.2 |
10 | versicolor | 5.0 |
If we were using Pandas we would read the table into memory, then use Pandas' fast in-memory algorithms for computation. Here Blaze translates your query into SQL and sends that query to the database to do the work.
If we want to dive into the internal API we can inspect the query that Blaze transmits.
# Inspect SQL query
query = db.iris[db.iris.Species == 'versicolor'][['Species', 'SepalLength']]
print(bz.compute(query))
SELECT iris."Species", iris."SepalLength" FROM iris WHERE iris."Species" = ?
query = bz.by(db.iris.Species, longest=db.iris.PetalLength.max(),
shortest=db.iris.PetalLength.min())
print(bz.compute(query))
SELECT iris."Species", max(iris."PetalLength") AS longest, min(iris."PetalLength") AS shortest FROM iris GROUP BY iris."Species"
odo(query, list)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-18-3f6ae329ee8d> in <module>()
----> 1 odo(query, list)

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/odo/odo.pyc in odo(source, target, **kwargs)
     88     odo.append.append  - Add things onto existing things
     89     """
---> 90     return into(target, source, **kwargs)

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164
    165         except MDNotImplementedError:

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/blaze/interactive.pyc in into(a, b, **kwargs)
    316     result = compute(b, **kwargs)
    317     kwargs['dshape'] = b.dshape
--> 318     return into(a, result, **kwargs)
    319
    320

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    161             self._cache[types] = func
    162         try:
--> 163             return func(*args, **kwargs)
    164
    165         except MDNotImplementedError:

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/odo/into.pyc in into_type(a, b, **kwargs)
     23     if 'dshape' not in kwargs:
     24         kwargs['dshape'] = discover(b)
---> 25     return convert(a, b, **kwargs)
     26
     27

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/odo/core.pyc in __call__(self, *args, **kwargs)
     28
     29     def __call__(self, *args, **kwargs):
---> 30         return _transform(self.graph, *args, **kwargs)
     31
     32

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/odo/core.pyc in _transform(graph, target, source, excluded_edges, ooc_types, **kwargs)
     44     try:
     45         for (A, B, f) in pth:
---> 46             x = f(x, excluded_edges=excluded_edges, **kwargs)
     47         return x
     48     except NotImplementedError as e:

/Users/aterrel/workspace/apps/anaconda/envs/pydata_apps/lib/python2.7/site-packages/odo/convert.pyc in dataframe_to_numpy(df, dshape, **kwargs)
     22     x = df.to_records(index=False)
     23     if x.dtype != dtype:
---> 24         x = x.astype(dtype)
     25     return x
     26

ValueError: could not convert string to float: PetalLength

(This fails because the `iris` table picked up the CSV header as a data row, as seen in the `distinct` output above: the string 'PetalLength' cannot be converted to a float.)
# db = bz.Data('postgresql://postgres:postgres@ec2-54-159-160-163.compute-1.amazonaws.com') # Use Postgres if you don't have the sqlite file
db = bz.Data('sqlite:///lahman2013.sqlite')
db.dshape
dshape("""{ AllstarFull: var * { playerID: ?string, yearID: ?int32, gameNum: ?int32, gameID: ?string, teamID: ?string, lgID: ?string, GP: ?int32, startingPos: ?int32 }, Appearances: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, G_all: ?int32, GS: ?int32, G_batting: ?int32, G_defense: ?int32, G_p: ?int32, G_c: ?int32, G_1b: ?int32, G_2b: ?int32, G_3b: ?int32, G_ss: ?int32, G_lf: ?int32, G_cf: ?int32, G_rf: ?int32, G_of: ?int32, G_dh: ?int32, G_ph: ?int32, G_pr: ?int32 }, AwardsManagers: var * { playerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string }, AwardsPlayers: var * { playerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string }, AwardsShareManagers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, playerID: ?string, pointsWon: ?int32, pointsMax: ?int32, votesFirst: ?int32 }, AwardsSharePlayers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, playerID: ?string, pointsWon: ?float64, pointsMax: ?int32, votesFirst: ?float64 }, Batting: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, G: ?int32, G_batting: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32, G_old: ?int32 }, BattingPost: var * { yearID: ?int32, round: ?string, playerID: ?string, teamID: ?string, lgID: ?string, G: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, Fielding: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, PB: ?int32, WP: ?int32, SB: ?int32, CS: ?int32, ZR: ?float64 }, FieldingOF: var * { playerID: ?string, yearID: ?int32, stint: ?int32, Glf: ?int32, Gcf: ?int32, Grf: ?int32 }, FieldingPost: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, round: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, TP: ?int32, PB: ?int32, SB: ?int32, CS: ?int32 }, HallOfFame: var * { playerID: ?string, yearid: ?int32, votedBy: ?string, ballots: ?int32, needed: ?int32, votes: ?int32, inducted: ?string, category: ?string, needed_note: ?string }, Managers: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32, plyrMgr: ?string }, ManagersHalf: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, half: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32 }, Master: var * { playerID: ?string, birthYear: ?int32, birthMonth: ?int32, birthDay: ?int32, birthCountry: ?string, birthState: ?string, birthCity: ?string, deathYear: ?int32, deathMonth: ?int32, deathDay: ?int32, deathCountry: ?string, deathState: ?string, deathCity: ?string, nameFirst: ?string, nameLast: ?string, nameGiven: ?string, weight: ?int32, height: ?float64, bats: ?string, throws: ?string, debut: ?float64, finalGame: ?float64, retroID: ?string, bbrefID: ?string }, Pitching: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: 
?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, PitchingPost: var * { playerID: ?string, yearID: ?int32, round: ?string, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: ?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, Salaries: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, salary: ?float64 }, Schools: var * { schoolID: ?string, schoolName: ?string, schoolCity: ?string, schoolState: ?string, schoolNick: ?string }, SchoolsPlayers: var * { playerID: ?string, schoolID: ?string, yearMin: ?int32, yearMax: ?int32 }, SeriesPost: var * { yearID: ?int32, round: ?string, teamIDwinner: ?string, lgIDwinner: ?string, teamIDloser: ?string, lgIDloser: ?string, wins: ?int32, losses: ?int32, ties: ?int32 }, Teams: var * { yearID: ?int32, lgID: ?string, teamID: ?string, franchID: ?string, divID: ?string, Rank: ?int32, G: ?int32, Ghome: ?int32, W: ?int32, L: ?int32, DivWin: ?string, WCWin: ?string, LgWin: ?string, WSWin: ?string, R: ?int32, AB: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, SB: ?int32, CS: ?int32, HBP: ?int32, SF: ?int32, RA: ?int32, ER: ?int32, ERA: ?float64, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, HA: ?int32, HRA: ?int32, BBA: ?int32, SOA: ?int32, E: ?int32, DP: ?int32, FP: ?float64, name: ?string, park: ?string, attendance: ?int32, BPF: ?int32, PPF: ?int32, teamIDBR: ?string, teamIDlahman45: ?string, teamIDretro: ?string }, TeamsFranchises: var * { franchID: ?string, franchName: ?string, active: ?string, NAassoc: ?string }, TeamsHalf: var * { yearID: ?int32, lgID: ?string, teamID: ?string, Half: ?string, divID: ?string, DivWin: ?string, Rank: ?int32, G: ?int32, W: ?int32, L: ?int32 }, temp: var * {ID: ?int32, namefull: ?string, born: ?float64} }""")
# View the Salaries table
# What are the distinct teamIDs in the Salaries table?
# What are the minimum and maximum yearID in the Salaries table?
# For the Oakland Athletics (teamID OAK), pick out the playerID, salary, and yearID columns
# Sort that result by salary.
# Use the ascending=False keyword argument to the sort function to find the highest paid players
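One possible set of answers, sketched with the same Blaze operations used elsewhere in this section (`distinct`, `min`/`max`, column selection, and `sort`); treat it as a guide rather than the definitive solution:

db.Salaries                          # view the table
db.Salaries.teamID.distinct()        # distinct teamIDs
db.Salaries.yearID.min()             # minimum yearID
db.Salaries.yearID.max()             # maximum yearID
oak = db.Salaries[db.Salaries.teamID == 'OAK'][['playerID', 'salary', 'yearID']]
oak.sort('salary')                   # sorted by salary
oak.sort('salary', ascending=False)  # highest paid players first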
In Pandas we perform computations on a per-group basis with the `groupby` operator. In Blaze our syntax is slightly different, using instead the `by` function.
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.groupby('Species').PetalLength.min()
Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64
iris = bz.Data('sqlite:///my.db::iris')
bz.by(iris.Species, largest=iris.PetalLength.max(),
smallest=iris.PetalLength.min())
print(_)
Species
setosa        1.0
versicolor    3.0
virginica     4.5
Name: PetalLength, dtype: float64
By default Blaze only shows us the first ten lines of a result. This provides a more interactive feel and stops us from accidentally crushing our system. Sometimes we do want to compute all of the results and store them someplace.
Blaze expressions are valid sources for `odo`, so we can store our results in any format.
iris = bz.Data('sqlite:///my.db::iris')
query = bz.by(iris.Species, largest=iris.PetalLength.max(), # A lazily evaluated result
smallest=iris.PetalLength.min())
odo(query, list) # A concrete result
[(u'Species', u'PetalLength', u'PetalLength'), (u'setosa', 1.9, 1.0), (u'versicolor', 5.1, 3.0), (u'virginica', 6.9, 4.5)]
The solution to the first split-apply-combine problem is below. Store that result in a list, a CSV file, and a new SQL table in our database (use a URI like `sqlite://...` to specify the SQL table).
result = bz.by(db.Salaries.teamID, avg=db.Salaries.salary.mean(),
max=db.Salaries.salary.max(),
ratio=db.Salaries.salary.max() / db.Salaries.salary.min()
).sort('ratio', ascending=False)
odo(result, list)[:10]
[(u'PHI', 2092230.9326364691, 25000000.0, 416.6666666666667), (u'LAN', 2346982.6980255516, 23854494.0, 397.5749), (u'NYN', 2317349.977245509, 23145011.0, 385.7501833333333), (u'DET', 1980834.9902080784, 23000000.0, 383.3333333333333), (u'MIN', 1525031.650385604, 23000000.0, 383.3333333333333), (u'BOS', 2692113.8568075118, 22500000.0, 375.0), (u'SFN', 2044198.6841463414, 22250000.0, 370.8333333333333), (u'TEX', 1874651.551143201, 22000000.0, 366.6666666666667), (u'SEA', 1932288.913202934, 20557143.0, 342.61905), (u'CHA', 1992653.5012315272, 17000000.0, 340.0)]
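To finish the exercise, the same `result` expression can be pushed to the other targets as well; a sketch (the file and table names here are our own inventions):

odo(result, 'salary_ratios.csv')                           # write to a new CSV file
odo(result, 'sqlite:///lahman2013.sqlite::salary_ratios')  # create a new SQL table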