into

conda install -c blaze blaze
We describe "data" with a few attributes:

1. Format: how the bytes are arranged, e.g. a CSV file, JSON, or HDF5
2. Location: where the data lives, e.g. on local disk or in a database
3. Datatypes: e.g. names as strings and balances as float32
4. Values: the actual data itself

As analysts we care only about point 4, the values that our data represent. Points 1-3 are incidental to how we use computers; these points only get in the way of analysis.
As computationalists though we care very deeply about points 1-3. The choice of format, location, and datatype strongly impact the efficiency and correctness of our computations. Good choices here can mean the difference between waiting overnight and using our data interactively.
Unfortunately points 1-3 encompass a lot of complexity and change more quickly than most analysts care to manage.
The into project alleviates the pain of dealing with the first three points by providing intuitive description of, and transfer between, data formats and storage systems. This allows analysts to quickly reason about their data and migrate it to efficient, correct, and resilient formats.
Before starting the tutorial in earnest, we give a more comprehensive example.
We have a small CSV file holding the iris data
!head -5 data/iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
We put this data into a list, a NumPy array, and a SQLite database. We move data to three very different technologies with the same abstraction.
from into import into
into(list, 'data/iris.csv')[:5]
[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),
 (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),
 (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),
 (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),
 (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')]
import numpy as np
into(np.ndarray, 'data/iris.csv')[:5]
rec.array([(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),
           (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),
           (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),
           (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),
           (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')],
          dtype=[('sepal_length', '<f8'), ('sepal_width', '<f8'),
                 ('petal_length', '<f8'), ('petal_width', '<f8'),
                 ('species', 'O')])
into('sqlite:///data/my.db::iris', 'data/iris.csv')
Table('iris', MetaData(bind=Engine(sqlite:///data/my.db)),
      Column('sepal_length', FLOAT(), table=<iris>, nullable=False),
      Column('sepal_width', FLOAT(), table=<iris>, nullable=False),
      Column('petal_length', FLOAT(), table=<iris>, nullable=False),
      Column('petal_width', FLOAT(), table=<iris>, nullable=False),
      Column('species', TEXT(), table=<iris>, nullable=False),
      schema=None)
Into moves data between formats intuitively.
This tutorial proceeds in small steps: installation, the into function, URI strings, and the resource function.
into is available on conda:

conda install into

or

conda install into -c blaze   # up-to-date version

or on PyPI:

pip install into

or

pip install git+http://github.com/ContinuumIO/into.git   # up-to-date version
into

from into import into
Into takes two arguments, a target and a source:

into(target, source)

It turns the source into something like the target.
target can be a type

In this case into makes a new object of that type.
import numpy as np
into(np.ndarray, [1, 2, 3])
array([1, 2, 3])
into(set, [1, 2, 3])
{1, 2, 3}
import pandas as pd
into(pd.Series, (10, 20, 30))
0    10
1    20
2    30
dtype: int64
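The type-target behavior above can be sketched with a toy function. Note that toy_into below is hypothetical and vastly simpler than the real into, which routes conversions through a network of pairwise conversion functions:

```python
# Toy sketch of the "target is a type" case: build a new object of the
# requested type from the source. NOT into's real implementation.
def toy_into(target_type, source):
    return target_type(source)

print(toy_into(set, [1, 2, 3]))      # {1, 2, 3}
print(toy_into(list, (10, 20, 30)))  # [10, 20, 30]
print(toy_into(tuple, [1, 2]))       # (1, 2)
```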
Use into to turn the following DataFrame into an np.ndarray and into a list.
df = pd.DataFrame([['Alice', 100],
                   ['Bob', 200],
                   ['Charlie', 300]], columns=['name', 'balance'])
# Turn df into an np.ndarray
# into(..., ...)
# Turn df into a list
# into(..., ...)
target can be an object

In this case into appends the source onto that object.
target = []
into(target, (1, 2, 3))
into(target, (1, 2, 3))
into(target, (1, 2, 3))
target
[1, 2, 3, 1, 2, 3, 1, 2, 3]
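The object-target behavior can likewise be sketched in miniature. The toy_into below is a hypothetical simplification of into's append behavior, shown for a plain Python list:

```python
# Toy sketch of the "target is an object" case: the source is appended
# onto the existing target, which is also returned.
def toy_into(target, source):
    target.extend(source)
    return target

target = []
toy_into(target, (1, 2, 3))
toy_into(target, (1, 2, 3))
toy_into(target, (1, 2, 3))
print(target)  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
```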
Use into to make a set holding all the data in the following list of DataFrames.
L = [pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}),
     pd.DataFrame({'name': ['Charlie', 'Dan'], 'balance': [300, 400]}),
     pd.DataFrame({'name': ['Edith', 'Frank'], 'balance': [500, 600]})]
s = set()
# Use into and some kind of for loop to put all of the data in L into the set s
Repeat the last exercise, but append all of the data onto the tuple t. What do you expect to happen?
t = tuple()
It's important to know that into has limitations. Many data sources external to Python (like a CSV file) don't have a Python object that we can pass as the source or target. In these cases we use string URIs. Example URI strings include:
myfile.csv
myfile.json
myfile.hdf5
myfile.hdf5::/data
sqlite:///myfile.db::table-name
postgresql:///user:password@host:port::table-name
...
These strings can be used as either the source or the target argument.
Here we write our dataframe to a CSV file
# Write DataFrame to CSV file
into('accounts.csv', df)
<into.backends.csv.CSV at 0x7fc92844fd50>
# print out text in accounts.csv
!head accounts.csv
name,balance
Alice,100
Bob,200
Charlie,300
# Read CSV file into memory as list
into(list, 'accounts.csv')
[(u'Alice', 100), (u'Bob', 200), (u'Charlie', 300)]
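Roughly, the CSV round trip above can be imitated with the standard library's csv module. This is a sketch of what a CSV backend must do under the hood, not into's actual code, and the file name here is made up:

```python
import csv
import os
import tempfile

# Write rows to a CSV file with a header, then read them back --
# roughly what a CSV backend does for us (a sketch, not into's code).
path = os.path.join(tempfile.gettempdir(), 'accounts_sketch.csv')
rows = [('Alice', 100), ('Bob', 200), ('Charlie', 300)]

with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'balance'])   # header row
    writer.writerows(rows)

with open(path, newline='') as f:
    reader = csv.reader(f)
    next(reader)                           # skip the header
    data = [(name, int(balance)) for name, balance in reader]

print(data)  # [('Alice', 100), ('Bob', 200), ('Charlie', 300)]
```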
Read the contents of the file 'data/iris.csv' into a pd.DataFrame. Then write that dataframe to 'data/iris.json'. Inspect the JSON data to ensure that it came out correctly.
Write the contents of your JSON file to a SQLite database using the following URI as the target:

sqlite:///data/my.db::iris

Then read data from that SQLite database back into Python to make sure that it arrived safely.
resource

Much interesting data lives outside of Python. As we just saw, we often use URIs to specify this kind of data.
Here we load a bit of a SQL database on baseball statistics (download here) into memory as a list
into(list, 'sqlite:///data/lahman2013.sqlite::BattingPost')[:5]
[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),
 (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),
 (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),
 (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),
 (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]
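The same table-to-list pull can be sketched with the standard library's sqlite3 module. The in-memory database and abbreviated columns below are made up for illustration, since the Lahman file may not be available:

```python
import sqlite3

# A throwaway in-memory database standing in for the Lahman file;
# column names are abbreviated for brevity.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE BattingPost (yearID INTEGER, playerID TEXT, H INTEGER)')
conn.executemany('INSERT INTO BattingPost VALUES (?, ?, ?)',
                 [(1884, 'becanbu01', 1), (1884, 'bradyst01', 0)])

# Pull the SQL table into memory as a list of tuples, which is roughly
# what into(list, 'sqlite://...::BattingPost') does for us.
rows = conn.execute('SELECT * FROM BattingPost').fetchall()
print(rows)  # [(1884, 'becanbu01', 1), (1884, 'bradyst01', 0)]
```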
Now we learn how these strings work so that we can specify many types of external data.
Internally into uses the function resource to turn a string into a Python proxy object. Usually these objects don't hold the data themselves; they just serve as useful pointers to where the data lives. In most cases we use other Python projects for proxy objects. In the case of SQL tables, resource returns a sqlalchemy.Table object.
from into import resource
t = resource('sqlite:///data/lahman2013.sqlite::BattingPost')
type(t)
sqlalchemy.sql.schema.Table
t
Table('BattingPost', MetaData(bind=Engine(sqlite:///data/lahman2013.sqlite)),
      Column('yearID', INTEGER(), table=<BattingPost>),
      Column('round', TEXT(), table=<BattingPost>),
      Column('playerID', TEXT(), table=<BattingPost>),
      Column('teamID', TEXT(), table=<BattingPost>),
      Column('lgID', TEXT(), table=<BattingPost>),
      Column('G', INTEGER(), table=<BattingPost>),
      Column('AB', INTEGER(), table=<BattingPost>),
      Column('R', INTEGER(), table=<BattingPost>),
      Column('H', INTEGER(), table=<BattingPost>),
      Column('2B', INTEGER(), table=<BattingPost>),
      Column('3B', INTEGER(), table=<BattingPost>),
      Column('HR', INTEGER(), table=<BattingPost>),
      Column('RBI', INTEGER(), table=<BattingPost>),
      Column('SB', INTEGER(), table=<BattingPost>),
      Column('CS', INTEGER(), table=<BattingPost>),
      Column('BB', INTEGER(), table=<BattingPost>),
      Column('SO', INTEGER(), table=<BattingPost>),
      Column('IBB', INTEGER(), table=<BattingPost>),
      Column('HBP', INTEGER(), table=<BattingPost>),
      Column('SH', INTEGER(), table=<BattingPost>),
      Column('SF', INTEGER(), table=<BattingPost>),
      Column('GIDP', INTEGER(), table=<BattingPost>),
      schema=None)
We use this object as the into source.
into(list, t)[:5]
[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),
 (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),
 (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),
 (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),
 (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]
So when you write
into(list, 'some-string')
it is actually just shorthand for
into(list, resource('some-string'))
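A drastically simplified resource could just inspect the string. The toy_resource helper below is hypothetical and for illustration only; the real resource dispatches through a registry of patterns to full-featured proxy objects (SQLAlchemy tables, CSV wrappers, and so on):

```python
# Hypothetical, drastically simplified resource(): look at the URI
# string and pick a kind of proxy. The real resource returns rich
# proxy objects from other Python projects.
def toy_resource(uri):
    if '://' in uri:
        protocol, _, rest = uri.partition('://')
        return (protocol, rest)   # e.g. hand off to a database library
    if uri.endswith('.csv'):
        return ('csv', uri)       # e.g. a CSV proxy object
    if uri.endswith('.json'):
        return ('json', uri)
    raise NotImplementedError(uri)

print(toy_resource('data/iris.csv'))              # ('csv', 'data/iris.csv')
print(toy_resource('sqlite:///data/my.db::iris')) # ('sqlite', '/data/my.db::iris')
```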
We have some data in the data/ directory. Use resource on each of the following strings to see what it returns.
sqlite:///data/lahman2013.sqlite::BattingPost
sqlite:///data/lahman2013.sqlite::Salaries
sqlite:///data/lahman2013.sqlite
data/sample.hdf5::/points
data/sample.hdf5
data/iris.csv
The :: separator

In the last exercise we saw the following URI for a table in a SQLite database:

sqlite:///data/my.db::iris
We deconstruct this URI to make it clearer. First we split the URI on :: to separate the database from the table name:

Database: sqlite:///data/my.db
Table name: iris
We use the :: separator whenever datasets live within some nested structure like a database or HDF5 file.
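Splitting on :: takes only a few lines. The split_subpath helper below is hypothetical, not part of into's API, and only illustrates the convention:

```python
# Hypothetical helper: split a URI on '::' into the outer container
# (database or HDF5 file) and the dataset nested inside it.
def split_subpath(uri):
    if '::' in uri:
        container, _, dataset = uri.rpartition('::')
        return container, dataset
    return uri, None

print(split_subpath('sqlite:///data/my.db::iris'))
# ('sqlite:///data/my.db', 'iris')
print(split_subpath('data/sample.hdf5::/points'))
# ('data/sample.hdf5', '/points')
```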
The :// protocol

The database string sqlite:///data/my.db is specific to SQLAlchemy, but follows a common format, notably:
Protocol: sqlite://
Filename: data/my.db
Into also uses protocols in many cases to give extra hints on how to handle your data. For example, Python has a few different libraries to handle HDF5 files (h5py, pytables, pandas.HDFStore). By default, when we see a URI like myfile.hdf5, we currently use h5py. To override this behavior you can specify a protocol string like hdfstore://myfile.hdf5 to indicate that you want the special pandas.HDFStore format.
Note: SQLAlchemy strings are a little odd in that they use three slashes by default (e.g. sqlite:///my.db) and four slashes when using absolute paths (e.g. sqlite:////Users/Alice/data/my.db).
People use the .json extension in two ways: some files hold a single JSON blob, while others hold one JSON object per line (the "line-delimited" or "JSON lines" format).

Parsers have a hard time figuring out which case is which. When reading an existing file, into can usually figure out whether the file is line-delimited or not. When creating a file, however, we don't know what your intention is. You can specify your intention by adding either a json:// or jsonlines:// protocol.
Here we write our dataframe to a JSON file.
into('accounts.json', df)
!head accounts.json
[{"balance": 100, "name": "Alice"}, {"balance": 200, "name": "Bob"}, {"balance": 300, "name": "Charlie"}]
!rm accounts.json # Remove old file
This is the traditional single-JSON-blob-per-file format.
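The two conventions can be illustrated with the standard library's json module. This is just a sketch of the formats themselves; into's JSON backends choose between them via the protocol strings above:

```python
import json

records = [{"name": "Alice", "balance": 100},
           {"name": "Bob", "balance": 200}]

# Single-JSON-blob-per-file: the whole list is one JSON document
blob = json.dumps(records)

# Line-delimited ("JSON lines"): one JSON object per line
lines = '\n'.join(json.dumps(r) for r in records)

print(blob)
print(lines)
```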
Now instead write our DataFrame in the line-delimited format by adding a jsonlines:// protocol to the target string. Inspect the result to make sure that each line is a separate valid JSON blob.