into

conda install -c blaze blaze
We describe "data" with a few attributes:

1. Format: how the bytes are arranged, e.g. a CSV file, JSON, or HDF5
2. Location: where the data lives, e.g. on local disk or in a database
3. Datatypes: e.g. names as strings and balances as float32
4. Values: the actual data itself

As analysts we care only about point 4, the values that our data represent. Points 1-3 are incidental to how we use computers; these points only get in the way of analysis.
As computationalists though we care very deeply about points 1-3. The choice of format, location, and datatype strongly impact the efficiency and correctness of our computations. Good choices here can mean the difference between waiting overnight and using our data interactively.
Unfortunately points 1-3 encompass a lot of complexity and change more quickly than most analysts care to manage.
The into project alleviates the pain of dealing with the first three points by providing intuitive description of, and transfer between, data formats and storage systems. This allows analysts to quickly reason about their data and migrate it to efficient, correct, and resilient formats.
Before starting the tutorial in earnest, we give a more comprehensive example.
We have a small CSV file holding the iris data
!head -5 data/iris.csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
We put this data into a list, a NumPy array, and a SQLite database. We move data to three very different technologies with the same abstraction.
from into import into
into(list, 'data/iris.csv')[:5]
[(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),
 (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),
 (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),
 (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),
 (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')]
import numpy as np
into(np.ndarray, 'data/iris.csv')[:5]
rec.array([(5.1, 3.5, 1.4, 0.2, u'Iris-setosa'),
           (4.9, 3.0, 1.4, 0.2, u'Iris-setosa'),
           (4.7, 3.2, 1.3, 0.2, u'Iris-setosa'),
           (4.6, 3.1, 1.5, 0.2, u'Iris-setosa'),
           (5.0, 3.6, 1.4, 0.2, u'Iris-setosa')],
          dtype=[('sepal_length', '<f8'), ('sepal_width', '<f8'),
                 ('petal_length', '<f8'), ('petal_width', '<f8'),
                 ('species', 'O')])
into('sqlite:///data/my.db::iris', 'data/iris.csv')
Table('iris', MetaData(bind=Engine(sqlite:///data/my.db)),
      Column('sepal_length', FLOAT(), table=<iris>, nullable=False),
      Column('sepal_width', FLOAT(), table=<iris>, nullable=False),
      Column('petal_length', FLOAT(), table=<iris>, nullable=False),
      Column('petal_width', FLOAT(), table=<iris>, nullable=False),
      Column('species', TEXT(), table=<iris>, nullable=False),
      schema=None)
Into moves data between formats intuitively.
This tutorial proceeds in small steps: installation, the into function, URI strings, and the resource function.
into is available on conda:

conda install into

or

conda install into -c blaze   # up-to-date version

or on PyPI:

pip install into

or

pip install git+http://github.com/ContinuumIO/into.git   # up-to-date version
into

from into import into
Into takes two arguments, a target and a source:

into(target, source)

It turns the source into something like the target.
target can be a type

In this case into makes a new object of that type.
import numpy as np
into(np.ndarray, [1, 2, 3])
array([1, 2, 3])
into(set, [1, 2, 3])
{1, 2, 3}
import pandas as pd
into(pd.Series, (10, 20, 30))
0    10
1    20
2    30
dtype: int64
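The type-target behavior above can be sketched with a toy function. Note that toy_into below is hypothetical and vastly simpler than the real into, which routes conversions through a network of pairwise conversion functions:

```python
# Toy sketch of the "target is a type" case: build a new object of the
# requested type from the source. NOT into's real implementation.
def toy_into(target_type, source):
    return target_type(source)

print(toy_into(set, [1, 2, 3]))      # {1, 2, 3}
print(toy_into(list, (10, 20, 30)))  # [10, 20, 30]
print(toy_into(tuple, [1, 2]))       # (1, 2)
```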
Use into to turn the following DataFrame into an np.ndarray and into a list.
df = pd.DataFrame([['Alice', 100],
                   ['Bob', 200],
                   ['Charlie', 300]], columns=['name', 'balance'])
# Turn df into an np.ndarray
# into(..., ...)
# Turn df into a list
# into(..., ...)
target can be an object

In this case into appends the source onto that object.
target = []
into(target, (1, 2, 3))
into(target, (1, 2, 3))
into(target, (1, 2, 3))
target
[1, 2, 3, 1, 2, 3, 1, 2, 3]
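The object-target behavior can likewise be sketched in miniature. The toy_into below is a hypothetical simplification of into's append behavior, shown for a plain Python list:

```python
# Toy sketch of the "target is an object" case: the source is appended
# onto the existing target, which is also returned.
def toy_into(target, source):
    target.extend(source)
    return target

target = []
toy_into(target, (1, 2, 3))
toy_into(target, (1, 2, 3))
toy_into(target, (1, 2, 3))
print(target)  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
```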
Use into to make a set holding all the data in the following list of DataFrames.
L = [pd.DataFrame({'name': ['Alice', 'Bob'], 'balance': [100, 200]}),
     pd.DataFrame({'name': ['Charlie', 'Dan'], 'balance': [300, 400]}),
     pd.DataFrame({'name': ['Edith', 'Frank'], 'balance': [500, 600]})]
s = set()
# Use into and some kind of for loop to put all of the data in L into the set s
Repeat the last exercise, but append all of the data onto the tuple t. What do you expect to happen?
t = tuple()
It's important to know that into has limitations. Many data sources external to Python (like a CSV file) don't have a Python object that we can pass as the source or target. In these cases we use string URIs. Example URI strings include:
myfile.csv
myfile.json
myfile.hdf5
myfile.hdf5::/data
sqlite:///myfile.db::table-name
postgresql:///user:password@host:port::table-name
...
These strings can be used as either the source or the target argument.
Here we write our dataframe to a CSV file
# Write DataFrame to CSV file
into('accounts.csv', df)
<into.backends.csv.CSV at 0x7fc92844fd50>
# print out text in accounts.csv
!head accounts.csv
name,balance
Alice,100
Bob,200
Charlie,300
# Read CSV file into memory as list
into(list, 'accounts.csv')
[(u'Alice', 100), (u'Bob', 200), (u'Charlie', 300)]
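Roughly, the CSV round trip above can be imitated with the standard library's csv module. This is a sketch of what a CSV backend must do under the hood, not into's actual code, and the file name here is made up:

```python
import csv
import os
import tempfile

# Write rows to a CSV file with a header, then read them back --
# roughly what a CSV backend does for us (a sketch, not into's code).
path = os.path.join(tempfile.gettempdir(), 'accounts_sketch.csv')
rows = [('Alice', 100), ('Bob', 200), ('Charlie', 300)]

with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'balance'])   # header row
    writer.writerows(rows)

with open(path, newline='') as f:
    reader = csv.reader(f)
    next(reader)                           # skip the header
    data = [(name, int(balance)) for name, balance in reader]

print(data)  # [('Alice', 100), ('Bob', 200), ('Charlie', 300)]
```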
Read the contents of the file 'data/iris.csv' into a pd.DataFrame. Then write that dataframe to 'data/iris.json'. Inspect the JSON data to ensure that it came out correctly.
Write the contents of your JSON file to a SQLite database using the following URI as the target:

sqlite:///data/my.db::iris

Then read data from that SQLite database back into Python to make sure that it arrived safely.
resource

Much interesting data lives outside of Python. As we just saw, we often use URIs to specify this kind of data.
Here we load a bit of a SQL database on baseball statistics (download here) into memory as a list
into(list, 'sqlite:///data/lahman2013.sqlite::BattingPost')[:5]
[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),
 (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),
 (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),
 (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),
 (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]
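The same table-to-list pull can be sketched with the standard library's sqlite3 module. The in-memory database and abbreviated columns below are made up for illustration, since the Lahman file may not be available:

```python
import sqlite3

# A throwaway in-memory database standing in for the Lahman file;
# column names are abbreviated for brevity.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE BattingPost (yearID INTEGER, playerID TEXT, H INTEGER)')
conn.executemany('INSERT INTO BattingPost VALUES (?, ?, ?)',
                 [(1884, 'becanbu01', 1), (1884, 'bradyst01', 0)])

# Pull the SQL table into memory as a list of tuples, which is roughly
# what into(list, 'sqlite://...::BattingPost') does for us.
rows = conn.execute('SELECT * FROM BattingPost').fetchall()
print(rows)  # [(1884, 'becanbu01', 1), (1884, 'bradyst01', 0)]
```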
Now we learn how these strings work so that we can specify many types of external data.
Internally into uses the function resource to turn a string into a Python proxy object. Usually these objects don't hold the data themselves; they just serve as useful pointers to where the data lives. In most cases we use other Python projects for proxy objects. In the case of SQL tables, resource returns a sqlalchemy.Table object.
from into import resource
t = resource('sqlite:///data/lahman2013.sqlite::BattingPost')
type(t)
sqlalchemy.sql.schema.Table
t
Table('BattingPost', MetaData(bind=Engine(sqlite:///data/lahman2013.sqlite)),
      Column('yearID', INTEGER(), table=<BattingPost>),
      Column('round', TEXT(), table=<BattingPost>),
      Column('playerID', TEXT(), table=<BattingPost>),
      Column('teamID', TEXT(), table=<BattingPost>),
      Column('lgID', TEXT(), table=<BattingPost>),
      Column('G', INTEGER(), table=<BattingPost>),
      Column('AB', INTEGER(), table=<BattingPost>),
      Column('R', INTEGER(), table=<BattingPost>),
      Column('H', INTEGER(), table=<BattingPost>),
      Column('2B', INTEGER(), table=<BattingPost>),
      Column('3B', INTEGER(), table=<BattingPost>),
      Column('HR', INTEGER(), table=<BattingPost>),
      Column('RBI', INTEGER(), table=<BattingPost>),
      Column('SB', INTEGER(), table=<BattingPost>),
      Column('CS', INTEGER(), table=<BattingPost>),
      Column('BB', INTEGER(), table=<BattingPost>),
      Column('SO', INTEGER(), table=<BattingPost>),
      Column('IBB', INTEGER(), table=<BattingPost>),
      Column('HBP', INTEGER(), table=<BattingPost>),
      Column('SH', INTEGER(), table=<BattingPost>),
      Column('SF', INTEGER(), table=<BattingPost>),
      Column('GIDP', INTEGER(), table=<BattingPost>),
      schema=None)
We use this object as the into source.
into(list, t)[:5]
[(1884, u'WS', u'becanbu01', u'NY4', u'AA', 1, 2, 0, 1, 0, 0, 0, 0, 0, None, 0, 0, 0, None, None, None, None),
 (1884, u'WS', u'bradyst01', u'NY4', u'AA', 3, 10, 1, 0, 0, 0, 0, 0, 0, None, 0, 1, 0, None, None, None, None),
 (1884, u'WS', u'carrocl01', u'PRO', u'NL', 3, 10, 2, 1, 0, 0, 0, 1, 0, None, 1, 1, 0, None, None, None, None),
 (1884, u'WS', u'dennyje01', u'PRO', u'NL', 3, 9, 3, 4, 0, 1, 1, 2, 0, None, 0, 3, 0, None, None, None, None),
 (1884, u'WS', u'esterdu01', u'NY4', u'AA', 3, 10, 0, 3, 1, 0, 0, 0, 1, None, 0, 3, 0, None, None, None, None)]
So when you write
into(list, 'some-string')
it is actually just shorthand for
into(list, resource('some-string'))
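A drastically simplified resource could just inspect the string. The toy_resource helper below is hypothetical and for illustration only; the real resource dispatches through a registry of patterns to full-featured proxy objects (SQLAlchemy tables, CSV wrappers, and so on):

```python
# Hypothetical, drastically simplified resource(): look at the URI
# string and pick a kind of proxy. The real resource returns rich
# proxy objects from other Python projects.
def toy_resource(uri):
    if '://' in uri:
        protocol, _, rest = uri.partition('://')
        return (protocol, rest)   # e.g. hand off to a database library
    if uri.endswith('.csv'):
        return ('csv', uri)       # e.g. a CSV proxy object
    if uri.endswith('.json'):
        return ('json', uri)
    raise NotImplementedError(uri)

print(toy_resource('data/iris.csv'))              # ('csv', 'data/iris.csv')
print(toy_resource('sqlite:///data/my.db::iris')) # ('sqlite', '/data/my.db::iris')
```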
We have some data in the data/ directory. Use resource on each of the following strings to see what it returns.
sqlite:///data/lahman2013.sqlite::BattingPost
sqlite:///data/lahman2013.sqlite::Salaries
sqlite:///data/lahman2013.sqlite
data/sample.hdf5::/points
data/sample.hdf5
data/iris.csv
The :: separator

In the last exercise we saw the following URI for a table in a SQLite database:

sqlite:///data/my.db::iris
We deconstruct this URI to make it clearer. First we split the URI on :: to separate the database from the table name:

Database: sqlite:///data/my.db
Table name: iris
We use the :: separator whenever datasets live within some nested structure like a database or HDF5 file.
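Splitting on :: takes only a few lines. The split_subpath helper below is hypothetical, not part of into's API, and only illustrates the convention:

```python
# Hypothetical helper: split a URI on '::' into the outer container
# (database or HDF5 file) and the dataset nested inside it.
def split_subpath(uri):
    if '::' in uri:
        container, _, dataset = uri.rpartition('::')
        return container, dataset
    return uri, None

print(split_subpath('sqlite:///data/my.db::iris'))
# ('sqlite:///data/my.db', 'iris')
print(split_subpath('data/sample.hdf5::/points'))
# ('data/sample.hdf5', '/points')
```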
The :// protocol

The database string sqlite:///data/my.db is specific to SQLAlchemy, but follows a common format, notably:
Protocol: sqlite://
Filename: data/my.db
Into also uses protocols in many cases to give extra hints on how to handle your data. For example, Python has a few different libraries to handle HDF5 files (h5py, pytables, pandas.HDFStore). By default, when we see a URI like myfile.hdf5, we currently use h5py. To override this behavior you can specify a protocol string like hdfstore://myfile.hdf5 to indicate that you want the special pandas.HDFStore format.
Note: SQLAlchemy strings are a little odd in that they use three slashes by default (e.g. sqlite:///my.db) and four slashes when using absolute paths (e.g. sqlite:////Users/Alice/data/my.db).
People use the .json extension in two ways: some files hold a single JSON blob, while others hold one JSON object per line (the "line-delimited" or "JSON lines" format).

Parsers have a hard time figuring out which case is which. When reading an existing file, into can usually figure out whether the file is line-delimited or not. When creating a file, however, we don't know what your intention is. You can specify your intention by adding either a json:// or jsonlines:// protocol.
Here we write our dataframe to a JSON file.
into('accounts.json', df)
!head accounts.json
[{"balance": 100, "name": "Alice"}, {"balance": 200, "name": "Bob"}, {"balance": 300, "name": "Charlie"}]
!rm accounts.json # Remove old file
This is the traditional single-JSON-blob-per-file format.
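The two conventions can be illustrated with the standard library's json module. This is just a sketch of the formats themselves; into's JSON backends choose between them via the protocol strings above:

```python
import json

records = [{"name": "Alice", "balance": 100},
           {"name": "Bob", "balance": 200}]

# Single-JSON-blob-per-file: the whole list is one JSON document
blob = json.dumps(records)

# Line-delimited ("JSON lines"): one JSON object per line
lines = '\n'.join(json.dumps(r) for r in records)

print(blob)
print(lines)
```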
Now instead write our DataFrame in the line-delimited format by adding a jsonlines:// protocol to the target string. Inspect the result to make sure that each line is a separate valid JSON blob.