into
¶conda install -c blaze blaze
int32
or int64
? int8
?As we encode values as bits we make choices; those choices can affect performance. We encode how to convert values to bits and back as a datatype. You've seen data types before in many forms including C types like long
, double
and double[100]
, numpy dtypes like i4
and f8
or Python types like int
, and float
. Other systems like SQL, HDF5, etc. have similar datatype systems with different names.
To manage datatypes across different systems we use datashape
a datatype system that maps cleanly on to all systems with which into
interacts. This one system can translate into any of the others.
In this section we'll talk about the following
discover
¶We introduce datashape, an all-encompassing datatype language, and discover
, a function that does all of the work for you.
The discover function returns the datashape of an object. Lets look at a few examples.
from into import discover, into, resource
discover(1)
ctype("int64")
discover([1, 2, 3])
dshape("3 * int64")
discover([[1, 2, 3],
[4, 5, 6]])
dshape("2 * 3 * int64")
discover([{'x': 1, 'y': 1.0},
{'x': 2, 'y': 2.0},
{'x': 3, 'y': 3.0}])
dshape("3 * {x: int64, y: float64}")
import pandas as pd
df = pd.DataFrame([['Alice', 100],
['Bob', 200],
['Charlie', 300]], columns=['name', 'balance'])
discover(df)
dshape("3 * {name: string, balance: int64}")
By looking closely at these examples we see the structure of datashape. Elements have types like int64
or string
. Records/structs/groups of elements have record dtypes like {x: int64, y: float64}
. Lengths of collections are encoded by numbers like 3 *
for "three of" or 2 * 3 *
for "a two-by-three grid of".
Construct data with the following datashapes. Use any container type (e.g. list
, pd.DataFrame
, np.ndarray
).
2 * int64
2 * string
{name: string, id: int}
datetime
{name: string, id: int, payments: 2 * datetime}
2 * {name: string, id: int, payments: 2 * datetime}
5 * 5 * 5 * float32
Use discover
to verify your answers.
# Should be 2 * int64
discover([1, 2])
dshape("2 * int64")
The discover
function doesn't care if your data lives in a Python list, Pandas DataFrame, NumPy Array, CSV file, PySpark RDD, or SQL database.
In other words, using into
preserves datashape.
discover(df)
dshape("3 * {name: string, balance: int64}")
import numpy as np
x = into(np.ndarray, df)
discover(x) # different container, same datashape
dshape("3 * {name: string, balance: int64}")
t = into('sqlite:///:memory:::mydf', df)
discover(t) # different container, mostly the same datashape
dshape("var * {name: string, balance: int64}")
Call discover on a single table of our baseball statistics database.
salaries = resource('sqlite:///data/lahman2013.sqlite::Salaries')
discover(salaries)
dshape("""var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, salary: ?float64 }""")
And then call it on the entire database
db = resource('sqlite:///data/lahman2013.sqlite')
discover(db)
dshape("""{ AllstarFull: var * { playerID: ?string, yearID: ?int32, gameNum: ?int32, gameID: ?string, teamID: ?string, lgID: ?string, GP: ?int32, startingPos: ?int32 }, Appearances: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, G_all: ?int32, GS: ?int32, G_batting: ?int32, G_defense: ?int32, G_p: ?int32, G_c: ?int32, G_1b: ?int32, G_2b: ?int32, G_3b: ?int32, G_ss: ?int32, G_lf: ?int32, G_cf: ?int32, G_rf: ?int32, G_of: ?int32, G_dh: ?int32, G_ph: ?int32, G_pr: ?int32 }, AwardsManagers: var * { playerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string }, AwardsPlayers: var * { playerID: ?string, awardID: ?string, yearID: ?int32, lgID: ?string, tie: ?string, notes: ?string }, AwardsShareManagers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, playerID: ?string, pointsWon: ?int32, pointsMax: ?int32, votesFirst: ?int32 }, AwardsSharePlayers: var * { awardID: ?string, yearID: ?int32, lgID: ?string, playerID: ?string, pointsWon: ?float64, pointsMax: ?int32, votesFirst: ?float64 }, Batting: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, G: ?int32, G_batting: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32, G_old: ?int32 }, BattingPost: var * { yearID: ?int32, round: ?string, playerID: ?string, teamID: ?string, lgID: ?string, G: ?int32, AB: ?int32, R: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, RBI: ?int32, SB: ?int32, CS: ?int32, BB: ?int32, SO: ?int32, IBB: ?int32, HBP: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, Fielding: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, PB: ?int32, WP: ?int32, SB: ?int32, CS: ?int32, ZR: ?float64 }, FieldingOF: var * { playerID: ?string, yearID: ?int32, stint: ?int32, Glf: ?int32, Gcf: ?int32, Grf: ?int32 }, FieldingPost: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, round: ?string, POS: ?string, G: ?int32, GS: ?int32, InnOuts: ?int32, PO: ?int32, A: ?int32, E: ?int32, DP: ?int32, TP: ?int32, PB: ?int32, SB: ?int32, CS: ?int32 }, HallOfFame: var * { playerID: ?string, yearid: ?int32, votedBy: ?string, ballots: ?int32, needed: ?int32, votes: ?int32, inducted: ?string, category: ?string, needed_note: ?string }, Managers: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32, plyrMgr: ?string }, ManagersHalf: var * { playerID: ?string, yearID: ?int32, teamID: ?string, lgID: ?string, inseason: ?int32, half: ?int32, G: ?int32, W: ?int32, L: ?int32, rank: ?int32 }, Master: var * { playerID: ?string, birthYear: ?int32, birthMonth: ?int32, birthDay: ?int32, birthCountry: ?string, birthState: ?string, birthCity: ?string, deathYear: ?int32, deathMonth: ?int32, deathDay: ?int32, deathCountry: ?string, deathState: ?string, deathCity: ?string, nameFirst: ?string, nameLast: ?string, nameGiven: ?string, weight: ?int32, height: ?float64, bats: ?string, throws: ?string, debut: ?float64, finalGame: ?float64, retroID: ?string, bbrefID: ?string }, Pitching: var * { playerID: ?string, yearID: ?int32, stint: ?int32, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: ?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, PitchingPost: var * { playerID: ?string, yearID: ?int32, round: ?string, teamID: ?string, lgID: ?string, W: ?int32, L: ?int32, G: ?int32, GS: ?int32, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, H: ?int32, ER: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, BAOpp: ?float64, ERA: ?float64, IBB: ?int32, WP: ?int32, HBP: ?int32, BK: ?int32, BFP: ?int32, GF: ?int32, R: ?int32, SH: ?int32, SF: ?int32, GIDP: ?int32 }, Salaries: var * { yearID: ?int32, teamID: ?string, lgID: ?string, playerID: ?string, salary: ?float64 }, Schools: var * { schoolID: ?string, schoolName: ?string, schoolCity: ?string, schoolState: ?string, schoolNick: ?string }, SchoolsPlayers: var * { playerID: ?string, schoolID: ?string, yearMin: ?int32, yearMax: ?int32 }, SeriesPost: var * { yearID: ?int32, round: ?string, teamIDwinner: ?string, lgIDwinner: ?string, teamIDloser: ?string, lgIDloser: ?string, wins: ?int32, losses: ?int32, ties: ?int32 }, Teams: var * { yearID: ?int32, lgID: ?string, teamID: ?string, franchID: ?string, divID: ?string, Rank: ?int32, G: ?int32, Ghome: ?int32, W: ?int32, L: ?int32, DivWin: ?string, WCWin: ?string, LgWin: ?string, WSWin: ?string, R: ?int32, AB: ?int32, H: ?int32, 2B: ?int32, 3B: ?int32, HR: ?int32, BB: ?int32, SO: ?int32, SB: ?int32, CS: ?int32, HBP: ?int32, SF: ?int32, RA: ?int32, ER: ?int32, ERA: ?float64, CG: ?int32, SHO: ?int32, SV: ?int32, IPouts: ?int32, HA: ?int32, HRA: ?int32, BBA: ?int32, SOA: ?int32, E: ?int32, DP: ?int32, FP: ?float64, name: ?string, park: ?string, attendance: ?int32, BPF: ?int32, PPF: ?int32, teamIDBR: ?string, teamIDlahman45: ?string, teamIDretro: ?string }, TeamsFranchises: var * { franchID: ?string, franchName: ?string, active: ?string, NAassoc: ?string }, TeamsHalf: var * { yearID: ?int32, lgID: ?string, teamID: ?string, Half: ?string, divID: ?string, DivWin: ?string, Rank: ?int32, G: ?int32, W: ?int32, L: ?int32 }, temp: var * {ID: ?int32, namefull: ?string, born: ?float64} }""")
Some storage systems don't cleanly support some datashapes. For example
Because of this we sometimes want to slightly change a datashape during migration.
A common example is the use of strings with a known maximum length, called "fixed length strings," and strings with particular character encodings, like ASCII or UTF-8.
Consider moving the following data, including strings, into a numpy array.
data = [{'name': 'Alice', 'balance': 100},
{'name': 'Bob', 'balance': 200}]
into(np.ndarray, data)
array([(100, 'Alice'), (200, 'Bob')], dtype=[('balance', '<i8'), ('name', 'O')])
Notice that the dtype for the name
field is 'O'
for Python object type. NumPy doesn't support arbitrary length strings and putting raw Python objects into a NumPy array can hurt performance. So we choose a maximum length and tell that to into with a datashape.
So how do we construct a datashape? We could write it out by hand, but that is error prone. Instead, we ask for the datashape of data
, then copy-paste, making a small adjustment
discover(data)
dshape("2 * {balance: int64, name: string}")
from into import dshape
ds = dshape("2 * {balance: int64, name: string[10]}") # max length of 10
into(np.ndarray, data, dshape=ds)
array([(100, u'Alice'), (200, u'Bob')], dtype=[('balance', '<i8'), ('name', '<U10')])
Note the new type, U10
, meaning Unicode of length ten or less. Our data happens to be ASCII, so we'll specialize our datashape even more.
ds = dshape("2 * {balance: int64, name: string[10, 'ascii']}") # max length of 10
into(np.ndarray, data, dshape=ds)
array([(100, 'Alice'), (200, 'Bob')], dtype=[('balance', '<i8'), ('name', 'S10')])
Move the data in 'data/iris.csv'
to the HDF5 dataset, 'data/myfile.hdf5::/iris'
. Annotate the datatype of the species
column to be ASCII with maximum length 30.
csv = resource('data/iris.csv')
discover(csv)
dshape("""var * { sepal_length: float64, sepal_width: float64, petal_length: float64, petal_width: float64, species: string }""")
Sometimes we store data in formats that lack metadata. Adding a datashape can help to fill in these gaps.
For example we might want to put the following latitude-longitude pairs into a DataFrame.
data = [(33.1, -89.2), (37, -141.5), (41, -120.5)]
into(pd.DataFrame, data)
0 | 1 | |
---|---|---|
0 | 33.1 | -89.2 |
1 | 37.0 | -141.5 |
2 | 41.0 | -120.5 |
Note that, because our original data was stored in a format that didn't include the column names, the output lacks them as well. We complement our data with a datashape.
ds = dshape('var * {lat: float64, long: float64}')
into(pd.DataFrame, data, dshape=ds)
lat | long | |
---|---|---|
0 | 33.1 | -89.2 |
1 | 37.0 | -141.5 |
2 | 41.0 | -120.5 |
resource
and datashape¶In the last section we used resource
to acquire existing datasets from string URIs.
We also use the resource
function to create new datasets given a URI and a datashape.
We create a new HDF5 dataset to store 100 by 100 integers.
ds = dshape('100 * 100 * int64')
dset = resource('myfile.hdf5::/x', dshape=ds)
dset
<HDF5 dataset "x": shape (100, 100), type "<i8">
Create a new SQLite table named transactions
in 'data/my.db'
with the following datashape
var * {name: string, balance: int, timestamp: datetime}
Here var
stands for "variable length" or generally "a dimension to which we don't know a fixed size."
Note that you could also have built this table using raw SQLAlchemy code. Knowing datashape lets you skip learning many libraries like SQLAlchemy and H5Py for simple tasks. into
serves as a single interface over many useful libraries.
import sqlalchemy as sa
engine = sa.create_engine('sqlite:///data/my.db')
metadata = sa.MetaData(engine)
transactions = sa.Table('transactions2', metadata,
sa.Column('name', sa.String),
sa.Column('balance', sa.Integer),
sa.Column('timestamp', sa.DateTime))
transactions.create()