Blaze Logo

Getting started with `into`¶

Full tutorial available at http://github.com/ContinuumIO/blaze-tutorial
Install software with conda install -c blaze blaze

2. Datatypes for Performance tuning¶

Do we store these as integers or floats? int32 or int64? int8?
Do we store times as datetimes or as strings?
Do we store these strings as variable length or fixed length?
Do we know how large this array will be?

As we encode values as bits we make choices; those choices can affect performance. We encode how to convert values to bits and back as a datatype. You've seen data types before in many forms including C types like long, double and double[100], numpy dtypes like i4 and f8 or Python types like int, and float. Other systems like SQL, HDF5, etc. have similar datatype systems with different names.

To manage datatypes across different systems we use datashape a datatype system that maps cleanly on to all systems with which into interacts. This one system can translate into any of the others.

In this section we'll talk about the following

How to discover the datatype of your data, no matter how it is stored
Minor tweaks you can do to that datatype to improve performance in certain storage systems
How to create new datasets easily

2.1 DataShape and `discover`¶

We introduce datashape, an all-encompassing datatype language, and discover, a function that does all of the work for you.

The discover function returns the datashape of an object. Lets look at a few examples.

In [1]:

from into import discover, into, resource
discover(1)

Out[1]:

ctype("int64")

In [2]:

discover([1, 2, 3])

Out[2]:

dshape("3 * int64")

In [3]:

discover([[1, 2, 3],
          [4, 5, 6]])

Out[3]:

dshape("2 * 3 * int64")

In [4]:

discover([{'x': 1, 'y': 1.0},
          {'x': 2, 'y': 2.0},
          {'x': 3, 'y': 3.0}])

Out[4]:

dshape("3 * {x: int64, y: float64}")

In [5]:

import pandas as pd
df = pd.DataFrame([['Alice',   100],
                   ['Bob',     200],
                   ['Charlie', 300]], columns=['name', 'balance'])
discover(df)

Out[5]:

dshape("3 * {name: string, balance: int64}")

By looking closely at these examples we see the structure of datashape. Elements have types like int64 or string. Records/structs/groups of elements have record dtypes like {x: int64, y: float64}. Lengths of collections are encoded by numbers like 3 * for "three of" or 2 * 3 * for "a two-by-three grid of".

Exercise¶

Construct data with the following datashapes. Use any container type (e.g. list, pd.DataFrame, np.ndarray).

2 * int64
2 * string
{name: string, id: int}
datetime
{name: string, id: int, payments: 2 * datetime}
2 * {name: string, id: int, payments: 2 * datetime}
5 * 5 * 5 * float32

Use discover to verify your answers.

In [6]:

# Should be 2 * int64
discover([1, 2])  

Out[6]:

dshape("2 * int64")

In [6]:

Discover doesn't care about storage format¶

The discover function doesn't care if your data lives in a Python list, Pandas DataFrame, NumPy Array, CSV file, PySpark RDD, or SQL database.

In other words, using into preserves datashape.

In [7]:

discover(df)

Out[7]:

dshape("3 * {name: string, balance: int64}")

In [8]:

import numpy as np
x = into(np.ndarray, df)
discover(x)  # different container, same datashape

Out[8]:

dshape("3 * {name: string, balance: int64}")

In [9]:

t = into('sqlite:///:memory:::mydf', df)
discover(t)  # different container, mostly the same datashape

Out[9]:

dshape("var * {name: string, balance: int64}")

Discover works on nested structures¶

Call discover on a single table of our baseball statistics database.

In [10]:

salaries = resource('sqlite:///data/lahman2013.sqlite::Salaries')
discover(salaries)

Out[10]:

dshape("""var * {
  yearID: ?int32,
  teamID: ?string,
  lgID: ?string,
  playerID: ?string,
  salary: ?float64
  }""")

And then call it on the entire database

In [11]:

db = resource('sqlite:///data/lahman2013.sqlite')
discover(db)

Out[11]:

dshape("""{
  AllstarFull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
    },
  Appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    GS: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
    },
  AwardsManagers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsPlayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  AwardsShareManagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
    },
  AwardsSharePlayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
    },
  Batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
    },
  BattingPost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64
    },
  FieldingOF: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
    },
  FieldingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
    },
  HallOfFame: var * {
    playerID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string,
    needed_note: ?string
    },
  Managers: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
    },
  ManagersHalf: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
    },
  Master: var * {
    playerID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameGiven: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?float64,
    finalGame: ?float64,
    retroID: ?string,
    bbrefID: ?string
    },
  Pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  PitchingPost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  Salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
    },
  Schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
    },
  SchoolsPlayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
    },
  SeriesPost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
    },
  Teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
    },
  TeamsFranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
    },
  TeamsHalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
    },
  temp: var * {ID: ?int32, namefull: ?string, born: ?float64}
  }""")

2.2 Tweaking Datashapes for Performance¶

Some storage systems don't cleanly support some datashapes. For example

SQL doesn't support nested data
HDF5 doesn't cleanly support datetimes
Variable length strings are often costly in binary stores
NumPy has poor missing value support

Because of this we sometimes want to slightly change a datashape during migration.

Fixed Length Strings¶

A common example is the use of strings with a known maximum length, called "fixed length strings," and strings with particular character encodings, like ASCII or UTF-8.

Consider moving the following data, including strings, into a numpy array.

In [12]:

data = [{'name': 'Alice', 'balance': 100},
        {'name': 'Bob',   'balance': 200}]
into(np.ndarray, data)

Out[12]:

array([(100, 'Alice'), (200, 'Bob')], 
      dtype=[('balance', '<i8'), ('name', 'O')])

Notice that the dtype for the name field is 'O' for Python object type. NumPy doesn't support arbitrary length strings and putting raw Python objects into a NumPy array can hurt performance. So we choose a maximum length and tell that to into with a datashape.

So how do we construct a datashape? We could write it out by hand, but that is error prone. Instead, we ask for the datashape of data, then copy-paste, making a small adjustment

In [13]:

discover(data)

Out[13]:

dshape("2 * {balance: int64, name: string}")

In [14]:

from into import dshape
ds = dshape("2 * {balance: int64, name: string[10]}") # max length of 10

into(np.ndarray, data, dshape=ds)

Out[14]:

array([(100, u'Alice'), (200, u'Bob')], 
      dtype=[('balance', '<i8'), ('name', '<U10')])

Note the new type, U10, meaning Unicode of length ten or less. Our data happens to be ASCII, so we'll specialize our datashape even more.

In [15]:

ds = dshape("2 * {balance: int64, name: string[10, 'ascii']}") # max length of 10
into(np.ndarray, data, dshape=ds)

Out[15]:

array([(100, 'Alice'), (200, 'Bob')], 
      dtype=[('balance', '<i8'), ('name', 'S10')])

Exercise¶

Move the data in 'data/iris.csv' to the HDF5 dataset, 'data/myfile.hdf5::/iris'. Annotate the datatype of the species column to be ASCII with maximum length 30.

In [16]:

csv = resource('data/iris.csv')
discover(csv)

Out[16]:

dshape("""var * {
  sepal_length: float64,
  sepal_width: float64,
  petal_length: float64,
  petal_width: float64,
  species: string
  }""")

In [16]:

Adding more metadata with DataShape¶

Sometimes we store data in formats that lack metadata. Adding a datashape can help to fill in these gaps.

For example we might want to put the following latitude-longitude pairs into a DataFrame.

In [17]:

data = [(33.1, -89.2), (37, -141.5), (41, -120.5)]
into(pd.DataFrame, data)

Out[17]:

	0	1
0	33.1	-89.2
1	37.0	-141.5
2	41.0	-120.5

Note that, because our original data was stored in a format that didn't include the column names, the output lacks them as well. We complement our data with a datashape.

In [18]:

ds = dshape('var * {lat: float64, long: float64}')
into(pd.DataFrame, data, dshape=ds)

Out[18]:

	lat	long
0	33.1	-89.2
1	37.0	-141.5
2	41.0	-120.5

Create new datasets with `resource` and datashape¶

In the last section we used resource to acquire existing datasets from string URIs.

We also use the resource function to create new datasets given a URI and a datashape.

Example¶

We create a new HDF5 dataset to store 100 by 100 integers.

In [19]:

ds = dshape('100 * 100 * int64')
dset = resource('myfile.hdf5::/x', dshape=ds)
dset

Out[19]:

<HDF5 dataset "x": shape (100, 100), type "<i8">

Exercise¶

Create a new SQLite table named transactions in 'data/my.db' with the following datashape

var * {name: string, balance: int, timestamp: datetime}

Here var stands for "variable length" or generally "a dimension to which we don't know a fixed size."

In [19]:

Note that you could also have built this table using raw SQLAlchemy code. Knowing datashape lets you skip learning many libraries like SQLAlchemy and H5Py for simple tasks. into serves as a single interface over many useful libraries.

In [20]:

import sqlalchemy as sa

engine = sa.create_engine('sqlite:///data/my.db')
metadata = sa.MetaData(engine)
transactions = sa.Table('transactions2', metadata,
                        sa.Column('name', sa.String),
                        sa.Column('balance', sa.Integer),
                        sa.Column('timestamp', sa.DateTime))
transactions.create()

In [20]:

Getting started with into¶

2. Datatypes for Performance tuning¶

2.1 DataShape and discover¶

Exercise¶

Discover doesn't care about storage format¶

Discover works on nested structures¶

2.2 Tweaking Datashapes for Performance¶

Fixed Length Strings¶

Exercise¶

Adding more metadata with DataShape¶

Create new datasets with resource and datashape¶

Example¶

Exercise¶

Getting started with `into`¶

2.1 DataShape and `discover`¶

Create new datasets with `resource` and datashape¶