import numpy as np
import pandas as pd
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.__version__
'0.17.0'
We often have a variety of input data.
This is a subset of the data from beeradvocate.com, via Stanford. It's strangely formatted.
(This dataset is no longer available!)
beer/name: Sausa Weizen
beer/beerId: 47986
beer/brewerId: 10325
beer/ABV: 5.00
beer/style: Hefeweizen
review/appearance: 2.5
review/aroma: 2
review/time: 1234817823
review/profileName: stcules
review/text: A lot of foam. But a lot. In the smell some banana, and then lactic and tart. Not a good start. Quite dark orange in color, with a lively carbonation (now visible, under the foam). Again tending to lactic sourness. Same for the taste. With some yeast and banana.
beer/name: Red Moon
beer/beerId: 48213
beer/brewerId: 10325
beer/ABV: 6.20
...
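Before a cleaned CSV like the one below existed, this key/value layout had to be parsed by hand. A minimal sketch of one way to do that — the `parse_records` helper and the record-boundary rule (a repeated `beer/name` key starts a new record) are assumptions for illustration, not part of any published spec:

```python
import io

import pandas as pd

def parse_records(lines):
    """Parse 'section/key: value' lines into one dict per review.

    Assumes a new record starts each time the 'beer/name' key repeats.
    """
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(': ')
        if key == 'beer/name' and current:
            records.append(current)
            current = {}
        # drop the 'beer/' / 'review/' prefix for the column name
        current[key.split('/')[-1]] = value
    if current:
        records.append(current)
    return pd.DataFrame(records)

sample = """\
beer/name: Sausa Weizen
beer/beerId: 47986
beer/ABV: 5.00
beer/name: Red Moon
beer/beerId: 48213
beer/ABV: 6.20
"""
parsed = parse_records(io.StringIO(sample))
print(parsed)  # two rows, columns: name, beerId, ABV
```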
df = pd.read_csv('data/beer2.csv.gz',
index_col=0,
parse_dates=['time'],
encoding='utf-8')
df
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49997 | 8.1 | 21950 | 2372 | Terrapin Coffee Oatmeal Imperial Stout | ... | ugaterrapin | 4.5 | Poured a light sucking crude oil beckoning bl... | 2009-12-25 17:23:52 |
49998 | 4.6 | 5453 | 1306 | Badger Original Ale | ... | MrHurmateeowish | 3.5 | 500ml brown bottle, 4.0% ABV. Pours a crystal... | 2009-12-25 17:25:06 |
49999 | 9.4 | 47695 | 14879 | Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout | ... | strictly4DK | 4.5 | 22 oz bottle poured into a flute glass, share... | 2009-12-25 17:26:06 |
50000 rows × 13 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 13 columns):
abv                  48389 non-null float64
beer_id              50000 non-null int64
brewer_id            50000 non-null int64
beer_name            50000 non-null object
beer_style           50000 non-null object
review_appearance    50000 non-null float64
review_aroma         50000 non-null float64
review_overall       50000 non-null float64
review_palate        50000 non-null float64
profile_name         50000 non-null object
review_taste         50000 non-null float64
text                 49991 non-null object
time                 50000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 5.3+ MB
# we have some unicode
df.loc[2,'beer_name']
'Fürstenberg Premium Pilsener'
df.to_csv('data/beer.csv', index=False, encoding='utf-8')
df.to_excel('data/beer.xls', index=False, encoding='utf-8')
data = pd.read_excel('data/beer.xls', sheetname=0, encoding='utf-8')
from sqlalchemy import create_engine
!rm -f data/beer.sqlite
engine = create_engine('sqlite:///data/beer.sqlite')
df.to_sql('table', engine)
data = pd.read_sql('table', engine)
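One advantage of the SQL route is that you can push filtering into the database instead of reading the whole table back. A self-contained sketch using an in-memory SQLite database — the `reviews` table and `demo` frame here are made up for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory SQLite database

demo = pd.DataFrame({'abv': [5.0, 9.4, 4.6],
                     'beer_name': ['Sausa Weizen', 'BORIS', 'Badger']})
demo.to_sql('reviews', engine, index=False)

# only the matching rows ever leave the database
strong = pd.read_sql_query('SELECT * FROM reviews WHERE abv > 8', engine)
print(strong)
```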
df.to_json('data/beer.json')
data = pd.read_json('data/beer.json')
# fixed format
df.to_hdf('data/beer_mixed.hdf',
'df',
mode='w',
format='fixed',
encoding='utf-8')
/Users/jreback/miniconda/envs/python3/lib/python3.4/site-packages/pandas/core/generic.py:938: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types
[inferred_type->mixed,key->block3_values] [items->['beer_name', 'beer_style', 'profile_name', 'text']]
  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
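The fixed format is fast but opaque: you can only read the whole dataset back. HDF5's `table` format is slower to write but supports on-disk queries via `where`. A sketch, assuming PyTables is installed; the file name and `frame` are made up:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'abv': [5.0, 9.4, 4.6],
                      'beer_name': ['a', 'b', 'c']})

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
# format='table' enables querying; data_columns makes 'abv' filterable
frame.to_hdf(path, key='df', format='table', data_columns=['abv'])

# only rows matching the condition are read from disk
strong = pd.read_hdf(path, 'df', where='abv > 8')
print(strong)
```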
df[df.text.isnull()]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
6663 | 4.5 | 47562 | 13307 | It's Alright! | ... | Frogzilla | 1.0 | NaN | 2009-10-16 00:51:28 |
13012 | 8.0 | 2508 | 222 | Maredsous 8 - Dubbel | ... | Flightoficarus | 4.0 | NaN | 2009-10-28 08:09:25 |
21034 | 6.5 | 14378 | 2593 | Cherry Porter | ... | CurtisFagan | 4.5 | NaN | 2009-11-09 23:57:11 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40357 | 9.5 | 47785 | 35 | Samuel Adams Double Bock (Imperial Series) | ... | zeapo | 4.0 | NaN | 2009-12-10 01:42:08 |
40968 | 9.2 | 47360 | 35 | Samuel Adams Imperial Stout | ... | zeapo | 3.0 | NaN | 2009-12-11 02:14:19 |
48669 | 5.0 | 924 | 142 | Franziskaner Hefe-Weisse Dunkel | ... | VTBobcat | 5.0 | NaN | 2009-12-23 13:54:37 |
9 rows × 13 columns
data = pd.read_hdf('data/beer_mixed.hdf', 'df', encoding='utf-8')
# wildly varying strings
df.text.str.len().describe()
count    49991.000000
mean       733.792003
std        392.219226
            ...
50%        642.000000
75%        900.000000
max       4902.000000
Name: text, dtype: float64
%timeit pd.read_excel('data/beer.xls', sheetname=0)
1 loops, best of 3: 2.85 s per loop
%timeit pd.read_sql('table', engine)
1 loops, best of 3: 662 ms per loop
%timeit pd.read_json('data/beer.json')
1 loops, best of 3: 1.15 s per loop
%timeit pd.read_csv('data/beer.csv', parse_dates=['time'])
1 loops, best of 3: 552 ms per loop
%timeit pd.read_hdf('data/beer_mixed.hdf', 'df')
10 loops, best of 3: 123 ms per loop
df.to_pickle('data/beer.pkl')
df.to_msgpack('data/beer.msgpack',encoding='utf-8')
%timeit pd.read_pickle('data/beer.pkl')
10 loops, best of 3: 40.2 ms per loop
%timeit pd.read_msgpack('data/beer.msgpack', encoding='utf-8')
10 loops, best of 3: 60.7 ms per loop
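Speed isn't the only axis: binary formats like pickle also round-trip dtypes exactly, while CSV flattens everything to text. A small sketch (the demo frame is made up):

```python
import io
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'time': pd.to_datetime(['2009-10-05 21:31:48']),
                      'abv': [7.0]})

# CSV round-trip: the datetime comes back as a plain string
# unless you re-parse it with parse_dates
csv_rt = pd.read_csv(io.StringIO(frame.to_csv(index=False)))
print(csv_rt['time'].dtype)   # object

# pickle preserves the dtype with no hints needed
path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
frame.to_pickle(path)
pkl_rt = pd.read_pickle(path)
print(pkl_rt['time'].dtype)   # datetime64[ns]
```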
See http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ for a broader comparison of serialization performance.
chunks = pd.read_csv('data/beer2.csv.gz',
index_col=0,
parse_dates=['time'],
chunksize=10000)
for i, chunk in enumerate(chunks):
    print("%d -> %d" % (i, len(chunk)))
0 -> 10000
1 -> 10000
2 -> 10000
3 -> 10000
4 -> 10000
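Chunking pays off when you reduce each piece as you go, so the full file never sits in memory at once. A sketch computing a mean incrementally over an in-memory CSV (the numbers are made up):

```python
import io

import pandas as pd

csv = io.StringIO("abv\n5.0\n6.2\n9.4\n4.6\n")

total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk['abv'].sum()
    count += len(chunk)

# running mean, without ever holding the whole file in memory
print(total / count)
```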
When choosing a storage format, consider:

- which formats provide good fidelity?
- which formats can you query?
- which formats can you iterate over?
- which formats provide better interoperability?
- which formats can you transmit over the wire?
- which formats have better compression?
- which formats allow multiple datasets in the same file?