import numpy as np
import pandas as pd
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.__version__
'0.17.0'
We often have a variety of input data.
This is a subset of the data from beeradvocate.com, via Stanford. It's strangely formatted.
(This dataset is no longer available!)
beer/name: Sausa Weizen
beer/beerId: 47986
beer/brewerId: 10325
beer/ABV: 5.00
beer/style: Hefeweizen
review/appearance: 2.5
review/aroma: 2
review/time: 1234817823
review/profileName: stcules
review/text: A lot of foam. But a lot. In the smell some banana, and then lactic and tart. Not a good start. Quite dark orange in color, with a lively carbonation (now visible, under the foam). Again tending to lactic sourness. Same for the taste. With some yeast and banana.
beer/name: Red Moon
beer/beerId: 48213
beer/brewerId: 10325
beer/ABV: 6.20
...
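Before a cleaned CSV like the one below existed, this key/value layout had to be parsed by hand. A minimal sketch of one way to do that — the `parse_records` helper and the record-boundary rule (a repeated `beer/name` key starts a new record) are assumptions for illustration, not part of any published spec:

```python
import io

import pandas as pd

def parse_records(lines):
    """Parse 'section/key: value' lines into one dict per review.

    Assumes a new record starts each time the 'beer/name' key repeats.
    """
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(': ')
        if key == 'beer/name' and current:
            records.append(current)
            current = {}
        # drop the 'beer/' / 'review/' prefix for the column name
        current[key.split('/')[-1]] = value
    if current:
        records.append(current)
    return pd.DataFrame(records)

sample = """\
beer/name: Sausa Weizen
beer/beerId: 47986
beer/ABV: 5.00
beer/name: Red Moon
beer/beerId: 48213
beer/ABV: 6.20
"""
parsed = parse_records(io.StringIO(sample))
print(parsed)  # two rows, columns: name, beerId, ABV
```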
df = pd.read_csv('data/beer2.csv.gz',
index_col=0,
parse_dates=['time'],
encoding='utf-8')
df
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49997 | 8.1 | 21950 | 2372 | Terrapin Coffee Oatmeal Imperial Stout | ... | ugaterrapin | 4.5 | Poured a light sucking crude oil beckoning bl... | 2009-12-25 17:23:52 |
49998 | 4.6 | 5453 | 1306 | Badger Original Ale | ... | MrHurmateeowish | 3.5 | 500ml brown bottle, 4.0% ABV. Pours a crystal... | 2009-12-25 17:25:06 |
49999 | 9.4 | 47695 | 14879 | Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout | ... | strictly4DK | 4.5 | 22 oz bottle poured into a flute glass, share... | 2009-12-25 17:26:06 |
50000 rows × 13 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 13 columns):
abv                  48389 non-null float64
beer_id              50000 non-null int64
brewer_id            50000 non-null int64
beer_name            50000 non-null object
beer_style           50000 non-null object
review_appearance    50000 non-null float64
review_aroma         50000 non-null float64
review_overall       50000 non-null float64
review_palate        50000 non-null float64
profile_name         50000 non-null object
review_taste         50000 non-null float64
text                 49991 non-null object
time                 50000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 5.3+ MB
# we have some unicode
df.loc[2,'beer_name']
'Fürstenberg Premium Pilsener'
df.to_csv('data/beer.csv', index=False, encoding='utf-8')
df.to_excel('data/beer.xls', index=False, encoding='utf-8')
data = pd.read_excel('data/beer.xls', sheetname=0, encoding='utf-8')
from sqlalchemy import create_engine
!rm -f data/beer.sqlite
engine = create_engine('sqlite:///data/beer.sqlite')
df.to_sql('table', engine)
data = pd.read_sql('table', engine)
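One advantage of the SQL route is that you can push filtering into the database instead of reading the whole table back. A self-contained sketch using an in-memory SQLite database — the `reviews` table and `demo` frame here are made up for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory SQLite database

demo = pd.DataFrame({'abv': [5.0, 9.4, 4.6],
                     'beer_name': ['Sausa Weizen', 'BORIS', 'Badger']})
demo.to_sql('reviews', engine, index=False)

# only the matching rows ever leave the database
strong = pd.read_sql_query('SELECT * FROM reviews WHERE abv > 8', engine)
print(strong)
```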
df.to_json('data/beer.json')
data = pd.read_json('data/beer.json')
# fixed format
df.to_hdf('data/beer_mixed.hdf',
'df',
mode='w',
format='fixed',
encoding='utf-8')
/Users/jreback/miniconda/envs/python3/lib/python3.4/site-packages/pandas/core/generic.py:938: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot map directly to c-types
[inferred_type->mixed,key->block3_values] [items->['beer_name', 'beer_style', 'profile_name', 'text']]
  return pytables.to_hdf(path_or_buf, key, self, **kwargs)
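The fixed format is fast but opaque: you can only read the whole dataset back. HDF5's `table` format is slower to write but supports on-disk queries via `where`. A sketch, assuming PyTables is installed; the file name and `frame` are made up:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'abv': [5.0, 9.4, 4.6],
                      'beer_name': ['a', 'b', 'c']})

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
# format='table' enables querying; data_columns makes 'abv' filterable
frame.to_hdf(path, key='df', format='table', data_columns=['abv'])

# only rows matching the condition are read from disk
strong = pd.read_hdf(path, 'df', where='abv > 8')
print(strong)
```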
df[df.text.isnull()]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
6663 | 4.5 | 47562 | 13307 | It's Alright! | ... | Frogzilla | 1.0 | NaN | 2009-10-16 00:51:28 |
13012 | 8.0 | 2508 | 222 | Maredsous 8 - Dubbel | ... | Flightoficarus | 4.0 | NaN | 2009-10-28 08:09:25 |
21034 | 6.5 | 14378 | 2593 | Cherry Porter | ... | CurtisFagan | 4.5 | NaN | 2009-11-09 23:57:11 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
40357 | 9.5 | 47785 | 35 | Samuel Adams Double Bock (Imperial Series) | ... | zeapo | 4.0 | NaN | 2009-12-10 01:42:08 |
40968 | 9.2 | 47360 | 35 | Samuel Adams Imperial Stout | ... | zeapo | 3.0 | NaN | 2009-12-11 02:14:19 |
48669 | 5.0 | 924 | 142 | Franziskaner Hefe-Weisse Dunkel | ... | VTBobcat | 5.0 | NaN | 2009-12-23 13:54:37 |
9 rows × 13 columns
data = pd.read_hdf('data/beer_mixed.hdf', 'df', encoding='utf-8')
# wildly varying strings
df.text.str.len().describe()
count    49991.000000
mean       733.792003
std        392.219226
            ...
50%        642.000000
75%        900.000000
max       4902.000000
Name: text, dtype: float64
%timeit pd.read_excel('data/beer.xls', sheetname=0)
1 loops, best of 3: 2.85 s per loop
%timeit pd.read_sql('table', engine)
1 loops, best of 3: 662 ms per loop
%timeit pd.read_json('data/beer.json')
1 loops, best of 3: 1.15 s per loop
%timeit pd.read_csv('data/beer.csv', parse_dates=['time'])
1 loops, best of 3: 552 ms per loop
%timeit pd.read_hdf('data/beer_mixed.hdf', 'df')
10 loops, best of 3: 123 ms per loop
df.to_pickle('data/beer.pkl')
df.to_msgpack('data/beer.msgpack',encoding='utf-8')
%timeit pd.read_pickle('data/beer.pkl')
10 loops, best of 3: 40.2 ms per loop
%timeit pd.read_msgpack('data/beer.msgpack', encoding='utf-8')
10 loops, best of 3: 60.7 ms per loop
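Speed isn't the only axis: binary formats like pickle also round-trip dtypes exactly, while CSV flattens everything to text. A small sketch (the demo frame is made up):

```python
import io
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'time': pd.to_datetime(['2009-10-05 21:31:48']),
                      'abv': [7.0]})

# CSV round-trip: the datetime comes back as a plain string
# unless you re-parse it with parse_dates
csv_rt = pd.read_csv(io.StringIO(frame.to_csv(index=False)))
print(csv_rt['time'].dtype)   # object

# pickle preserves the dtype with no hints needed
path = os.path.join(tempfile.mkdtemp(), 'demo.pkl')
frame.to_pickle(path)
pkl_rt = pd.read_pickle(path)
print(pkl_rt['time'].dtype)   # datetime64[ns]
```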
See http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ for a broader comparison of serialization performance.
chunks = pd.read_csv('data/beer2.csv.gz',
index_col=0,
parse_dates=['time'],
chunksize=10000)
for i, chunk in enumerate(chunks):
    print("%d -> %d" % (i, len(chunk)))
0 -> 10000
1 -> 10000
2 -> 10000
3 -> 10000
4 -> 10000
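Chunking pays off when you reduce each piece as you go, so the full file never sits in memory at once. A sketch computing a mean incrementally over an in-memory CSV (the numbers are made up):

```python
import io

import pandas as pd

csv = io.StringIO("abv\n5.0\n6.2\n9.4\n4.6\n")

total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk['abv'].sum()
    count += len(chunk)

# running mean, without ever holding the whole file in memory
print(total / count)
```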
When choosing a storage format, consider:

- which formats provide good fidelity?
- which formats can you query?
- which formats can you iterate over?
- which formats provide better interoperability?
- which formats can you transmit over the wire?
- which formats have better compression?
- which formats allow multiple datasets in the same file?