Examining the Data¶

In [1]:

import numpy as np
import pandas as pd
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.options.display.width = 100

In [2]:

df = pd.read_hdf('data/beer.hdf','df')
df

Out[2]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
0	7.0	2511	287	Bell's Cherry Stout	...	blaheath	4.5	Batch 8144\tPitch black in color with a 1/2 f...	2009-10-05 21:31:48
1	5.7	19736	9790	Duck-Rabbit Porter	...	GJ40	4.0	Sampled from a 12oz bottle in a standard pint...	2009-10-05 21:32:09
2	4.8	11098	3182	Fürstenberg Premium Pilsener	...	biegaman	3.5	Haystack yellow with an energetic group of bu...	2009-10-05 21:32:13
...	...	...	...	...	...	...	...	...	...
49997	8.1	21950	2372	Terrapin Coffee Oatmeal Imperial Stout	...	ugaterrapin	4.5	Poured a light sucking crude oil beckoning bl...	2009-12-25 17:23:52
49998	4.6	5453	1306	Badger Original Ale	...	MrHurmateeowish	3.5	500ml brown bottle, 4.0% ABV. Pours a crystal...	2009-12-25 17:25:06
49999	9.4	47695	14879	Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout	...	strictly4DK	4.5	22 oz bottle poured into a flute glass, share...	2009-12-25 17:26:06

50000 rows × 13 columns

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 13 columns):
abv                  48389 non-null float64
beer_id              50000 non-null int64
brewer_id            50000 non-null int64
beer_name            50000 non-null object
beer_style           50000 non-null object
review_appearance    50000 non-null float64
review_aroma         50000 non-null float64
review_overall       50000 non-null float64
review_palate        50000 non-null float64
profile_name         50000 non-null object
review_taste         50000 non-null float64
text                 50000 non-null object
time                 50000 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 5.3+ MB

Text Data & .str accessor¶

http://pandas.pydata.org/pandas-docs/stable/text.html

In [4]:

df.beer_style

Out[4]:

0                          American Stout
1                         American Porter
2                         German Pilsener
                       ...               
49997    American Double / Imperial Stout
49998                    English Pale Ale
49999              Russian Imperial Stout
Name: beer_style, dtype: object

In [5]:

df.beer_style.str.len()

Out[5]:

0        14
1        15
2        15
         ..
49997    32
49998    16
49999    22
Name: beer_style, dtype: int64

In [6]:

df.beer_style.str.contains('[A|a]merican')

Out[6]:

0         True
1         True
2        False
         ...  
49997     True
49998    False
49999    False
Name: beer_style, dtype: bool

Datetime Data & .dt accessor¶

http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components

In [7]:

df.time

Out[7]:

0       2009-10-05 21:31:48
1       2009-10-05 21:32:09
2       2009-10-05 21:32:13
                ...        
49997   2009-12-25 17:23:52
49998   2009-12-25 17:25:06
49999   2009-12-25 17:26:06
Name: time, dtype: datetime64[ns]

In [8]:

df.time.dt.date

Out[8]:

0        2009-10-05
1        2009-10-05
2        2009-10-05
            ...    
49997    2009-12-25
49998    2009-12-25
49999    2009-12-25
Name: time, dtype: object

In [9]:

df.time.dt.hour

Out[9]:

0        21
1        21
2        21
         ..
49997    17
49998    17
49999    17
Name: time, dtype: int64

Categoricals & .cat accessor¶

http://pandas.pydata.org/pandas-docs/stable/categorical.html

In [10]:

df.select_dtypes(include=['object']).describe()

Out[10]:

	beer_name	beer_style	profile_name	text
count	50000	50000	50000	50000
unique	8762	104	4124	49978
top	Sierra Nevada Celebration Ale	American IPA	drabmuh	nan
freq	240	4324	242	9

In [11]:

df[['beer_style']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 1 columns):
beer_style    50000 non-null object
dtypes: object(1)
memory usage: 781.2+ KB

In [12]:

df['beer_style'] = df['beer_style'].astype('category')

In [13]:

df[['beer_style']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50000 entries, 0 to 49999
Data columns (total 1 columns):
beer_style    50000 non-null category
dtypes: category(1)
memory usage: 440.3 KB

In [14]:

df.beer_style.cat.codes

Out[14]:

0        18
1        17
2        61
         ..
49997    11
49998    47
49999    89
dtype: int8

In [15]:

df.beer_style.cat.categories

Out[15]:

Index(['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale',
       'American Amber / Red Lager', 'American Barleywine', 'American Black Ale',
       'American Blonde Ale', 'American Brown Ale', 'American Dark Wheat Ale',
       'American Double / Imperial IPA',
       ...
       'Scotch Ale / Wee Heavy', 'Scottish Ale', 'Scottish Gruit / Ancient Herbed Ale',
       'Smoked Beer', 'Tripel', 'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer',
       'Witbier'],
      dtype='object', length=104)

In [16]:

df.beer_style.cat.ordered

Out[16]:

False

Indexing¶

http://pandas.pydata.org/pandas-docs/stable/indexing.html

Boolean indexing¶

Like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.

In [17]:

df.abv < 5

Out[17]:

0        False
1        False
2         True
         ...  
49997    False
49998     True
49999    False
Name: abv, dtype: bool

In [18]:

df[df.abv < 5]

Out[18]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
2	4.8	11098	3182	Fürstenberg Premium Pilsener	...	biegaman	3.5	Haystack yellow with an energetic group of bu...	2009-10-05 21:32:13
7	4.8	1669	256	Great White	...	n0rc41	4.5	Ok, for starters great white I believe will b...	2009-10-05 21:34:29
21	4.6	401	118	Dark Island	...	abuliarose	4.0	Poured into a snifter, revealing black opaque...	2009-10-05 21:47:36
...	...	...	...	...	...	...	...	...	...
49976	4.7	18552	9663	Ichnusa	...	gregalobeer	2.0	When in Sardinia I drink this beer, and altho...	2009-12-25 15:43:48
49980	4.1	51464	29	Bud Light Golden Wheat	...	wolfpack87	3.5	Not quite as good as other wheat beers, but i...	2009-12-25 15:57:12
49998	4.6	5453	1306	Badger Original Ale	...	MrHurmateeowish	3.5	500ml brown bottle, 4.0% ABV. Pours a crystal...	2009-12-25 17:25:06

5399 rows × 13 columns

Notice that we just used [] there. We can pass the boolean indexer in to .loc as well.

In [19]:

df.loc[df.abv < 5, ['beer_style', 'review_overall']]

Out[19]:

	beer_style	review_overall
2	German Pilsener	3.0
7	Witbier	4.5
21	Scottish Ale	3.5
...	...	...
49976	Euro Pale Lager	3.0
49980	Herbed / Spiced Beer	4.0
49998	English Pale Ale	4.0

5399 rows × 2 columns

Again, you can get complicated

In [20]:

df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (
        df.review_overall >= 4.5)]

Out[20]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
0	7.0	2511	287	Bell's Cherry Stout	...	blaheath	4.5	Batch 8144\tPitch black in color with a 1/2 f...	2009-10-05 21:31:48
1	5.7	19736	9790	Duck-Rabbit Porter	...	GJ40	4.0	Sampled from a 12oz bottle in a standard pint...	2009-10-05 21:32:09
2	4.8	11098	3182	Fürstenberg Premium Pilsener	...	biegaman	3.5	Haystack yellow with an energetic group of bu...	2009-10-05 21:32:13
...	...	...	...	...	...	...	...	...	...
49997	8.1	21950	2372	Terrapin Coffee Oatmeal Imperial Stout	...	ugaterrapin	4.5	Poured a light sucking crude oil beckoning bl...	2009-12-25 17:23:52
49998	4.6	5453	1306	Badger Original Ale	...	MrHurmateeowish	3.5	500ml brown bottle, 4.0% ABV. Pours a crystal...	2009-12-25 17:25:06
49999	9.4	47695	14879	Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout	...	strictly4DK	4.5	22 oz bottle poured into a flute glass, share...	2009-12-25 17:26:06

18007 rows × 13 columns

isin¶

Select just the rows where the beer_style contains IPA.

In [22]:

df[df.beer_style.cat.contains('IPA')]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-22-f0a360065e3d> in <module>()
----> 1 df[df.beer_style.cat.contains('IPA')]

AttributeError: 'CategoricalAccessor' object has no attribute 'contains'

In [23]:

df[(df.beer_style.astype(object)).str.contains('IPA')]

Out[23]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
3	9.5	28577	3818	Unearthly (Imperial India Pale Ale)	...	nick76	4.0	The aroma has pine, wood, citrus, caramel, an...	2009-10-05 21:32:37
8	6.7	6549	140	Northern Hemisphere Harvest Wet Hop Ale	...	david18	4.0	I like all of Sierra Nevada's beers but felt ...	2009-10-05 21:34:31
16	8.0	36179	3818	Hoppe (Imperial Extra Pale Ale)	...	nick76	3.0	The aroma is papery with citrus, yeast, and s...	2009-10-05 21:43:23
...	...	...	...	...	...	...	...	...	...
49947	7.0	709	199	Big Eye IPA	...	billshmeinke	4.0	12oz into my pint glass.\t\tA: Golden honey a...	2009-12-25 08:48:11
49984	6.0	38388	3718	L'IPA Du Lièvre	...	hlance	4.5	I love this beer, seek it out when I am in Mo...	2009-12-25 16:25:45
49996	8.0	7971	863	Pliny The Elder	...	hrking	4.0	This is a big hoppy monster of an IPA..If you...	2009-12-25 17:23:24

7774 rows × 13 columns

Is there another / better way?¶

In [24]:

cats = (df
          .beer_style
          .cat
          .categories[df.beer_style.cat.categories.str.contains('IPA')]
        )
cats

Out[24]:

Index(['American Double / Imperial IPA', 'American IPA', 'Belgian IPA',
       'English India Pale Ale (IPA)'],
      dtype='object')

In [25]:

df.beer_style.isin(cats)

Out[25]:

0        False
1        False
2        False
         ...  
49997    False
49998    False
49999    False
Name: beer_style, dtype: bool

In [26]:

df[df.beer_style.isin(cats)]

Out[26]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
3	9.5	28577	3818	Unearthly (Imperial India Pale Ale)	...	nick76	4.0	The aroma has pine, wood, citrus, caramel, an...	2009-10-05 21:32:37
8	6.7	6549	140	Northern Hemisphere Harvest Wet Hop Ale	...	david18	4.0	I like all of Sierra Nevada's beers but felt ...	2009-10-05 21:34:31
16	8.0	36179	3818	Hoppe (Imperial Extra Pale Ale)	...	nick76	3.0	The aroma is papery with citrus, yeast, and s...	2009-10-05 21:43:23
...	...	...	...	...	...	...	...	...	...
49947	7.0	709	199	Big Eye IPA	...	billshmeinke	4.0	12oz into my pint glass.\t\tA: Golden honey a...	2009-12-25 08:48:11
49984	6.0	38388	3718	L'IPA Du Lièvre	...	hlance	4.5	I love this beer, seek it out when I am in Mo...	2009-12-25 16:25:45
49996	8.0	7971	863	Pliny The Elder	...	hrking	4.0	This is a big hoppy monster of an IPA..If you...	2009-12-25 17:23:24

7774 rows × 13 columns

This is quite powerful. Any method that returns a boolean array is potentially an indexer.

In [27]:

beer_ids = df.beer_id.value_counts()
beer_ids

Out[27]:

1904     240
53863    208
52441    158
        ... 
41285      1
47430      1
53274      1
Name: beer_id, dtype: int64

In [28]:

df[df.beer_id.isin(beer_ids[0:3].index)]

Out[28]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
142	8.6	52441	147	Stone 09.09.09 Vertical Epic Ale	...	aubuc1	4.5	Poured in to a chimay goblet.\t\tPours black ...	2009-10-06 00:10:46
446	8.6	52441	147	Stone 09.09.09 Vertical Epic Ale	...	corby112	3.5	Pours pitch black and completely opaque witho...	2009-10-06 07:59:01
714	8.6	52441	147	Stone 09.09.09 Vertical Epic Ale	...	alcstradamus	3.5	Near black pour with a mountainous 3 finger h...	2009-10-06 21:31:06
...	...	...	...	...	...	...	...	...	...
49715	6.8	1904	140	Sierra Nevada Celebration Ale	...	betterbgood	2.5	I've tried this beer on tap about a year ago,...	2009-12-25 01:13:38
49844	8.6	52441	147	Stone 09.09.09 Vertical Epic Ale	...	russpowell	4.0	thanks to colonelforbin for this one!\t\tPour...	2009-12-25 04:10:49
49863	5.5	53863	28	Our Special Ale 2009 (Anchor Christmas Ale)	...	FickleBeast	4.0	12 oz bottle poured into a pint glass \t\tPou...	2009-12-25 04:38:07

606 rows × 13 columns

Positional Indexing¶

In [29]:

df.head(5)

Out[29]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
0	7.0	2511	287	Bell's Cherry Stout	...	blaheath	4.5	Batch 8144\tPitch black in color with a 1/2 f...	2009-10-05 21:31:48
1	5.7	19736	9790	Duck-Rabbit Porter	...	GJ40	4.0	Sampled from a 12oz bottle in a standard pint...	2009-10-05 21:32:09
2	4.8	11098	3182	Fürstenberg Premium Pilsener	...	biegaman	3.5	Haystack yellow with an energetic group of bu...	2009-10-05 21:32:13
3	9.5	28577	3818	Unearthly (Imperial India Pale Ale)	...	nick76	4.0	The aroma has pine, wood, citrus, caramel, an...	2009-10-05 21:32:37
4	5.8	398	119	Wolaver's Pale Ale	...	champ103	3.0	A: Pours a slightly hazy golden/orange color....	2009-10-05 21:33:14

5 rows × 13 columns

In [30]:

df.iloc[[2,5,10]]

Out[30]:

	abv	beer_id	brewer_id	beer_name	...	profile_name	review_taste	text	time
2	4.8	11098	3182	Fürstenberg Premium Pilsener	...	biegaman	3.5	Haystack yellow with an energetic group of bu...	2009-10-05 21:32:13
5	7.0	966	365	Pike Street XXXXX Stout	...	sprucetip	4.5	From notes. Pours black, thin mocha head fade...	2009-10-05 21:33:48
10	11.8	43670	423	Bourbon Barrel Quad (BBQ)	...	blaheath	4.5	Burnt amber in color with a 1/4" head. Aroma ...	2009-10-05 21:36:03

3 rows × 13 columns

In [31]:

df.iloc[[2,5,10],0:3]

Out[31]:

	abv	beer_id	brewer_id
2	4.8	11098	3182
5	7.0	966	365
10	11.8	43670	423

Location Based Indexing¶

In [32]:

df.loc[[2,5,10],['beer_id','time']]

Out[32]:

	beer_id	time
2	11098	2009-10-05 21:32:13
5	966	2009-10-05 21:33:48
10	43670	2009-10-05 21:36:03

In [33]:

df.loc[df.beer_id.isin(beer_ids[0:3].index),['beer_id','time']]

Out[33]:

	beer_id	time
142	52441	2009-10-06 00:10:46
446	52441	2009-10-06 07:59:01
714	52441	2009-10-06 21:31:06
...	...	...
49715	1904	2009-12-25 01:13:38
49844	52441	2009-12-25 04:10:49
49863	53863	2009-12-25 04:38:07

606 rows × 2 columns

Questions¶

why we use .loc
why do we care about uniqueness
why do we use .ix

In [34]:

df.index.is_unique

Out[34]:

True

In [35]:

df.set_index('beer_id').index.is_unique

Out[35]:

False

Hierarchical Indexing¶

One of the most powerful and most complicated features of pandas. Let's you represent high-dimensional datasets in a table.

In [36]:

reviews = df.set_index(['profile_name', 'beer_id', 'time']).sort_index()
reviews.head()

Out[36]:

			abv	brewer_id	beer_name	beer_style	...	review_overall	review_palate	review_taste	text
profile_name	beer_id	time
01121987	29077	2009-11-30 03:44:42	9.0	11256	Corne De Brume	Scotch Ale / Wee Heavy	...	5	4	3.5	Poured into a belgian beer glass, not great h...
05Harley	1307	2009-10-06 00:10:06	8.5	428	Der Weisse Bock	Weizenbock	...	4	4	4.0	Can't find the date on this one.\t\tPurchased...
	2732	2009-12-12 01:21:36	8.0	287	Bell's Consecrator Doppelbock	Doppelbock	...	4	4	4.5	Bottle # 8881 (02/09)\t\tPurchased through We...
	2899	2009-10-20 22:27:01	7.1	911	Andechser Doppelbock Dunkel	Doppelbock	...	5	4	5.0	Bottle # 300310\t\tPurchased through Kracked ...
	3054	2009-11-21 02:17:41	5.0	946	Piton Lager Beer	American Adjunct Lager	...	3	2	3.0	Bottled in 2007.\t\tPurchased in St. Lucia @ ...

5 rows × 10 columns

Why is this useful¶

In [37]:

reviews.index.is_unique

Out[37]:

True

In [38]:

top_reviewers = (reviews
                        .index
                        .get_level_values('profile_name')
                        .value_counts()
                        .head(5)
                )
top_reviewers

Out[38]:

drabmuh           242
corby112          230
BeerFMAndy        202
northyorksammy    201
mrmanning         187
Name: profile_name, dtype: int64

In [39]:

pd.options.display.max_rows=4
reviews.loc[top_reviewers.index, :, :]

Out[39]:

			abv	brewer_id	beer_name	beer_style	...	review_overall	review_palate	review_taste	text
profile_name	beer_id	time
BeerFMAndy	92	2009-12-24 21:51:46	7.2	147	Arrogant Bastard Ale	American Strong Ale	...	4.5	4.0	4.0	22 oz bottle poured into a Sierra Nevada Impe...
BeerFMAndy	100	2009-10-22 03:39:21	5.7	306	Blue Moon Harvest Moon Pumpkin Ale	Pumpkin Ale	...	2.0	2.5	2.0	12 oz bottle poured into an Imperial Pint. Ma...
...	...	...	...	...	...	...	...	...	...	...	...
northyorksammy	54918	2009-12-21 14:56:29	12.0	2097	Kuhnhenn Bonicci Barley Wine	American Barleywine	...	3.0	3.5	3.5	Winner of the amateur brewing fest,Jerry Boni...
northyorksammy	54919	2009-12-21 14:59:39	7.0	2097	Kuhnhenn Foreign Export Stout	Foreign / Export Stout	...	3.5	4.0	3.5	Thick stout, a study in chocolate. Very smoot...

1062 rows × 10 columns

In [40]:

reviews.loc[[('BeerFMAndy',100,pd.Timestamp('2009-10-22 03:39:21'))]]

Out[40]:

			abv	brewer_id	beer_name	beer_style	...	review_overall	review_palate	review_taste	text
profile_name	beer_id	time
BeerFMAndy	100	2009-10-22 03:39:21	5.7	306	Blue Moon Harvest Moon Pumpkin Ale	Pumpkin Ale	...	2	2.5	2	12 oz bottle poured into an Imperial Pint. Ma...

1 rows × 10 columns

Specifying multi-axis indexers¶

In [41]:

idx = pd.IndexSlice

In [42]:

reviews.loc[idx[top_reviewers.index, [92,54919], :], 
               ['beer_name','beer_style']]

Out[42]:

			beer_name	beer_style
profile_name	beer_id	time
BeerFMAndy	92	2009-12-24 21:51:46	Arrogant Bastard Ale	American Strong Ale
northyorksammy	54919	2009-12-21 14:59:39	Kuhnhenn Foreign Export Stout	Foreign / Export Stout

In [43]:

(reviews
        .query('profile_name in @top_reviewers.index and \
                beer_id in [92,54919]')
        [['beer_name','beer_style']]
 )

Out[43]:

			beer_name	beer_style
profile_name	beer_id	time
BeerFMAndy	92	2009-12-24 21:51:46	Arrogant Bastard Ale	American Strong Ale
northyorksammy	54919	2009-12-21 14:59:39	Kuhnhenn Foreign Export Stout	Foreign / Export Stout

In [44]:

reviews.loc[idx[:, [92, 54919], :], ['beer_name' ,'beer_style']]

Out[44]:

			beer_name	beer_style
profile_name	beer_id	time
Arnie2709	92	2009-11-23 17:16:39	Arrogant Bastard Ale	American Strong Ale
BatsforBeer	92	2009-11-06 04:10:13	Arrogant Bastard Ale	American Strong Ale
...	...	...	...	...
wahhmaster	92	2009-10-07 18:02:47	Arrogant Bastard Ale	American Strong Ale
woosterbill	92	2009-11-12 03:00:54	Arrogant Bastard Ale	American Strong Ale

72 rows × 2 columns