import numpy as np
import pandas as pd
pd.options.display.max_rows = 6
pd.options.display.max_columns = 8
pd.options.display.width = 100
df = pd.read_hdf('data/beer.hdf','df')
df
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49997 | 8.1 | 21950 | 2372 | Terrapin Coffee Oatmeal Imperial Stout | ... | ugaterrapin | 4.5 | Poured a light sucking crude oil beckoning bl... | 2009-12-25 17:23:52 |
49998 | 4.6 | 5453 | 1306 | Badger Original Ale | ... | MrHurmateeowish | 3.5 | 500ml brown bottle, 4.0% ABV. Pours a crystal... | 2009-12-25 17:25:06 |
49999 | 9.4 | 47695 | 14879 | Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout | ... | strictly4DK | 4.5 | 22 oz bottle poured into a flute glass, share... | 2009-12-25 17:26:06 |
50000 rows × 13 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 50000 entries, 0 to 49999 Data columns (total 13 columns): abv 48389 non-null float64 beer_id 50000 non-null int64 brewer_id 50000 non-null int64 beer_name 50000 non-null object beer_style 50000 non-null object review_appearance 50000 non-null float64 review_aroma 50000 non-null float64 review_overall 50000 non-null float64 review_palate 50000 non-null float64 profile_name 50000 non-null object review_taste 50000 non-null float64 text 50000 non-null object time 50000 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(6), int64(2), object(4) memory usage: 5.3+ MB
df.beer_style
0 American Stout 1 American Porter 2 German Pilsener ... 49997 American Double / Imperial Stout 49998 English Pale Ale 49999 Russian Imperial Stout Name: beer_style, dtype: object
df.beer_style.str.len()
0 14 1 15 2 15 .. 49997 32 49998 16 49999 22 Name: beer_style, dtype: int64
df.beer_style.str.contains('[A|a]merican')
0 True 1 True 2 False ... 49997 True 49998 False 49999 False Name: beer_style, dtype: bool
df.time
0 2009-10-05 21:31:48 1 2009-10-05 21:32:09 2 2009-10-05 21:32:13 ... 49997 2009-12-25 17:23:52 49998 2009-12-25 17:25:06 49999 2009-12-25 17:26:06 Name: time, dtype: datetime64[ns]
df.time.dt.date
0 2009-10-05 1 2009-10-05 2 2009-10-05 ... 49997 2009-12-25 49998 2009-12-25 49999 2009-12-25 Name: time, dtype: object
df.time.dt.hour
0 21 1 21 2 21 .. 49997 17 49998 17 49999 17 Name: time, dtype: int64
df.select_dtypes(include=['object']).describe()
beer_name | beer_style | profile_name | text | |
---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 |
unique | 8762 | 104 | 4124 | 49978 |
top | Sierra Nevada Celebration Ale | American IPA | drabmuh | nan |
freq | 240 | 4324 | 242 | 9 |
df[['beer_style']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 50000 entries, 0 to 49999 Data columns (total 1 columns): beer_style 50000 non-null object dtypes: object(1) memory usage: 781.2+ KB
df['beer_style'] = df['beer_style'].astype('category')
df[['beer_style']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 50000 entries, 0 to 49999 Data columns (total 1 columns): beer_style 50000 non-null category dtypes: category(1) memory usage: 440.3 KB
df.beer_style.cat.codes
0 18 1 17 2 61 .. 49997 11 49998 47 49999 89 dtype: int8
df.beer_style.cat.categories
Index(['Altbier', 'American Adjunct Lager', 'American Amber / Red Ale', 'American Amber / Red Lager', 'American Barleywine', 'American Black Ale', 'American Blonde Ale', 'American Brown Ale', 'American Dark Wheat Ale', 'American Double / Imperial IPA', ... 'Scotch Ale / Wee Heavy', 'Scottish Ale', 'Scottish Gruit / Ancient Herbed Ale', 'Smoked Beer', 'Tripel', 'Vienna Lager', 'Weizenbock', 'Wheatwine', 'Winter Warmer', 'Witbier'], dtype='object', length=104)
df.beer_style.cat.ordered
False
Like a where clause in SQL. The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.
df.abv < 5
0 False 1 False 2 True ... 49997 False 49998 True 49999 False Name: abv, dtype: bool
df[df.abv < 5]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
7 | 4.8 | 1669 | 256 | Great White | ... | n0rc41 | 4.5 | Ok, for starters great white I believe will b... | 2009-10-05 21:34:29 |
21 | 4.6 | 401 | 118 | Dark Island | ... | abuliarose | 4.0 | Poured into a snifter, revealing black opaque... | 2009-10-05 21:47:36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49976 | 4.7 | 18552 | 9663 | Ichnusa | ... | gregalobeer | 2.0 | When in Sardinia I drink this beer, and altho... | 2009-12-25 15:43:48 |
49980 | 4.1 | 51464 | 29 | Bud Light Golden Wheat | ... | wolfpack87 | 3.5 | Not quite as good as other wheat beers, but i... | 2009-12-25 15:57:12 |
49998 | 4.6 | 5453 | 1306 | Badger Original Ale | ... | MrHurmateeowish | 3.5 | 500ml brown bottle, 4.0% ABV. Pours a crystal... | 2009-12-25 17:25:06 |
5399 rows × 13 columns
Notice that we just used []
there. We can pass the boolean indexer in to .loc
as well.
df.loc[df.abv < 5, ['beer_style', 'review_overall']]
beer_style | review_overall | |
---|---|---|
2 | German Pilsener | 3.0 |
7 | Witbier | 4.5 |
21 | Scottish Ale | 3.5 |
... | ... | ... |
49976 | Euro Pale Lager | 3.0 |
49980 | Herbed / Spiced Beer | 4.0 |
49998 | English Pale Ale | 4.0 |
5399 rows × 2 columns
Again, you can get complicated
df[((df.abv < 5) & (df.time > pd.Timestamp('2009-06'))) | (
df.review_overall >= 4.5)]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49997 | 8.1 | 21950 | 2372 | Terrapin Coffee Oatmeal Imperial Stout | ... | ugaterrapin | 4.5 | Poured a light sucking crude oil beckoning bl... | 2009-12-25 17:23:52 |
49998 | 4.6 | 5453 | 1306 | Badger Original Ale | ... | MrHurmateeowish | 3.5 | 500ml brown bottle, 4.0% ABV. Pours a crystal... | 2009-12-25 17:25:06 |
49999 | 9.4 | 47695 | 14879 | Barrel Aged B.O.R.I.S. Oatmeal Imperial Stout | ... | strictly4DK | 4.5 | 22 oz bottle poured into a flute glass, share... | 2009-12-25 17:26:06 |
18007 rows × 13 columns
Select just the rows where the beer_style
contains IPA.
df[df.beer_style.cat.contains('IPA')]
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-22-f0a360065e3d> in <module>() ----> 1 df[df.beer_style.cat.contains('IPA')] AttributeError: 'CategoricalAccessor' object has no attribute 'contains'
df[(df.beer_style.astype(object)).str.contains('IPA')]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
3 | 9.5 | 28577 | 3818 | Unearthly (Imperial India Pale Ale) | ... | nick76 | 4.0 | The aroma has pine, wood, citrus, caramel, an... | 2009-10-05 21:32:37 |
8 | 6.7 | 6549 | 140 | Northern Hemisphere Harvest Wet Hop Ale | ... | david18 | 4.0 | I like all of Sierra Nevada's beers but felt ... | 2009-10-05 21:34:31 |
16 | 8.0 | 36179 | 3818 | Hoppe (Imperial Extra Pale Ale) | ... | nick76 | 3.0 | The aroma is papery with citrus, yeast, and s... | 2009-10-05 21:43:23 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49947 | 7.0 | 709 | 199 | Big Eye IPA | ... | billshmeinke | 4.0 | 12oz into my pint glass.\t\tA: Golden honey a... | 2009-12-25 08:48:11 |
49984 | 6.0 | 38388 | 3718 | L'IPA Du Lièvre | ... | hlance | 4.5 | I love this beer, seek it out when I am in Mo... | 2009-12-25 16:25:45 |
49996 | 8.0 | 7971 | 863 | Pliny The Elder | ... | hrking | 4.0 | This is a big hoppy monster of an IPA..If you... | 2009-12-25 17:23:24 |
7774 rows × 13 columns
cats = (df
.beer_style
.cat
.categories[df.beer_style.cat.categories.str.contains('IPA')]
)
cats
Index(['American Double / Imperial IPA', 'American IPA', 'Belgian IPA', 'English India Pale Ale (IPA)'], dtype='object')
df.beer_style.isin(cats)
0 False 1 False 2 False ... 49997 False 49998 False 49999 False Name: beer_style, dtype: bool
df[df.beer_style.isin(cats)]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
3 | 9.5 | 28577 | 3818 | Unearthly (Imperial India Pale Ale) | ... | nick76 | 4.0 | The aroma has pine, wood, citrus, caramel, an... | 2009-10-05 21:32:37 |
8 | 6.7 | 6549 | 140 | Northern Hemisphere Harvest Wet Hop Ale | ... | david18 | 4.0 | I like all of Sierra Nevada's beers but felt ... | 2009-10-05 21:34:31 |
16 | 8.0 | 36179 | 3818 | Hoppe (Imperial Extra Pale Ale) | ... | nick76 | 3.0 | The aroma is papery with citrus, yeast, and s... | 2009-10-05 21:43:23 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49947 | 7.0 | 709 | 199 | Big Eye IPA | ... | billshmeinke | 4.0 | 12oz into my pint glass.\t\tA: Golden honey a... | 2009-12-25 08:48:11 |
49984 | 6.0 | 38388 | 3718 | L'IPA Du Lièvre | ... | hlance | 4.5 | I love this beer, seek it out when I am in Mo... | 2009-12-25 16:25:45 |
49996 | 8.0 | 7971 | 863 | Pliny The Elder | ... | hrking | 4.0 | This is a big hoppy monster of an IPA..If you... | 2009-12-25 17:23:24 |
7774 rows × 13 columns
This is quite powerful. Any method that returns a boolean array is potentially an indexer.
beer_ids = df.beer_id.value_counts()
beer_ids
1904 240 53863 208 52441 158 ... 41285 1 47430 1 53274 1 Name: beer_id, dtype: int64
df[df.beer_id.isin(beer_ids[0:3].index)]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
142 | 8.6 | 52441 | 147 | Stone 09.09.09 Vertical Epic Ale | ... | aubuc1 | 4.5 | Poured in to a chimay goblet.\t\tPours black ... | 2009-10-06 00:10:46 |
446 | 8.6 | 52441 | 147 | Stone 09.09.09 Vertical Epic Ale | ... | corby112 | 3.5 | Pours pitch black and completely opaque witho... | 2009-10-06 07:59:01 |
714 | 8.6 | 52441 | 147 | Stone 09.09.09 Vertical Epic Ale | ... | alcstradamus | 3.5 | Near black pour with a mountainous 3 finger h... | 2009-10-06 21:31:06 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49715 | 6.8 | 1904 | 140 | Sierra Nevada Celebration Ale | ... | betterbgood | 2.5 | I've tried this beer on tap about a year ago,... | 2009-12-25 01:13:38 |
49844 | 8.6 | 52441 | 147 | Stone 09.09.09 Vertical Epic Ale | ... | russpowell | 4.0 | thanks to colonelforbin for this one!\t\tPour... | 2009-12-25 04:10:49 |
49863 | 5.5 | 53863 | 28 | Our Special Ale 2009 (Anchor Christmas Ale) | ... | FickleBeast | 4.0 | 12 oz bottle poured into a pint glass \t\tPou... | 2009-12-25 04:38:07 |
606 rows × 13 columns
df.head(5)
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 2511 | 287 | Bell's Cherry Stout | ... | blaheath | 4.5 | Batch 8144\tPitch black in color with a 1/2 f... | 2009-10-05 21:31:48 |
1 | 5.7 | 19736 | 9790 | Duck-Rabbit Porter | ... | GJ40 | 4.0 | Sampled from a 12oz bottle in a standard pint... | 2009-10-05 21:32:09 |
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
3 | 9.5 | 28577 | 3818 | Unearthly (Imperial India Pale Ale) | ... | nick76 | 4.0 | The aroma has pine, wood, citrus, caramel, an... | 2009-10-05 21:32:37 |
4 | 5.8 | 398 | 119 | Wolaver's Pale Ale | ... | champ103 | 3.0 | A: Pours a slightly hazy golden/orange color.... | 2009-10-05 21:33:14 |
5 rows × 13 columns
df.iloc[[2,5,10]]
abv | beer_id | brewer_id | beer_name | ... | profile_name | review_taste | text | time | |
---|---|---|---|---|---|---|---|---|---|
2 | 4.8 | 11098 | 3182 | Fürstenberg Premium Pilsener | ... | biegaman | 3.5 | Haystack yellow with an energetic group of bu... | 2009-10-05 21:32:13 |
5 | 7.0 | 966 | 365 | Pike Street XXXXX Stout | ... | sprucetip | 4.5 | From notes. Pours black, thin mocha head fade... | 2009-10-05 21:33:48 |
10 | 11.8 | 43670 | 423 | Bourbon Barrel Quad (BBQ) | ... | blaheath | 4.5 | Burnt amber in color with a 1/4" head. Aroma ... | 2009-10-05 21:36:03 |
3 rows × 13 columns
df.iloc[[2,5,10],0:3]
abv | beer_id | brewer_id | |
---|---|---|---|
2 | 4.8 | 11098 | 3182 |
5 | 7.0 | 966 | 365 |
10 | 11.8 | 43670 | 423 |
df.loc[[2,5,10],['beer_id','time']]
beer_id | time | |
---|---|---|
2 | 11098 | 2009-10-05 21:32:13 |
5 | 966 | 2009-10-05 21:33:48 |
10 | 43670 | 2009-10-05 21:36:03 |
df.loc[df.beer_id.isin(beer_ids[0:3].index),['beer_id','time']]
beer_id | time | |
---|---|---|
142 | 52441 | 2009-10-06 00:10:46 |
446 | 52441 | 2009-10-06 07:59:01 |
714 | 52441 | 2009-10-06 21:31:06 |
... | ... | ... |
49715 | 1904 | 2009-12-25 01:13:38 |
49844 | 52441 | 2009-12-25 04:10:49 |
49863 | 53863 | 2009-12-25 04:38:07 |
606 rows × 2 columns
.loc
.ix
df.index.is_unique
True
df.set_index('beer_id').index.is_unique
False
One of the most powerful and most complicated features of pandas. Let's you represent high-dimensional datasets in a table.
reviews = df.set_index(['profile_name', 'beer_id', 'time']).sort_index()
reviews.head()
abv | brewer_id | beer_name | beer_style | ... | review_overall | review_palate | review_taste | text | |||
---|---|---|---|---|---|---|---|---|---|---|---|
profile_name | beer_id | time | |||||||||
01121987 | 29077 | 2009-11-30 03:44:42 | 9.0 | 11256 | Corne De Brume | Scotch Ale / Wee Heavy | ... | 5 | 4 | 3.5 | Poured into a belgian beer glass, not great h... |
05Harley | 1307 | 2009-10-06 00:10:06 | 8.5 | 428 | Der Weisse Bock | Weizenbock | ... | 4 | 4 | 4.0 | Can't find the date on this one.\t\tPurchased... |
2732 | 2009-12-12 01:21:36 | 8.0 | 287 | Bell's Consecrator Doppelbock | Doppelbock | ... | 4 | 4 | 4.5 | Bottle # 8881 (02/09)\t\tPurchased through We... | |
2899 | 2009-10-20 22:27:01 | 7.1 | 911 | Andechser Doppelbock Dunkel | Doppelbock | ... | 5 | 4 | 5.0 | Bottle # 300310\t\tPurchased through Kracked ... | |
3054 | 2009-11-21 02:17:41 | 5.0 | 946 | Piton Lager Beer | American Adjunct Lager | ... | 3 | 2 | 3.0 | Bottled in 2007.\t\tPurchased in St. Lucia @ ... |
5 rows × 10 columns
reviews.index.is_unique
True
top_reviewers = (reviews
.index
.get_level_values('profile_name')
.value_counts()
.head(5)
)
top_reviewers
drabmuh 242 corby112 230 BeerFMAndy 202 northyorksammy 201 mrmanning 187 Name: profile_name, dtype: int64
pd.options.display.max_rows=4
reviews.loc[top_reviewers.index, :, :]
abv | brewer_id | beer_name | beer_style | ... | review_overall | review_palate | review_taste | text | |||
---|---|---|---|---|---|---|---|---|---|---|---|
profile_name | beer_id | time | |||||||||
BeerFMAndy | 92 | 2009-12-24 21:51:46 | 7.2 | 147 | Arrogant Bastard Ale | American Strong Ale | ... | 4.5 | 4.0 | 4.0 | 22 oz bottle poured into a Sierra Nevada Impe... |
100 | 2009-10-22 03:39:21 | 5.7 | 306 | Blue Moon Harvest Moon Pumpkin Ale | Pumpkin Ale | ... | 2.0 | 2.5 | 2.0 | 12 oz bottle poured into an Imperial Pint. Ma... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
northyorksammy | 54918 | 2009-12-21 14:56:29 | 12.0 | 2097 | Kuhnhenn Bonicci Barley Wine | American Barleywine | ... | 3.0 | 3.5 | 3.5 | Winner of the amateur brewing fest,Jerry Boni... |
54919 | 2009-12-21 14:59:39 | 7.0 | 2097 | Kuhnhenn Foreign Export Stout | Foreign / Export Stout | ... | 3.5 | 4.0 | 3.5 | Thick stout, a study in chocolate. Very smoot... |
1062 rows × 10 columns
reviews.loc[[('BeerFMAndy',100,pd.Timestamp('2009-10-22 03:39:21'))]]
abv | brewer_id | beer_name | beer_style | ... | review_overall | review_palate | review_taste | text | |||
---|---|---|---|---|---|---|---|---|---|---|---|
profile_name | beer_id | time | |||||||||
BeerFMAndy | 100 | 2009-10-22 03:39:21 | 5.7 | 306 | Blue Moon Harvest Moon Pumpkin Ale | Pumpkin Ale | ... | 2 | 2.5 | 2 | 12 oz bottle poured into an Imperial Pint. Ma... |
1 rows × 10 columns
idx = pd.IndexSlice
reviews.loc[idx[top_reviewers.index, [92,54919], :],
['beer_name','beer_style']]
beer_name | beer_style | |||
---|---|---|---|---|
profile_name | beer_id | time | ||
BeerFMAndy | 92 | 2009-12-24 21:51:46 | Arrogant Bastard Ale | American Strong Ale |
northyorksammy | 54919 | 2009-12-21 14:59:39 | Kuhnhenn Foreign Export Stout | Foreign / Export Stout |
(reviews
.query('profile_name in @top_reviewers.index and \
beer_id in [92,54919]')
[['beer_name','beer_style']]
)
beer_name | beer_style | |||
---|---|---|---|---|
profile_name | beer_id | time | ||
BeerFMAndy | 92 | 2009-12-24 21:51:46 | Arrogant Bastard Ale | American Strong Ale |
northyorksammy | 54919 | 2009-12-21 14:59:39 | Kuhnhenn Foreign Export Stout | Foreign / Export Stout |
reviews.loc[idx[:, [92, 54919], :], ['beer_name' ,'beer_style']]
beer_name | beer_style | |||
---|---|---|---|---|
profile_name | beer_id | time | ||
Arnie2709 | 92 | 2009-11-23 17:16:39 | Arrogant Bastard Ale | American Strong Ale |
BatsforBeer | 92 | 2009-11-06 04:10:13 | Arrogant Bastard Ale | American Strong Ale |
... | ... | ... | ... | ... |
wahhmaster | 92 | 2009-10-07 18:02:47 | Arrogant Bastard Ale | American Strong Ale |
woosterbill | 92 | 2009-11-12 03:00:54 | Arrogant Bastard Ale | American Strong Ale |
72 rows × 2 columns