Tutorial on text classification

Analyzing Amazon product reviews

Yury Kashnitskiy, Data Science Lab

We'll be analyzing Amazon product reviews. We took a sample of 100k grocery reviews. The prepared zipped .csv file is here.

Outline:

  1. Simple text features
    1.1. Bag of Words
    1.2. Tf-Idf vectorization
  2. Simple text classification
  3. Understanding the model
    3.1. Confusion matrix
    3.2. Visualizing coefficients
    3.3. ELI5 ("Explain Like I'm 5")
  4. Hierarchical text classification
In [1]:
PATH_TO_DATA = '/home/yorko/Documents/data/amazon_reviews_sample100k_grocery.csv.zip'
In [2]:
# some necessary imports
import os
import pickle
import json
from pprint import pprint
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
In [3]:
df = pd.read_csv(PATH_TO_DATA)
In [4]:
df.head()
Out[4]:
productId Title userId Helpfulness Score Time Text Cat1 Cat2 Cat3
0 B0000DF3IX Paprika Hungarian Sweet A244MHL2UN2EYL 0/0 5.0 1127088000 While in Hungary we were given a recipe for Hu... grocery gourmet food herbs spices seasonings
1 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A3FL7SXVYMC5NR 3/3 5.0 1138147200 Without a doubt, I would recommend this wholes... grocery gourmet food breakfast foods cereals
2 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A12IDQSS4OW33B 3/3 5.0 1118016000 This cereal is so sweet....yet so good for you... grocery gourmet food breakfast foods cereals
3 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) A2GZKHC1M4PKF4 2/2 3.0 1206489600 Man I love Oh's cereal. It is really great to ... grocery gourmet food breakfast foods cereals
4 B0002QF1LK Quaker Honey Graham Oh's 10.5 oz - (6 pack) AUGT2DOGKLHIN 2/2 5.0 1177545600 And I've tried alot of cereals. This is by far... grocery gourmet food breakfast foods cereals
In [5]:
df.shape
Out[5]:
(99982, 10)
In [6]:
df.columns
Out[6]:
Index(['productId', 'Title', 'userId', 'Helpfulness', 'Score', 'Time', 'Text',
       'Cat1', 'Cat2', 'Cat3'],
      dtype='object')

Of these 10 columns, we'll use only 3 for now:

  • Text - the product review text
  • Cat2 - the level 2 category label for this product
  • Cat3 - the level 3 category label for this product

There's a taxonomy (a hierarchical catalog) of all products with 3 categories (a.k.a. levels). Based on the review, we're going to classify the product into one of the level 2 categories (i.e. predict Cat2) and one of the level 3 categories (i.e. predict Cat3).

We're no longer interested in Cat1 because here we chose grocery only. So we have 16 Cat2 categories and 157 Cat3 categories.

In [7]:
df['Cat1'].unique()
Out[7]:
array([' grocery  gourmet food'], dtype=object)
In [8]:
df['Cat2'].value_counts()
Out[8]:
pantry staples                       27291
beverages                            23440
snack food                           12724
candy chocolate                      11433
breakfast foods                       6248
breads  bakery                        4240
cooking  baking supplies              2444
herbs                                 2069
gourmet gifts                         1939
fresh flowers  live indoor plants     1811
baby food                             1270
meat  poultry                         1268
meat  seafood                         1250
produce                               1196
sauces  dips                           845
dairy  eggs                            514
Name: Cat2, dtype: int64
In [9]:
df['Cat3'].nunique()
Out[9]:
157
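
To see how the two levels relate, we can also peek at a few distinct Cat2/Cat3 pairs and count how many level 3 subcategories sit under each level 2 category. A minimal sketch using the columns shown above:

In [ ]:
# a few distinct (Cat2, Cat3) pairs illustrating the two-level taxonomy
print(df[['Cat2', 'Cat3']].drop_duplicates().head(10))
# number of distinct Cat3 subcategories under each Cat2 category
print(df.groupby('Cat2')['Cat3'].nunique().sort_values(ascending=False))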

1. Simple text features

1.1. Bag of Words

The following explanation of Bag of Words and Tf-Idf is based on this notebook from our course mlcourse.ai.

The easiest way to convert text to features is called Bag of Words: we create a vector with the length of the vocabulary, compute the number of occurrences of each word in the text, and place that count in the appropriate position in the vector. This process looks even simpler in code:

In [10]:
texts = ['i have a cat', 
         'you have a dog', 
         'you and i have a cat and a dog']

vocabulary = list(enumerate(set([word for sentence in texts 
                                 for word in sentence.split()])))
print('Vocabulary:', vocabulary)

def vectorize(text): 
    vector = np.zeros(len(vocabulary)) 
    for i, word in vocabulary:
        num = 0 
        for w in text: 
            if w == word: 
                num += 1 
        if num: 
            vector[i] = num 
    return vector

print('Vectors:')
for sentence in texts: 
    print(vectorize(sentence.split()))
Vocabulary: [(0, 'i'), (1, 'dog'), (2, 'and'), (3, 'a'), (4, 'cat'), (5, 'you'), (6, 'have')]
Vectors:
[1. 0. 0. 1. 1. 0. 1.]
[0. 1. 0. 1. 0. 1. 1.]
[1. 1. 2. 2. 1. 1. 1.]
In [11]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
print('Feature matrix:\n {}'.format(vect.fit_transform(texts).toarray()))
print('Vocabulary')
pprint(vect.vocabulary_)
Feature matrix:
 [[0 1 0 1 0]
 [0 0 1 1 1]
 [2 1 1 1 1]]
Vocabulary
{'and': 0, 'cat': 1, 'dog': 2, 'have': 3, 'you': 4}
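
Note that sklearn's CountVectorizer drops the single-letter words 'i' and 'a' here: its default token_pattern keeps only tokens of two or more word characters, which is why the vocabulary shrank from 7 words to 5. If we wanted to keep them, we could relax the pattern (just a sketch, not needed later):

In [ ]:
# keep single-character tokens as well by relaxing the token pattern
vect_all = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
print(vect_all.fit_transform(texts).toarray())
pprint(vect_all.vocabulary_)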

When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts "i have no cows" and "no, i have cows" will appear identical after vectorization when, in fact, they have the opposite meaning. To avoid this problem, we can revisit our tokenization step and use N-grams (the sequence of N consecutive tokens) instead.

In [12]:
# the same but with bigrams
vect2 = CountVectorizer(ngram_range=(1, 2))
print('Feature matrix:\n {}'.format(vect2.fit_transform(texts).toarray()))
print('Vocabulary')
pprint(vect2.vocabulary_)
Feature matrix:
 [[0 0 0 1 0 0 1 1 0 0 0 0]
 [0 0 0 0 0 1 1 0 1 1 0 1]
 [2 1 1 1 1 1 1 1 0 1 1 0]]
Vocabulary
{'and': 0,
 'and dog': 1,
 'and have': 2,
 'cat': 3,
 'cat and': 4,
 'dog': 5,
 'have': 6,
 'have cat': 7,
 'have dog': 8,
 'you': 9,
 'you and': 10,
 'you have': 11}

1.2. Tf-Idf

Building on the Bag of Words idea: words that are rarely found in the corpus (i.e. in all the documents of this dataset) but are present in a particular document might be more important for that document. Then it makes sense to increase the weight of such domain-specific words to separate them from common words. This approach is called TF-IDF (term frequency–inverse document frequency); it can't be fully explained in a few lines, so look into the details in references such as this wiki. The default option is as follows:

$$ \large idf(t,D) = \log\frac{\mid D\mid}{df(t,D)+1} $$

$$ \large tfidf(t,d,D) = tf(t,d) \times idf(t,D) $$
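
As a quick illustration, we can apply sklearn's TfidfVectorizer to the same toy corpus. Its default settings use a smoothed idf and l2-normalize each row, so the exact numbers differ slightly from the plain formula above; this is just a sketch:

In [ ]:
# Tf-Idf on the toy corpus: rare words get higher weights than common ones
tfidf_toy = TfidfVectorizer()
print(tfidf_toy.fit_transform(texts).toarray().round(2))
pprint(tfidf_toy.vocabulary_)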

2. Simple text classification

For now, we'll only look at the 16 level 2 categories, i.e. we'll do 16-class classification with Tf-Idf vectorization and logistic regression. Here we resort to sklearn pipelines to chain the two steps.

In [13]:
# build bigrams, put a limit on maximal number of features
# and minimal word frequency
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1e2, n_jobs=4, solver='lbfgs', 
                           random_state=17, multi_class='multinomial',
                           verbose=1)
# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), 
                                 ('logit', logit)])

For now, we only use review text.

In [14]:
texts, y = df['Text'], df['Cat2']

We split the data into training and validation parts.

In [15]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(texts, y, random_state=17,
                         stratify=y, shuffle=True)
In [16]:
%%time
tfidf_logit_pipeline.fit(train_texts, y_train)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
CPU times: user 10.1 s, sys: 300 ms, total: 10.4 s
Wall time: 46.3 s
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:   36.0s finished
Out[16]:
Pipeline(memory=None,
     steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tru... penalty='l2', random_state=17, solver='lbfgs',
          tol=0.0001, verbose=1, warm_start=False))])
In [17]:
%%time
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
CPU times: user 2.42 s, sys: 7.79 ms, total: 2.43 s
Wall time: 2.43 s
In [18]:
accuracy_score(y_valid, valid_pred)
Out[18]:
0.7564810369659145
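
To put this accuracy in context, it's worth comparing it with the majority-class baseline, i.e. the share of the most frequent Cat2 category in the validation set (a quick sketch):

In [ ]:
# accuracy of always predicting the most frequent Cat2 category
y_valid.value_counts(normalize=True).max()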

3. Understanding the model

3.1. Confusion matrix

In [19]:
def plot_confusion_matrix(actual, predicted, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7,7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix
    # transpose so that rows correspond to predicted labels and columns to true labels
    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')
    
    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')
In [20]:
category2_classes = tfidf_logit_pipeline.named_steps['logit'].classes_
category2_classes
Out[20]:
array(['baby food', 'beverages', 'breads  bakery', 'breakfast foods',
       'candy chocolate', 'cooking  baking supplies', 'dairy  eggs',
       'fresh flowers  live indoor plants', 'gourmet gifts', 'herbs',
       'meat  poultry', 'meat  seafood', 'pantry staples', 'produce',
       'sauces  dips', 'snack food'], dtype=object)
In [21]:
plot_confusion_matrix(y_valid, valid_pred, 
                      category2_classes, figsize=(8, 8))

3.2. Visualizing coefficients

In [22]:
def visualize_coefficients(classifier_coefs, feature_names, 
                           n_top_features=25, title='Coefs', 
                           save_path=None):
    # get coefficients with large absolute values 
    coef = classifier_coefs.ravel()
    positive_coefficients = np.argsort(coef)[-n_top_features:]
    negative_coefficients = np.argsort(coef)[:n_top_features]
    interesting_coefficients = np.hstack([negative_coefficients, 
                                          positive_coefficients])
    # plot them
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue" 
              for c in coef[interesting_coefficients]]
    plt.bar(np.arange(2 * n_top_features), 
            coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * n_top_features), 
               feature_names[interesting_coefficients], 
               rotation=90, ha="right")
    plt.title(title);
    if save_path:
        plt.savefig(save_path, dpi=300);
In [23]:
visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[0, :], 
                       tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),
                      title=category2_classes[0])
In [24]:
visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[1, :], 
                       tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),
                      title=category2_classes[1])

3.3. ELI5 ("Explain Like I'm 5")

ELI5 (GitHub) is a Python package which helps to debug machine learning classifiers and explain their predictions. It supports scikit-learn, XGBoost, LightGBM and others.

In [27]:
# pip install eli5
import eli5
In [28]:
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])
Out[28]:
Top features per Cat2 category (per-class weights of the logistic regression):
y=baby food top features
Weight  Feature
+44.371 baby
+39.677 formula
+22.257 gerber
+16.179 similac
+16.148 this formula
+15.863 earth best
+15.532 cereal
+15.284 babies
+13.853 my baby
+13.091 daughter
+12.813 month old
+12.559 earth
+12.135 baby food
+11.852 food
+11.834 old
+11.623 toddler
+11.065 son
+10.936 months
+10.730 month
+10.385 child
… 10790 more positive …
… 39191 more negative …
y=beverages top features
Weight  Feature
+71.027 tea
+48.800 this tea
+40.395 drink
+34.729 teas
+34.680 pods
+31.821 coffee
+31.040 coconut water
+28.885 movie
+28.634 drinking
+27.084 zico
+26.962 chai
+26.894 hot chocolate
+25.221 soda
+23.568 senseo
+23.189 water
+22.381 espresso
… 20526 more positive …
… 29455 more negative …
-21.645 salt
-22.586 popcorn
-23.213 sauce
-28.835 eat
y=breads bakery top features
Weight  Feature
+25.276 cookies
+22.181 cookie
+20.022 cake
+19.781 fruitcake
+19.179 pocky
+18.776 biscotti
+16.200 breadsticks
+15.999 oreos
+15.511 pizza
+15.289 bread
+15.138 baklava
+14.979 cakes
+14.884 wafers
+14.655 wafer
+13.962 mallomars
+12.839 crust
+11.086 shells
+11.055 oreo
+10.833 wraps
… 17490 more positive …
… 32491 more negative …
-14.362 mix
y=breakfast foods top features
Weight  Feature
+34.379 cereal
+30.655 bars
+29.471 bar
+29.235 oatmeal
+28.824 granola
+22.415 breakfast
+21.553 cereals
+20.874 tarts
+20.341 pop tarts
+18.485 puffed
+18.082 oats
+17.246 blueberry
+16.701 pop
+16.352 these bars
+16.051 this cereal
+15.865 frosted
+15.545 toaster
+13.876 filling
+13.804 this bar
… 17193 more positive …
… 32788 more negative …
-15.753 tea
y=candy chocolate top features
Weight  Feature
+51.312 licorice
+47.510 gum
+37.105 mints
+34.616 candy
+32.241 altoids
+29.186 haribo
+27.022 candies
+25.932 chocolate
+24.702 bears
+21.744 gummi
+21.503 gummy
+19.630 bar
+19.497 gummies
+19.185 chocolates
+18.243 liquorice
+18.102 jelly
+15.880 this gum
+15.691 belly
… 19000 more positive …
… 30981 more negative …
-22.117 tea
-24.629 cookies
y=cooking baking supplies top features
Weight  Feature
+17.945 bread
+14.719 cake
+13.839 vanilla
+13.691 almonds
+13.063 syrup
+13.024 baking
+12.991 mincemeat
+12.832 mix
+12.764 flour
+12.728 muffins
+12.081 cocoa
+12.003 sugar
+11.901 nuts
+11.820 peanuts
+11.633 spoon
+11.012 pancakes
+10.687 salt
+10.680 wasabi
+10.302 splenda
+10.260 chocolate
… 16830 more positive …
… 33151 more negative …
y=dairy eggs top features
Weight  Feature
+28.332 cheese
+18.710 this cheese
+13.140 milk
+13.119 cheeses
+11.565 coffee
+11.039 creamer
+9.343 cream
+9.145 creamy
+8.321 blue
+7.280 cheese is
+7.036 butter
+6.992 egg
+6.720 creamers
+6.423 lurpak
+6.325 igourmet
+5.795 it
+5.710 ice
+5.665 coffee mate
+5.339 blue cheese
… 8247 more positive …
… 41734 more negative …
-7.579 these
y=fresh flowers live indoor plants top features
Weight  Feature
+45.917 plant
+42.770 tree
+38.306 bonsai
+32.808 plants
+31.475 herbs
+29.085 grow
+25.124 aerogarden
+24.322 flowers
+22.405 garden
+22.146 growing
+19.220 leaves
+18.608 kit
+18.017 the plant
+15.479 the tree
+14.450 seed
+14.393 pods
+13.557 basil
+13.258 weeks
+12.672 lettuce
… 10332 more positive …
… 39649 more negative …
-14.176 taste
y=gourmet gifts top features
Weight  Feature
+47.095 tea
+29.819 candy
+27.930 basket
+23.935 sushi
+21.573 the tea
+18.165 gift
+16.264 chocolates
+15.424 set
+14.591 flowering
+13.823 teapot
+13.475 this gift
+13.392 kit
+13.051 pot
+12.905 candies
+12.401 teas
+12.350 bamboo
+12.204 coffee
+11.800 hot
+11.167 flower
+10.973 fun
… 13033 more positive …
… 36948 more negative …
y=herbs top features
Weight  Feature
+21.926 popcorn
+21.711 salt
+21.467 beans
+20.799 seasoning
+15.792 peppercorns
+15.629 cinnamon
+15.521 vanilla
+14.036 spice
+14.001 pepper
+13.502 vanilla beans
+13.081 chili
+12.501 ginger
+12.035 spices
+11.083 seeds
+10.727 curry
+10.309 rub
+10.208 powder
+10.129 this salt
+10.093 used
… 14181 more positive …
… 35800 more negative …
-11.357 chocolate
y=meat poultry top features
Weight  Feature
+32.038 jerky
+21.365 slim
+14.512 snack
+14.437 sausage
+12.727 sticks
+12.551 jims
+12.551 slim jims
+12.185 meat
+11.911 chicken
+11.400 salty
+10.772 slim jim
+10.697 jim
+10.616 bacon
+10.613 salami
+10.374 sardines
+10.177 beef
+9.305 pate
+9.155 teriyaki
+8.716 duck
… 11580 more positive …
… 38401 more negative …
-9.532 chocolate
y=meat seafood top features
Weight  Feature
+44.230 sardines
+28.676 jerky
+23.535 tuna
+18.209 anchovies
+16.566 lobster
+14.608 salmon
+13.983 meat
+13.665 crab
+13.569 fish
+13.443 clams
+13.291 smoked
+12.194 kippers
+12.156 beef
+11.561 oysters
+11.532 can
+11.367 packed
+10.932 these sardines
+10.344 canned
+10.219 bones
+10.192 prince
… 12081 more positive …
… 37900 more negative …
y=pantry staples top features
Weight  Feature
+30.934 soup
+30.484 noodles
+22.171 pasta
+20.552 olives
+20.147 sauce
+19.369 seasoning
+17.525 splenda
+17.464 mac
+17.173 kraft
+16.808 cake
+16.366 beans
+16.338 dressing
+15.891 this soup
… 23590 more positive …
… 26391 more negative …
-15.874 cereal
-15.931 fruit
-16.100 licorice
-17.908 bar
-18.101 this tea
-23.085 tea
-26.895 jerky
y=produce top features
Weight  Feature
+43.953 cherries
+33.261 pumpkin
+27.775 dried
+21.389 seaweed
+16.916 fruit
+16.438 dried cherries
+14.810 tart
+12.109 canned
+11.898 these cherries
+11.486 cans
+11.283 dented
+9.935 snack
+9.902 truffles
+9.443 traverse
+9.200 plums
+9.066 mushrooms
+9.056 cherries are
+8.943 canned pumpkin
+8.856 apricots
… 11622 more positive …
… 38359 more negative …
-8.532 chocolate
y=sauces dips top features
Weight  Feature
+30.654 sauce
+13.080 salsa
+12.011 paste
+11.276 marinade
+11.229 hot
+11.223 sauces
+11.131 use
+10.508 marmite
+10.208 bottle
+9.444 gravy
+8.884 this sauce
+8.149 chicken
+8.089 thai
+7.877 use it
+7.637 tapatio
+6.823 spicy
+6.664 bottles
+6.491 curry
… 10636 more positive …
… 39345 more negative …
-8.777 they
-10.002 these
y=snack food top features
Weight  Feature
+38.187 popcorn
+34.106 chips
+29.419 pretzels
+27.025 jerky
+24.860 crackers
+22.564 cracker
+22.472 this popcorn
+21.880 cookies
+18.115 bloks
+17.453 cookie
+17.322 rice cakes
+16.863 chip
+16.522 pretzel
+16.463 raisins
+16.040 hummus
+15.033 sahale
+15.015 snack
+14.812 granola
… 20111 more positive …
… 29870 more negative …
-15.215 beans
-22.005 tea
In [29]:
train_texts[0], y_train[0]
Out[29]:
('While in Hungary we were given a recipe for Hungarian Goulash. It needs sweet paprika. This was terrific in that dish and others. I will purchase it again when I need more.',
 'herbs')
In [30]:
eli5.show_prediction(estimator=tfidf_logit_pipeline.named_steps['logit'],
                     vec=tfidf_logit_pipeline.named_steps['tf_idf'],
                     doc=train_texts[0])
Out[30]:

y=baby food (probability 0.000, score -2.327) top features

Contribution? Feature
-0.697 Highlighted in text (sum)
-1.630 <BIAS>

while in hungary we were given a recipe for hungarian goulash. it needs sweet paprika. this was terrific in that dish and others. i will purchase it again when i need more.

y=beverages (probability 0.000, score -1.968) top features

Contribution? Feature
+2.623 <BIAS>
-4.591 Highlighted in text (sum)

y=breads bakery (probability 0.001, score 0.055) top features

Contribution? Feature
+0.835 <BIAS>
-0.780 Highlighted in text (sum)

y=breakfast foods (probability 0.000, score -1.159) top features

Contribution? Feature
+0.279 <BIAS>
-1.438 Highlighted in text (sum)

y=candy chocolate (probability 0.001, score 0.212) top features

Contribution? Feature
+1.973 <BIAS>
-1.760 Highlighted in text (sum)

y=cooking baking supplies (probability 0.004, score 1.170) top features

Contribution? Feature
+0.860 <BIAS>
+0.310 Highlighted in text (sum)

y=dairy eggs (probability 0.001, score -0.053) top features

Contribution? Feature
+0.437 Highlighted in text (sum)
-0.491 <BIAS>

y=fresh flowers live indoor plants (probability 0.000, score -6.405) top features

Contribution? Feature
-0.509 Highlighted in text (sum)
-5.897 <BIAS>

y=gourmet gifts (probability 0.000, score -2.366) top features

Contribution? Feature
-0.537 <BIAS>
-1.828 Highlighted in text (sum)

y=herbs (probability 0.154, score 4.945) top features

Contribution? Feature
+4.215 Highlighted in text (sum)
+0.730 <BIAS>

y=meat poultry (probability 0.004, score 1.199) top features

Contribution? Feature
+1.214 Highlighted in text (sum)
-0.015 <BIAS>

y=meat seafood (probability 0.000, score -0.847) top features

Contribution? Feature
+0.606 Highlighted in text (sum)
-1.453 <BIAS>

y=pantry staples (probability 0.820, score 6.619) top features

Contribution? Feature
+3.366 Highlighted in text (sum)
+3.253 <BIAS>

y=produce (probability 0.013, score 2.489) top features

Contribution? Feature
+3.578 Highlighted in text (sum)
-1.089 <BIAS>

y=sauces dips (probability 0.001, score -0.255) top features

Contribution? Feature
+0.471 Highlighted in text (sum)
-0.726 <BIAS>

y=snack food (probability 0.000, score -1.309) top features

Contribution? Feature
+1.284 <BIAS>
-2.593 Highlighted in text (sum)

4. Hierarchical text classification

Now we are going to predict categories 2 and 3 at the same time. It's not straightforward to make the category 3 predictions consistent with the category 2 predictions. Example: if the model predicts "breakfast foods" as category 2, then it has to predict a subcategory of "breakfast foods" as category 3, for instance "cereals", but not "spices seasonings". Formally, this task is called hierarchical text classification. A simple way to guarantee consistency is to classify into combined "Cat2/Cat3" labels and then split the predictions back, which is what we do below.

In [31]:
# combine categories 2 and 3
df['Cat2_Cat3'] = df['Cat2'] + '/' + df['Cat3']
In [32]:
y_cat2_and_cat3 = df['Cat2_Cat3']
In [33]:
y_cat2_and_cat3.head()
Out[33]:
0    herbs/spices  seasonings
1     breakfast foods/cereals
2     breakfast foods/cereals
3     breakfast foods/cereals
4     breakfast foods/cereals
Name: Cat2_Cat3, dtype: object
In [34]:
train_texts, valid_texts, y_train_cat2_and_cat3, y_valid_cat2_and_cat3 = \
    train_test_split(texts, y_cat2_and_cat3, 
                     random_state=17,
                     stratify=y_cat2_and_cat3, 
                     shuffle=True)
In [35]:
%%time
tfidf_logit_pipeline.fit(train_texts, y_train_cat2_and_cat3)
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
CPU times: user 9.69 s, sys: 257 ms, total: 9.95 s
Wall time: 5min 29s
[Parallel(n_jobs=4)]: Done   1 out of   1 | elapsed:  5.3min finished
Out[35]:
Pipeline(memory=None,
     steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50000, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=Tru... penalty='l2', random_state=17, solver='lbfgs',
          tol=0.0001, verbose=1, warm_start=False))])
In [36]:
%%time
valid_pred_cat2_and_cat3 = tfidf_logit_pipeline.predict(valid_texts)
CPU times: user 2.52 s, sys: 36.2 ms, total: 2.56 s
Wall time: 2.57 s
In [37]:
cat2_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s: 
                                                      s.split('/')[0])
cat3_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s: 
                                                      s.split('/')[1])
In [38]:
y_valid_cat2 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s: 
                                                      s.split('/')[0])
y_valid_cat3 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s: 
                                                      s.split('/')[1])
In [39]:
accuracy_score(y_valid_cat3, cat3_pred)
Out[39]:
0.6370619299087854
In [40]:
accuracy_score(y_valid_cat2, cat2_pred)
Out[40]:
0.758801408225316
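
Because we classify into the combined 'Cat2/Cat3' labels, every predicted pair is one that occurs in the data, so the category 2 and category 3 predictions are consistent by construction. A minimal sanity check (just a sketch):

In [ ]:
# every predicted (Cat2, Cat3) pair should be a pair that exists in the taxonomy
valid_pairs = set(zip(df['Cat2'], df['Cat3']))
all(pair in valid_pairs for pair in zip(cat2_pred, cat3_pred))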

Links: