We'll be analyzing Amazon product reviews. We took a sample of 100k grocery reviews. The prepared zipped .csv file is here.
PATH_TO_DATA = '/home/yorko/Documents/data/amazon_reviews_sample100k_grocery.csv.zip'
# some necessary imports
import os
import pickle
import json
from pprint import pprint
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.read_csv(PATH_TO_DATA)
df.head()
df.shape
df.columns
Of these 10 columns we'll only use 3 for now: Text, Cat2, and Cat3.
There's a taxonomy (hierarchical catalog) of all products with 3 categories (a.k.a. levels). Based on the review, we're going to classify it into one of the level 2 categories (i.e. predict Cat2) and one of the level 3 categories (i.e. predict Cat3).
We're no longer interested in Cat1 because here we chose only grocery. So we have 16 Cat2 categories and 157 Cat3 categories.
df['Cat1'].unique()
df['Cat2'].value_counts()
df['Cat3'].nunique()
The following explanation of Bag of Words and Tf-Idf is based on this notebook from our course mlcourse.ai.
The easiest way to convert text to features is called Bag of Words: we create a vector with the length of the vocabulary, compute the number of occurrences of each word in the text, and place that number of occurrences in the appropriate position in the vector. The process described looks simpler in code:
texts = ['i have a cat',
         'you have a dog',
         'you and i have a cat and a dog']

vocabulary = list(enumerate(set([word for sentence in texts
                                 for word in sentence.split()])))
print('Vocabulary:', vocabulary)

def vectorize(text):
    vector = np.zeros(len(vocabulary))
    for i, word in vocabulary:
        num = 0
        for w in text:
            if w == word:
                num += 1
        if num:
            vector[i] = num
    return vector

print('Vectors:')
for sentence in texts:
    print(vectorize(sentence.split()))
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
print('Feature matrix:\n {}'.format(vect.fit_transform(texts).toarray()))
print('Vocabulary')
pprint(vect.vocabulary_)
When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts "i have no cows" and "no, i have cows" will appear identical after vectorization when, in fact, they have the opposite meaning. To avoid this problem, we can revisit our tokenization step and use N-grams (the sequence of N consecutive tokens) instead.
# the same but with bigrams
vect2 = CountVectorizer(ngram_range=(1, 2))
print('Feature matrix:\n {}'.format(vect2.fit_transform(texts).toarray()))
print('Vocabulary')
pprint(vect2.vocabulary_)
Adding onto the Bag of Words idea: words that are rarely found in the corpus (in all the documents of this dataset) but are present in this particular document might be more important. Then it makes sense to increase the weight of more domain-specific words to separate them out from common words. This approach is called TF-IDF (term frequency-inverse document frequency), which cannot be written in a few lines, so you should look into the details in references such as this wiki. The default option is as follows:
$$ \large idf(t,D) = \log\frac{\mid D\mid}{df(d,t)+1} $$

$$ \large tfidf(t,d,D) = tf(t,d) \times idf(t,D) $$

For now, we'll only take a look at the 16 level 2 categories. We'll be doing 16-class classification with logistic regression and Tf-Idf vectorization. Here we resort to sklearn pipelines.
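Before building the pipeline on the real reviews, here is a minimal sketch (not part of the original notebook; toy_tfidf is just an illustrative name) of what TfidfVectorizer produces on the toy texts from the Bag of Words example above, i.e. Tf-Idf weights instead of raw counts:

# illustrative only: Tf-Idf weights for the toy texts instead of raw counts
toy_tfidf = TfidfVectorizer()
print('Tf-Idf feature matrix:\n {}'.format(
    np.round(toy_tfidf.fit_transform(texts).toarray(), 2)))
print('Vocabulary')
pprint(toy_tfidf.vocabulary_)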
# build bigrams, put a limit on maximal number of features
# and minimal word frequency
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
# multinomial logistic regression a.k.a softmax classifier
logit = LogisticRegression(C=1e2, n_jobs=4, solver='lbfgs',
                           random_state=17, multi_class='multinomial',
                           verbose=1)

# sklearn's pipeline
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
                                 ('logit', logit)])
For now, we only use review text.
texts, y = df['Text'], df['Cat2']
We split data into training and validation parts.
train_texts, valid_texts, y_train, y_valid = \
    train_test_split(texts, y, random_state=17,
                     stratify=y, shuffle=True)
%%time
tfidf_logit_pipeline.fit(train_texts, y_train)
%%time
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)
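As a rough sanity check of this number, one can compare it to a trivial baseline that always predicts the most frequent Cat2 class (an illustrative sketch, not part of the original notebook):

# illustrative baseline: always predict the most frequent Cat2 class
most_frequent_class = y_train.value_counts().idxmax()
print('Majority-class baseline accuracy:',
      (y_valid == most_frequent_class).mean())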
def plot_confusion_matrix(actual, predicted, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7, 7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')

    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')
category2_classes = tfidf_logit_pipeline.named_steps['logit'].classes_
category2_classes
plot_confusion_matrix(y_valid, valid_pred,
category2_classes, figsize=(8, 8))
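Raw counts can be hard to compare across classes of different sizes; the same helper also supports normalized values via its normalize flag (a usage example, not from the original notebook):

plot_confusion_matrix(y_valid, valid_pred, category2_classes,
                      normalize=True, figsize=(8, 8))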
def visualize_coefficients(classifier_coefs, feature_names,
                           n_top_features=25, title='Coefs',
                           save_path=None):
    # get coefficients with large absolute values
    coef = classifier_coefs.ravel()
    positive_coefficients = np.argsort(coef)[-n_top_features:]
    negative_coefficients = np.argsort(coef)[:n_top_features]
    interesting_coefficients = np.hstack([negative_coefficients,
                                          positive_coefficients])
    # plot them
    plt.figure(figsize=(15, 5))
    colors = ["red" if c < 0 else "blue"
              for c in coef[interesting_coefficients]]
    plt.bar(np.arange(2 * n_top_features),
            coef[interesting_coefficients], color=colors)
    feature_names = np.array(feature_names)
    plt.xticks(np.arange(1, 1 + 2 * n_top_features),
               feature_names[interesting_coefficients],
               rotation=90, ha="right")
    plt.title(title);
    if save_path:
        plt.savefig(save_path, dpi=300);
visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[0, :],
                       tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),
                       title=category2_classes[0])

visualize_coefficients(tfidf_logit_pipeline.named_steps['logit'].coef_[1, :],
                       tfidf_logit_pipeline.named_steps['tf_idf'].get_feature_names(),
                       title=category2_classes[1])
# pip install eli5
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])
train_texts[0], y_train[0]
eli5.show_prediction(estimator=tfidf_logit_pipeline.named_steps['logit'],
                     vec=tfidf_logit_pipeline.named_steps['tf_idf'],
                     doc=train_texts[0])
Now we are going to predict categories 2 and 3 at the same time. It's not straightforward to make category 3 predictions consistent with category 2 predictions. Example: if the model predicts "breakfast foods" as category 2, then it should predict a subcategory of "breakfast foods" as category 3, for instance, "cereals", but not "spices seasonings". Formally, this is called hierarchical text classification. The simple route taken below is to predict the combined Cat2/Cat3 label, so every prediction corresponds to a category pair that occurs in the data.
# combine categories 2 and 3
df['Cat2_Cat3'] = df['Cat2'] + '/' + df['Cat3']
y_cat2_and_cat3 = df['Cat2_Cat3']
y_cat2_and_cat3.head()
train_texts, valid_texts, y_train_cat2_and_cat3, y_valid_cat2_and_cat3 = \
    train_test_split(texts, y_cat2_and_cat3,
                     random_state=17,
                     stratify=y_cat2_and_cat3,
                     shuffle=True)
%%time
tfidf_logit_pipeline.fit(train_texts, y_train_cat2_and_cat3)
%%time
valid_pred_cat2_and_cat3 = tfidf_logit_pipeline.predict(valid_texts)
cat2_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s: s.split('/')[0])
cat3_pred = pd.Series(valid_pred_cat2_and_cat3).apply(lambda s: s.split('/')[1])

y_valid_cat2 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s: s.split('/')[0])
y_valid_cat3 = pd.Series(y_valid_cat2_and_cat3).apply(lambda s: s.split('/')[1])
accuracy_score(y_valid_cat3, cat3_pred)
accuracy_score(y_valid_cat2, cat2_pred)
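Since we predict the combined label, each prediction splits back into a (Cat2, Cat3) pair that was observed in the data, so the predicted Cat3 is consistent with the predicted Cat2 by construction (assuming the category names themselves contain no '/'). A quick way to verify this (an illustrative sketch, not part of the original notebook):

# illustrative check: every predicted (Cat2, Cat3) pair should be a pair
# that actually occurs in the catalog
valid_pairs = set(zip(df['Cat2'], df['Cat3']))
predicted_pairs = set(zip(cat2_pred, cat3_pred))
print('All predicted pairs are valid:', predicted_pairs <= valid_pairs)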