Right now, we train and then immediately use the vectorizer, feature selector, and classifier. In practice, however, this is one of the least likely scenarios: you generally train your classifier once and then want to reuse it as many times as you like. To do so, we need to persist the vectorizer, feature selector, and classifier.
In the last notebook, I showed that the Pipeline
structure gives a nice way to combine the vectorizer, feature selector, and classifier into a single component (for a sequential pipeline). Instead of serializing two, and possibly three, structures, we can let the pipeline
handle serialization and deserialization. Without loss of generality, everything I show in this notebook applies just as well to the independent, separate components of the system (namely the vectorizer, feature selector, and classifier). However, the pipeline
is the preferred way to persist your whole machine learning pipeline.
%matplotlib inline
import csv
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn import cross_validation
from sklearn import ensemble
from sklearn.feature_extraction import text
from sklearn import feature_extraction
from sklearn import feature_selection
from sklearn import linear_model
from sklearn import metrics
from sklearn import naive_bayes
from sklearn import pipeline
from sklearn import svm
from sklearn import tree
from sklearn import externals
_DATA_DIR = 'data'
_NYT_DATA_PATH = os.path.join(_DATA_DIR, 'nyt_title_data.csv')
_SERIALIZATION_DIR = 'serializations'
_SERIALIZED_PIPELINE_NAME = 'pipe.pickle'
_SERIALIZATION_PATH = os.path.join(_SERIALIZATION_DIR, _SERIALIZED_PIPELINE_NAME)
with open(_NYT_DATA_PATH) as nyt:
    nyt_data = []
    nyt_labels = []
    csv_reader = csv.reader(nyt)
    for line in csv_reader:
        nyt_labels.append(int(line[0]))
        nyt_data.append(line[1])
X = np.array(nyt_data)
y = np.array(nyt_labels)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)
vectorizer = text.TfidfVectorizer(min_df=2,
ngram_range=(1, 2),
stop_words='english',
strip_accents='unicode',
norm='l2')
pipe = pipeline.Pipeline([("vectorizer", vectorizer), ("svm", linear_model.RidgeClassifier())])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=2, ngram_range=(1, 2)...copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='auto', tol=0.001))])
Up to this point, everything is the same as in the previous notebook; there should be no surprises.
if not os.path.exists(_SERIALIZATION_DIR):
os.makedirs(_SERIALIZATION_DIR)
externals.joblib.dump(pipe, _SERIALIZATION_PATH)
['serializations/pipe.pickle', 'serializations/pipe.pickle_01.npy', 'serializations/pipe.pickle_02.npy', 'serializations/pipe.pickle_03.npy', 'serializations/pipe.pickle_04.npy', 'serializations/pipe.pickle_05.npy']
joblib.dump returns a list of filenames. Each individual numpy array contained in the pipeline object is serialized as a separate file on the filesystem. All of these files are required in the same folder when reloading the model with joblib.load.
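If juggling the sibling `.npy` files is inconvenient, joblib.dump also accepts a compress argument that writes everything into one self-contained file. A minimal sketch with a stand-in object rather than the fitted pipeline (note that newer joblib versions may write a single file even without compression):

```python
import os
import tempfile

import joblib  # newer scikit-learn versions ship joblib as a standalone package
import numpy as np

# A stand-in object holding several numpy arrays, mimicking a fitted
# pipeline (which also stores its learned parameters as arrays).
model_like = {"coef": np.arange(10.0), "idf": np.ones(5)}

path = os.path.join(tempfile.mkdtemp(), "pipe.pickle")

# compress trades a little CPU for a single file, which is much easier
# to copy to a deployment machine than a pickle plus its .npy siblings.
filenames = joblib.dump(model_like, path, compress=3)
print(filenames)  # a single filename when compression is on

restored = joblib.load(path)
print(np.array_equal(restored["coef"], model_like["coef"]))
```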
By this point, the serialization is complete and the model is ready for deployment. If we were not using Pipeline
, we would need at least two serializations, one for the vectorizer and one for the classifier (a feature selector, if used, would be a third). To deploy the model, let's deserialize the pipeline
in a very similar manner.
pipe = externals.joblib.load(_SERIALIZATION_PATH)
pipe
Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=2, ngram_range=(1, 2)...copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='auto', tol=0.001))])
We successfully persisted our pipeline and loaded it back into the namespace, ready to be applied to our test set.
Never, ever unpickle untrusted data!
Pickle
(the serialization method) that joblib
uses under the hood has well-known security vulnerabilities: unpickling a file can execute arbitrary code.
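One mitigation is to only ever load pickles you produced yourself and to verify their integrity before loading. A sketch, with hypothetical helper names, that checks a SHA-256 digest recorded at serialization time:

```python
import hashlib

def sha256_of(path):
    """Hash the file in chunks so large pickles are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_verified(path, expected_digest, loader):
    """Refuse to deserialize anything whose digest does not match the recorded one."""
    if sha256_of(path) != expected_digest:
        raise ValueError("refusing to load %s: digest mismatch" % path)
    return loader(path)

# Usage: record sha256_of(_SERIALIZATION_PATH) right after joblib.dump,
# store the digest out of band, and pass externals.joblib.load as the loader.
```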
Let's try the pipeline on the test dataset to see if it works. Note that we assign the predictions to y_pred rather than overwriting y_test, which still holds the true labels.
y_pred = pipe.predict(X_test)
y_pred
array([19, 19, 19, 19, 19, 16, 20, 16, 16, 16, 19, 19, 20, 19, 20, 15, 20, 16, 16, 12, 16, 16, 19, 16, 20, 20, 19, 20, 16, 16, 20, 16, 19, 19, 20, 19, 19, 19, 20, 20, 19, 19, 16, 29, 19, 12, 20, 29, 19, 19, 15, 19, 20, 20, 19, 12, 16, 19, 19, 12, 16, 19, 19, 16, 29, 20, 3, 15, 19, 12, 19, 3, 15, 3, 16, 19, 16, 29, 15, 19, 20, 19, 29, 19, 15, 19, 19, 16, 20, 16, 29, 16, 19, 19, 20, 19, 16, 16, 16, 19, 3, 20, 16, 16, 19, 16, 19, 16, 12, 16, 20, 19, 20, 20, 20, 12, 19, 19, 19, 29, 19, 16, 19, 19, 20, 15, 29, 19, 16, 20, 20, 16, 19, 19, 19, 19, 19, 3, 19, 19, 16, 15, 15, 19, 19, 19, 3, 3, 19, 20, 3, 3, 20, 19, 20, 3, 16, 20, 16, 16, 19, 20, 20, 20, 16, 19, 15, 16, 19, 20, 16, 19, 20, 12, 20, 19, 19, 19, 16, 19, 15, 29, 15, 3, 16, 19, 19, 16, 20, 19, 19, 19, 15, 16, 20, 19, 29, 19, 19, 29, 19, 29, 20, 12, 19, 29, 19, 19, 19, 19, 29, 19, 16, 16, 19, 20, 20, 19, 3, 20, 16, 3, 19, 16, 20, 19, 20, 20, 19, 16, 20, 19, 16, 20, 16, 20, 3, 19, 15, 16, 15, 19, 16, 20, 19, 20, 12, 19, 19, 20, 16, 19, 12, 16, 16, 15, 12, 19, 20, 19, 16, 20, 19, 19, 19, 19, 16, 12, 16, 19, 16, 16, 19, 20, 19, 19, 19, 20, 15, 20, 16, 19, 3, 16, 29, 19, 19, 20, 19, 12, 16, 29, 19, 20, 19, 20, 19, 19, 19, 16, 20, 20, 19, 16, 19, 20, 29, 16, 19, 16, 16, 16, 19, 19, 19, 3, 20, 20, 19, 3, 3, 20, 29, 16, 19, 19, 16, 19, 16, 19, 20, 19, 16, 19, 20, 19, 16, 15, 3, 20, 15, 19, 16, 15, 3, 12, 19, 19, 15, 20, 19, 3, 20, 16, 19, 16, 20, 20, 15, 16, 19, 16, 19, 20, 20, 12, 16, 19, 3, 29, 12, 19, 16, 19, 15, 20, 3, 16, 19, 19, 16, 19, 19, 15, 16, 19, 3, 20, 19, 19, 20, 20, 3, 19, 16, 19, 19, 12, 19, 16, 3, 16, 19, 20, 19, 19, 19, 16, 20, 19, 16, 19, 16, 19, 29, 16, 16, 19, 3, 16, 16, 16, 3, 29, 16, 20, 19, 16, 19, 12, 29, 16, 20, 15, 16, 20, 16, 20, 19, 19, 16, 19, 16, 20, 19, 19, 12, 19, 20, 16, 3, 20, 20, 3, 16, 15, 19, 3, 16, 3, 20, 20, 20, 20, 16, 16, 3, 19, 15, 12, 16, 16, 16, 19, 16, 19, 19, 16, 20, 20, 16, 19, 19, 19, 19, 19, 19, 16, 16, 3, 19, 20, 16, 3, 20, 16, 12, 19, 12, 16, 19, 19, 19, 19, 16, 20, 19, 
19, 20, 16, 16, 16, 3, 29, 12, 19, 16, 16, 16, 16, 19, 19, 15, 20, 12, 20, 19, 19, 20, 19, 16, 16, 15, 19, 19, 20, 20, 19, 20, 20, 16])
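With predictions in a separate variable, the true labels survive and the metrics module gives the usual scores. A toy sketch with hypothetical labels standing in for y_test and the pipeline's output:

```python
from sklearn import metrics

# Hypothetical true labels and predictions, drawn from the same label
# space as the NYT categories above.
y_true = [19, 19, 16, 20, 16, 19]
y_hat = [19, 16, 16, 20, 16, 19]

# Fraction of exact label matches: 5 of the 6 labels agree here.
accuracy = metrics.accuracy_score(y_true, y_hat)
print(accuracy)
```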
Pipeline
is not just useful as an abstraction; it also makes the system easier to maintain, persist, and deploy.
Use virtualenv
and note the version numbers of the libraries that you are using (see Notebook 0), since a model pickled under one library version may not load under another.
Some classifiers (SGDClassifier, Perceptron, MultinomialNB) provide a partial_fit
function for online learning. If you have incremental data with which you want to improve your classifier over time, you may want to persist your models and then use partial_fit
to improve them when you have new data. Works like a charm!
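A minimal sketch of that workflow (the titles and labels are made up; note the stateless HashingVectorizer, since a fitted TfidfVectorizer's vocabulary cannot be updated incrementally):

```python
import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction import text

# HashingVectorizer needs no fitting, so it pairs naturally with
# partial_fit; a TfidfVectorizer's learned vocabulary would go stale.
vectorizer = text.HashingVectorizer(n_features=2 ** 16)
clf = linear_model.SGDClassifier()

classes = np.array([0, 1])  # every class must be declared on the first call

# First batch: train once, then persist clf (e.g. with joblib.dump).
batch = ["stocks tumble on fed news", "striker signs new contract"]
clf.partial_fit(vectorizer.transform(batch), [0, 1], classes=classes)

# Later: load the persisted model and improve it with fresh data;
# subsequent calls to partial_fit update the existing weights.
new_batch = ["markets rally after earnings", "coach praises young squad"]
clf.partial_fit(vectorizer.transform(new_batch), [0, 1])

prediction = clf.predict(vectorizer.transform(["stocks rise sharply"]))[0]
```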