Right now, we train and then immediately use the vectorizer, feature selector, and classifier. In practice, however, this is one of the least likely scenarios: you generally train your classifier once and then want to reuse it as many times as you like. To do so, we need to persist the vectorizer, feature selector, and classifier.
In the last notebook, I showed that the Pipeline
structure gives a nice way to combine the vectorizer, feature selector, and classifier into a single component (for a sequential pipeline). Instead of serializing two, and possibly three, structures, we can let the pipeline
handle serialization and deserialization. Without loss of generality, everything I show in this notebook applies just as well to the independent, separate components of the system (namely the vectorizer, feature selector, and classifier). However, the pipeline
is the preferred way to persist your whole machine learning pipeline.
%matplotlib inline
import csv
import matplotlib.pyplot as plt
import numpy as np
import os
from sklearn import cross_validation
from sklearn import ensemble
from sklearn.feature_extraction import text
from sklearn import feature_extraction
from sklearn import feature_selection
from sklearn import linear_model
from sklearn import metrics
from sklearn import naive_bayes
from sklearn import pipeline
from sklearn import svm
from sklearn import tree
from sklearn import externals
_DATA_DIR = 'data'
_NYT_DATA_PATH = os.path.join(_DATA_DIR, 'nyt_title_data.csv')
_SERIALIZATION_DIR = 'serializations'
_SERIALIZED_PIPELINE_NAME = 'pipe.pickle'
_SERIALIZATION_PATH = os.path.join(_SERIALIZATION_DIR, _SERIALIZED_PIPELINE_NAME)
with open(_NYT_DATA_PATH) as nyt:
    nyt_data = []
    nyt_labels = []
    csv_reader = csv.reader(nyt)
    for line in csv_reader:
        nyt_labels.append(int(line[0]))
        nyt_data.append(line[1])
X = np.array(nyt_data)
y = np.array(nyt_labels)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)
vectorizer = text.TfidfVectorizer(min_df=2,
ngram_range=(1, 2),
stop_words='english',
strip_accents='unicode',
norm='l2')
pipe = pipeline.Pipeline([("vectorizer", vectorizer), ("svm", linear_model.RidgeClassifier())])
pipe.fit(X_train, y_train)
Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=2, ngram_range=(1, 2)...copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='auto', tol=0.001))])
Up to this point, everything is the same as in the previous notebook; there should be no surprises.
if not os.path.exists(_SERIALIZATION_DIR):
os.makedirs(_SERIALIZATION_DIR)
externals.joblib.dump(pipe, _SERIALIZATION_PATH)
['serializations/pipe.pickle', 'serializations/pipe.pickle_01.npy', 'serializations/pipe.pickle_02.npy', 'serializations/pipe.pickle_03.npy', 'serializations/pipe.pickle_04.npy', 'serializations/pipe.pickle_05.npy']
joblib.dump returns a list of filenames. Each individual numpy array contained in the pipeline object is serialized as a separate file on the filesystem. All of these files are required in the same folder when reloading the model with joblib.load.
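If juggling the sibling `.npy` files is inconvenient, joblib.dump also accepts a compress argument that writes everything into one self-contained file. A minimal sketch with a stand-in object rather than the fitted pipeline (note that newer joblib versions may write a single file even without compression):

```python
import os
import tempfile

import joblib  # newer scikit-learn versions ship joblib as a standalone package
import numpy as np

# A stand-in object holding several numpy arrays, mimicking a fitted
# pipeline (which also stores its learned parameters as arrays).
model_like = {"coef": np.arange(10.0), "idf": np.ones(5)}

path = os.path.join(tempfile.mkdtemp(), "pipe.pickle")

# compress trades a little CPU for a single file, which is much easier
# to copy to a deployment machine than a pickle plus its .npy siblings.
filenames = joblib.dump(model_like, path, compress=3)
print(filenames)  # a single filename when compression is on

restored = joblib.load(path)
print(np.array_equal(restored["coef"], model_like["coef"]))
```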
By this point, the serialization is complete and the model is ready for deployment. If we were not using Pipeline
, we would need at least two serializations, one for the vectorizer and one for the classifier (a feature selector, if used, would be a third). To deploy the model, let's deserialize the pipeline
in a very similar manner.
pipe = externals.joblib.load(_SERIALIZATION_PATH)
pipe
Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=2, ngram_range=(1, 2)...copy_X=True, fit_intercept=True, max_iter=None, normalize=False, solver='auto', tol=0.001))])
We successfully persisted our pipeline and loaded it back into the namespace, ready to be applied to our test set.
Never, ever unpickle untrusted data!
Pickle
(the serialization method) that joblib
uses under the hood has well-known security vulnerabilities: unpickling a file can execute arbitrary code.
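One mitigation is to only ever load pickles you produced yourself and to verify their integrity before loading. A sketch, with hypothetical helper names, that checks a SHA-256 digest recorded at serialization time:

```python
import hashlib

def sha256_of(path):
    """Hash the file in chunks so large pickles are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_verified(path, expected_digest, loader):
    """Refuse to deserialize anything whose digest does not match the recorded one."""
    if sha256_of(path) != expected_digest:
        raise ValueError("refusing to load %s: digest mismatch" % path)
    return loader(path)

# Usage: record sha256_of(_SERIALIZATION_PATH) right after joblib.dump,
# store the digest out of band, and pass externals.joblib.load as the loader.
```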
Let's try the pipeline on the test dataset to see if it works. Note that we assign the predictions to y_pred rather than overwriting y_test, which still holds the true labels.
y_pred = pipe.predict(X_test)
y_pred
array([19, 19, 19, 19, 19, 16, 20, 16, 16, 16, 19, 19, 20, 19, 20, 15, 20, 16, 16, 12, 16, 16, 19, 16, 20, 20, 19, 20, 16, 16, 20, 16, 19, 19, 20, 19, 19, 19, 20, 20, 19, 19, 16, 29, 19, 12, 20, 29, 19, 19, 15, 19, 20, 20, 19, 12, 16, 19, 19, 12, 16, 19, 19, 16, 29, 20, 3, 15, 19, 12, 19, 3, 15, 3, 16, 19, 16, 29, 15, 19, 20, 19, 29, 19, 15, 19, 19, 16, 20, 16, 29, 16, 19, 19, 20, 19, 16, 16, 16, 19, 3, 20, 16, 16, 19, 16, 19, 16, 12, 16, 20, 19, 20, 20, 20, 12, 19, 19, 19, 29, 19, 16, 19, 19, 20, 15, 29, 19, 16, 20, 20, 16, 19, 19, 19, 19, 19, 3, 19, 19, 16, 15, 15, 19, 19, 19, 3, 3, 19, 20, 3, 3, 20, 19, 20, 3, 16, 20, 16, 16, 19, 20, 20, 20, 16, 19, 15, 16, 19, 20, 16, 19, 20, 12, 20, 19, 19, 19, 16, 19, 15, 29, 15, 3, 16, 19, 19, 16, 20, 19, 19, 19, 15, 16, 20, 19, 29, 19, 19, 29, 19, 29, 20, 12, 19, 29, 19, 19, 19, 19, 29, 19, 16, 16, 19, 20, 20, 19, 3, 20, 16, 3, 19, 16, 20, 19, 20, 20, 19, 16, 20, 19, 16, 20, 16, 20, 3, 19, 15, 16, 15, 19, 16, 20, 19, 20, 12, 19, 19, 20, 16, 19, 12, 16, 16, 15, 12, 19, 20, 19, 16, 20, 19, 19, 19, 19, 16, 12, 16, 19, 16, 16, 19, 20, 19, 19, 19, 20, 15, 20, 16, 19, 3, 16, 29, 19, 19, 20, 19, 12, 16, 29, 19, 20, 19, 20, 19, 19, 19, 16, 20, 20, 19, 16, 19, 20, 29, 16, 19, 16, 16, 16, 19, 19, 19, 3, 20, 20, 19, 3, 3, 20, 29, 16, 19, 19, 16, 19, 16, 19, 20, 19, 16, 19, 20, 19, 16, 15, 3, 20, 15, 19, 16, 15, 3, 12, 19, 19, 15, 20, 19, 3, 20, 16, 19, 16, 20, 20, 15, 16, 19, 16, 19, 20, 20, 12, 16, 19, 3, 29, 12, 19, 16, 19, 15, 20, 3, 16, 19, 19, 16, 19, 19, 15, 16, 19, 3, 20, 19, 19, 20, 20, 3, 19, 16, 19, 19, 12, 19, 16, 3, 16, 19, 20, 19, 19, 19, 16, 20, 19, 16, 19, 16, 19, 29, 16, 16, 19, 3, 16, 16, 16, 3, 29, 16, 20, 19, 16, 19, 12, 29, 16, 20, 15, 16, 20, 16, 20, 19, 19, 16, 19, 16, 20, 19, 19, 12, 19, 20, 16, 3, 20, 20, 3, 16, 15, 19, 3, 16, 3, 20, 20, 20, 20, 16, 16, 3, 19, 15, 12, 16, 16, 16, 19, 16, 19, 19, 16, 20, 20, 16, 19, 19, 19, 19, 19, 19, 16, 16, 3, 19, 20, 16, 3, 20, 16, 12, 19, 12, 16, 19, 19, 19, 19, 16, 20, 19, 
19, 20, 16, 16, 16, 3, 29, 12, 19, 16, 16, 16, 16, 19, 19, 15, 20, 12, 20, 19, 19, 20, 19, 16, 16, 15, 19, 19, 20, 20, 19, 20, 20, 16])
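With predictions in a separate variable, the true labels survive and the metrics module gives the usual scores. A toy sketch with hypothetical labels standing in for y_test and the pipeline's output:

```python
from sklearn import metrics

# Hypothetical true labels and predictions, drawn from the same label
# space as the NYT categories above.
y_true = [19, 19, 16, 20, 16, 19]
y_hat = [19, 16, 16, 20, 16, 19]

# Fraction of exact label matches: 5 of the 6 labels agree here.
accuracy = metrics.accuracy_score(y_true, y_hat)
print(accuracy)
```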
Pipeline
is not just useful as an abstraction; it also makes the system easier to maintain, persist, and deploy.
Use virtualenv
and note the version numbers of the libraries that you are using (see Notebook 0), since a model pickled under one library version may not load under another.
Some classifiers (SGDClassifier, Perceptron, MultinomialNB) provide a partial_fit
function for online learning. If you have incremental data with which you want to improve your classifier over time, you may want to persist your models and then use partial_fit
to improve them when you have new data. Works like a charm!
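A minimal sketch of that workflow (the titles and labels are made up; note the stateless HashingVectorizer, since a fitted TfidfVectorizer's vocabulary cannot be updated incrementally):

```python
import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction import text

# HashingVectorizer needs no fitting, so it pairs naturally with
# partial_fit; a TfidfVectorizer's learned vocabulary would go stale.
vectorizer = text.HashingVectorizer(n_features=2 ** 16)
clf = linear_model.SGDClassifier()

classes = np.array([0, 1])  # every class must be declared on the first call

# First batch: train once, then persist clf (e.g. with joblib.dump).
batch = ["stocks tumble on fed news", "striker signs new contract"]
clf.partial_fit(vectorizer.transform(batch), [0, 1], classes=classes)

# Later: load the persisted model and improve it with fresh data;
# subsequent calls to partial_fit update the existing weights.
new_batch = ["markets rally after earnings", "coach praises young squad"]
clf.partial_fit(vectorizer.transform(new_batch), [0, 1])

prediction = clf.predict(vectorizer.transform(["stocks rise sharply"]))[0]
```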