An intro to Applied Machine Learning in Python

Pierre-Luc Bacon

How to Python at McGill¶

Python is great because it has so many packages. However, you will quickly discover that installing packages without administrative privileges can become challenging. There is virtualenv that lets you install packages locally but it can be a nightmare when you have C librairies to build.

If you want to do scientific computing with Python, I advise you to go with the Anaconda Python distribution. It will save you (and the SOCS sysadmins) a great deal of time. To install Anaconda on Linux for 64 bits machines, it would look like:

In [1]:

%%bash
wget http://repo.continuum.io/archive/Anaconda-2.0.1-Linux-x86_64.sh
chmod +x Anaconda-2.0.1-Linux-x86_64.sh
./Anaconda-2.0.1-Linux-x86_64.sh
conda update conda
conda update ipython

IPython notebook can then be started from the command line with ipython notebook --pylab=inline in a directory where you want to store your notebooks.

The magic of IPython¶

If you miss the interactive style of the Matlab prompt, it's a good idea to use the IPython shell or the web-based IPython notebook. Notebooks embody the idea of literate programming interleaving text, math rendering, executable code, plots and other forms of medias. It's also a great way to embrace reproducibility ! Download this notebook and see for yourself.

Although IPython has the python suffix in its name, its modular design allows for a great deal of interoperability with other languages. For example, you can call bash through the so-called cell magics. For example:

In [2]:

%pwd

Out[2]:

u'/Users/pierrelucbacon/workspace/notebooks'

In order to capture the output of a bash cell into Python variables, you could also do:

In [3]:

%%bash --out output --err error
echo $RANDOM

In [4]:

print output

You can obtain the list of supported cell magics through %lsmagic.

In [5]:

%lsmagic

Out[5]:

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %install_default_config  %install_ext  %install_profiles  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

Automagic is ON, % prefix IS NOT needed for line magics.

Another useful cell magic is also %timeit, giving you an estimate of the time spent executing a given function. For example:

In [6]:

%timeit np.linalg.eigvals(np.random.rand(50,50))

1000 loops, best of 3: 1.12 ms per loop

You can learn more about various cell magics in this notebook

Talking to R¶

If you have R installed, IPython is able to execute a cell in R and return the result to Python. This can be useful if you find that a particular library was written only for R while your project uses mainly Python.

In [7]:

%%bash
pip install rpy2

In [8]:

%load_ext rpy2.ipython

In [9]:

x = np.linspace(0,10,50)
y = 0.9*x + np.random.randn(50)
plt.plot(x,y,'.')

Out[9]:

[<matplotlib.lines.Line2D at 0x10b355dd0>]

In [10]:

%Rpush x y
%R lm(y~x)$coef

Out[10]:

<FloatVector - Python:0x10b17dfc8 / R:0x102e54240>
[0.412314, 0.830338]

You can learn more about the rmagic in this notebook.

Talking to Matlab¶

The pymatbridge module and cell magic allows you to spawn matlab in the background and feed it commands from IPython. This extension is not part of the core IPython and you will also need to install it manually.

In [11]:

%load_ext pymatbridge

Starting MATLAB on http://localhost:53704
 visit http://localhost:53704/exit.m to shut down same
.....MATLAB started and connected!

In [12]:

%%matlab 
x = linspace(0.01,6*pi,100);
plot(sin(x))

Numpy basics¶

Numpy is a Python library which provides matrix manipulation capabilities and other standard numerical algorithms. You can load it under the np namespace with:

In [13]:

import numpy as np

Creating arrays¶

You can create a vector from an existing tuple:

In [14]:

x = np.array([1,2,3,4])
print x
print x.shape

[1 2 3 4]
(4,)

Similarly, you can define a numpy matrix as:

In [15]:

X = np.array([[1,2], [3,4]])
print X
print X.shape

[[1 2]
 [3 4]]
(2, 2)

You can also easily generate points on an interval through linspace or logspace:

In [16]:

x = np.linspace(0, 10, 25)
print x

[  0.           0.41666667   0.83333333   1.25         1.66666667
   2.08333333   2.5          2.91666667   3.33333333   3.75         4.16666667
   4.58333333   5.           5.41666667   5.83333333   6.25         6.66666667
   7.08333333   7.5          7.91666667   8.33333333   8.75         9.16666667
   9.58333333  10.        ]

The standards matrices can be generated with:

In [17]:

np.zeros((5,5))

Out[17]:

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [18]:

np.ones((5,5))

Out[18]:

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [19]:

np.eye(5)

Out[19]:

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

Indexing¶

As opposed to Matlab, Numpy is 0-indexed. Indexing can be performed efficiently through slicing.

In [20]:

X = np.random.randn(4,4)
print X

[[-0.3391406   0.62073092 -0.2864898   1.05243832]
 [ 0.92636541  0.78939298  0.00748023  0.8718698 ]
 [ 0.01819851 -0.15747068 -0.07151959  0.58040635]
 [-1.76144683  0.1659408  -0.9622996  -0.65804837]]

Selecting all rows except the first and all columns:

In [21]:

X[1:,:]

Out[21]:

array([[ 0.92636541,  0.78939298,  0.00748023,  0.8718698 ],
       [ 0.01819851, -0.15747068, -0.07151959,  0.58040635],
       [-1.76144683,  0.1659408 , -0.9622996 , -0.65804837]])

Select columns 2 and 0 and all rows:

In [22]:

X[:,[2,0]]

Out[22]:

array([[-0.2864898 , -0.3391406 ],
       [ 0.00748023,  0.92636541],
       [-0.07151959,  0.01819851],
       [-0.9622996 , -1.76144683]])

-1 (minus one) stands for the last element in a given dimension

In [23]:

X[-1,:]

Out[23]:

array([-1.76144683,  0.1659408 , -0.9622996 , -0.65804837])

A powerful indexing feature can also be achieved through logical indexing. You can create a boolean array from an ndarray:

In [24]:

X < 0.5

Out[24]:

array([[ True, False,  True, False],
       [False, False,  True, False],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]], dtype=bool)

The boolean ndarray can then be used for indexing:

In [25]:

X[X < 0.5]

Out[25]:

array([-0.3391406 , -0.2864898 ,  0.00748023,  0.01819851, -0.15747068,
       -0.07151959, -1.76144683,  0.1659408 , -0.9622996 , -0.65804837])

Numpy for Matlab users¶

If you know Matlab already, make sure to look at the cheatsheet Numpy for matlab users.

Numpy I/O¶

You can import and export data from a variety of formats. For example, you can save a numpy array to a text format using np.savetxt.

In [26]:

X = np.random.normal(size=(5,5))
np.savetxt('data.txt', X)

In [27]:

%cat data.txt

8.156678852559566817e-01 -3.131124534131186632e-01 8.337257398597310853e-01 9.214720052339393508e-01 1.085875976924086217e+00
-5.035982362246711475e-01 -1.786739101987857847e-01 -8.520783372461552263e-01 2.078405838941234929e-01 1.096214780850903514e+00
-6.567926252898417250e-01 -7.454343453128293717e-01 -1.143679949305732579e+00 3.589055086963070518e-01 1.382764023257532227e+00
4.030407399756600895e-01 -4.347724112286717180e-01 -3.709554624484467933e-02 -1.868833136961473906e+00 1.845541086337965653e-01
-9.785313966063215185e-02 8.348666702977869392e-01 -1.781600107757391305e+00 -2.223242216704917340e-01 1.715788698981550509e+00

In [28]:

X = np.loadtxt('data.txt')

Unless you want your data files to be human readable in text form, it is usually preferable to use http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html rather than np.savetxt. The array will then be saved in a more compact Numpy (.npy) format.

Dealing with matfiles¶

If you can export Matlab matrices in versions v4 (Level 1.0), v6 or v7 to 7.2, then you would be able to use scipy.io.loadmat to load your data in numpy.

In [29]:

%%matlab
x = magic(4);
save('test.mat', 'x')

In [30]:

from scipy.io import loadmat
print loadmat('test.mat')

{'x': array([[16,  2,  3, 13],
       [ 5, 11, 10,  8],
       [ 9,  7,  6, 12],
       [ 4, 14, 15,  1]], dtype=uint8), '__version__': '1.0', '__header__': 'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Mon Sep 22 23:21:18 2014', '__globals__': []}

The corresponding scipy.io.savemat is also available.

Scikit-learn¶

Scikit-learn offers a variety of classification, regression and clustering algorithms under a unified interface. If you decided to install Anaconda, it should already be available. Standard datasets are shipped with Scikit. They can be quite useful for debugging your algorithm. We will load the iris dataset with:

In [31]:

from sklearn import linear_model, datasets

iris = datasets.load_iris()
X = iris.data[:,:]
Y = iris.target

Scikit-learn provides different cross-validation techniques. In the following StratifiedKFold is an iterator which outputs pairs of indices for the train and test instances. You then use these indices to slice your dataset.

In [32]:

from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import mean_squared_error

skf = StratifiedKFold(Y, 3)
logreg = linear_model.LogisticRegression(C=1e5)

for train, test in skf:
    ypred = logreg.fit(X[train], Y[train]).predict(X[test])
    print mean_squared_error(Y[test], ypred)

0.0196078431373
0.117647058824
0.0

The above loop can also be written more succinctly using cross_val_score. By specifying n_jobs=k, you can compute the cross-validation loop in parallel in k processes. If k=-1, all CPUs are used.

In [33]:

from sklearn.cross_validation import cross_val_score
cross_val_score(logreg, X, Y, cv=skf, scoring='mean_squared_error', n_jobs=-1)

Out[33]:

array([-0.01960784, -0.11764706, -0.        ])

Model selection¶

Using cross-validated scores, you can perform model selection efficiently through GridSearchCV. In the following, we will load the digits dataset and performe grid search over the parameters of an SVM classifier.

In [34]:

from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

digits = datasets.load_digits()
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

GridSearchCV takes an estimator (regression, classification or clustering), a dictionary of parameters to try for that particular estimator, a cross-validation iterator and a scoring function. Once again, the parameter search can be performed in parallel with n_jobs=-1. The fit function of GridSearchCV then scores every possible combination of the tuned_parameters and saves the best one in the best_estimator_ attribute.

In [35]:

skf = StratifiedKFold(Y, 5)
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=skf, n_jobs=-1)
clf.fit(X_train, y_train)
print clf.best_estimator_
print "Grid scores on train set:"
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r"%(mean_score, scores.std() / 2, params))

SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.0001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
Grid scores on train set:
0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}
0.767 (+/-0.018) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}
0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}
0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}
0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}
0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}
0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}
0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}
0.940 (+/-0.012) for {'kernel': 'linear', 'C': 1}
0.940 (+/-0.012) for {'kernel': 'linear', 'C': 10}
0.940 (+/-0.012) for {'kernel': 'linear', 'C': 100}
0.940 (+/-0.012) for {'kernel': 'linear', 'C': 1000}

The classification performance of the best estimator can then be reported with classification_report.

In [36]:

y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        89
          1       0.96      1.00      0.98        90
          2       1.00      0.99      0.99        92
          3       0.99      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.97      0.97      0.97       108
          6       0.99      0.99      0.99        89
          7       0.99      1.00      0.99        78
          8       0.99      0.96      0.97        92
          9       0.99      0.98      0.98        92

avg / total       0.99      0.99      0.99       899

Dealing with large parameter sets¶

It might happen that the set of parameters over which you would hope to run GridSearchCV would be too large. We could then decide to put a prior or the parameters and perform a stochastic search. RandomizedSearchCV supports any of the standard distribution from scipy.stats as priors. In this example, we will set an exponential priors over the C and gamma parameters.

In [37]:

import scipy.stats
from sklearn.grid_search import RandomizedSearchCV

tuned_parameters = {'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
  'kernel': ['rbf'], 'class_weight':['auto', None]}

clf = RandomizedSearchCV(SVC(C=1), tuned_parameters, n_iter=10, n_jobs=-1)
clf.fit(X_train, y_train)
print clf.best_estimator_
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))

SVC(C=77.5702758756, cache_size=200, class_weight='auto', coef0=0.0, degree=3,
  gamma=0.00989451451095, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
             precision    recall  f1-score   support

          0       1.00      0.78      0.87        89
          1       1.00      0.63      0.78        90
          2       1.00      0.57      0.72        92
          3       1.00      0.72      0.84        93
          4       0.19      1.00      0.32        76
          5       1.00      0.27      0.42       108
          6       1.00      0.78      0.87        89
          7       1.00      0.91      0.95        78
          8       1.00      0.29      0.45        92
          9       1.00      0.68      0.81        92

avg / total       0.93      0.65      0.70       899

Fancier model selection¶

Hyperopt goes beyond the randomized and grid search methods and provides fancier hyperparameters optimization. This package is quite popular in the neural net communities.

Plotting the confusion matrix¶

The confusion matrix is a useful visual tool to asess the classification performance. In the ideal case, you would see only entries on the diagonal. In this case, every label would have been predicted perfectly.

In [38]:

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
plt.matshow(cm)

Out[38]:

<matplotlib.image.AxesImage at 0x10dfdb410>

Generating an ROC curve¶

Wee can compute the Receiver Operating Characteristic (ROC) curve to assess the quality of a binary classifier. In the following, we will use a synthetic binary classification task from the Hastie textbook. It consists of 12000 instances, each with 10 Gaussian features. We will make use of the probabilistic interpretation of logistic regression in order to produce the ROC curve.

In [39]:

from sklearn.datasets import make_hastie_10_2 
from sklearn.linear_model import LogisticRegression

X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

tuned_parameters = [{'penalty': ['l1', 'l2'],
                     'C': [1, 10, 100, 1000]}]

logreg = LogisticRegression()
clf = GridSearchCV(logreg, tuned_parameters)
clf.fit(X_train, y_train)
clf.best_score_

Out[39]:

0.49083333333333334

The roc_curve function takes as inputs the true labels for the binary classification as well as the probability estimate of the dominant class. We use np.max with axis=1 to compute the max across the columns (the second axis).

In [40]:

from sklearn.metrics import roc_curve
y_pred = clf.predict_proba(X_test)
y_score = np.max(y_pred, axis=1)
fpr, tpr, th = roc_curve(y_test, y_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fpr, tpr)

Out[40]:

[<matplotlib.lines.Line2D at 0x10df08f10>]

This is a pretty bad ROC curve.

Plotting the learning curve¶

sklearn.learning_curve.learning_curve is another one-liner which might be useful for illustrating the learning behavior of a given a model (for a fixed choice of parameters). Building upon our last example, we will plot the learning curve of the best model that we found through grid search.

In [41]:

from sklearn.learning_curve import learning_curve
from sklearn.cross_validation import ShuffleSplit

cv = ShuffleSplit(digits.data.shape[0], n_iter=10,
                                   test_size=0.2, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(clf.best_estimator_, X, y, cv=cv,
                                                        train_sizes=np.linspace(.1, 1.0, 5))

We can use the fill_between plotting function of Matplotlib to show the standard deviation in the score estimates.

In [42]:

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

plt.legend(loc="best")

Out[42]:

<matplotlib.legend.Legend at 0x10fa93650>

One-hot encoding of categorical features¶

Scikit-learn also offers convenient preprocessing functions for the most common cases. Let say that you have categorial data of the following kind:

In [43]:

measurements = [
     {'month': 'January', 'temperature': 33.},
     {'month': 'March', 'temperature': 12.},
     {'month': 'May', 'temperature': 18.},
     {'month': 'January', 'temperature': 12.},
 ]

DictVectorizer can be used to transform lists of dictionaries. Categorial variables are transformed using one-hot encoding.

In [44]:

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()

Out[44]:

array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.],
       [  1.,   0.,   0.,  12.]])

Vector quantization¶

If you need to discretize a continuous variable into a smaller set of discrete values, a possible approach is to use K-means. The fit_predict functions computes the clusters and returns the closest cluster centroid for every input point.

In [45]:

from sklearn.cluster import KMeans

clst = KMeans(n_clusters=4)
centroids = clst.fit_predict(iris.data).reshape(iris.data.shape[0], 1)
print iris.data[-10:]
print centroids[-10:]

[[ 6.7  3.1  5.6  2.4]
 [ 6.9  3.1  5.1  2.3]
 [ 5.8  2.7  5.1  1.9]
 [ 6.8  3.2  5.9  2.3]
 [ 6.7  3.3  5.7  2.5]
 [ 6.7  3.   5.2  2.3]
 [ 6.3  2.5  5.   1.9]
 [ 6.5  3.   5.2  2. ]
 [ 6.2  3.4  5.4  2.3]
 [ 5.9  3.   5.1  1.8]]
[[2]
 [2]
 [0]
 [2]
 [2]
 [2]
 [0]
 [0]
 [2]
 [0]]

We can further transform this representation into one-hot vectors if needed.

In [47]:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit_transform(centroids).toarray()[-10:]

Out[47]:

array([[ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.]])

Scaling¶

Certain algorithms require the data to be distributed more or less like the normal distribution. The process of transforming the data so that it has unit variance and zero mean is called standardization. We can apply standardization in one line with Scikit-learn:

In [48]:

from sklearn.preprocessing import scale

mu1 = np.array([0.5, 1.5])
sigma1 = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.random.multivariate_normal(mu1, sigma1, 1000)
plt.plot(X[:,0], X[:,1], '.')

Xs = scale(X)
plt.plot(Xs[:,0], Xs[:,1], '.')

Out[48]:

[<matplotlib.lines.Line2D at 0x10fded290>]

Whitening¶

It is sometimes necessary to transform the data so that the covariance is the identity, a decorrelation procedure called whitening. For example, given some points generated from a multivariate distribution:

In [49]:

mu1 = np.array([0.0, 0.0])
sigma1 = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.random.multivariate_normal(mu1, sigma1, 1000)
plt.plot(X[:,0], X[:,1], '.')

Out[49]:

[<matplotlib.lines.Line2D at 0x10ff1c190>]

We can sphere or whiten then data through PCA:

In [50]:

from sklearn.decomposition import PCA
pca = PCA(whiten=True)
Xwhiten = pca.fit_transform(X)
plt.plot(Xwhiten[:,0], Xwhiten[:,1], '.')

Out[50]:

[<matplotlib.lines.Line2D at 0x110083050>]

Counting word frequencies¶

When working with text data, you often have to estimate empirical probabilities. CountVectorizer allows you to write a one-liner for this task. We will use the 20 Newsgroup dataset to illustate this usecase.

In [51]:

from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups()

In [52]:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(twenty_train.data)
vectorizer.vocabulary_.get(u'algorithm')

Out[52]:

By default, CountVectorizer only extracts frequencies for single tokens. When looking at pair or triples of tokens, we then talk about n-grams models. CountVectorizer takes a ngram_range argument specifying the minimum and maximum order of the n-grams to extract.

The scikit-learn documentation has a dedicated page on text data. Make sure to take a look !

Oh, and here's a fun exercise ! Using CountVectorizer, estimate the empirical probabilies of some text corpus and sample sequences of words starting from an initial word. This would give you a primitive Markov Chain approach to text generation.

Learning more¶

NLTK for more advanced Natural Language Processing (NLP) tools in Python
statsmodels for autoregressive models
Pandas for data gymnastics on structured datasets
matplotlib to learn how to generate fancy plots
networkx for representing and operating on graphs
Starcluster, Statsmodels and Pandas another tutorial that I wrote on using Starcluster on Amazon EC2 in combination with statsmodels and Pandas
Diving into Open Data with IPython Notebook & Pandas on using Pandas on the Bixi dataset. (Btw, Julia is a former student of Prakash).
How to convert notebooks to Latex A great way to submit your assignments !