Python is great because it has so many packages. However, you will quickly discover that installing packages without administrative privileges can become challenging. There is virtualenv that lets you install packages locally but it can be a nightmare when you have C librairies to build.
If you want to do scientific computing with Python, I advise you to go with the Anaconda Python distribution. It will save you (and the SOCS sysadmins) a great deal of time. To install Anaconda on Linux for 64 bits machines, it would look like:
%%bash
wget http://repo.continuum.io/archive/Anaconda-2.0.1-Linux-x86_64.sh
chmod +x Anaconda-2.0.1-Linux-x86_64.sh
./Anaconda-2.0.1-Linux-x86_64.sh
conda update conda
conda update ipython
IPython notebook can then be started from the command line with ipython notebook --pylab=inline
in a directory where you want to store your notebooks.
If you miss the interactive style of the Matlab prompt, it's a good idea to use the IPython shell or the web-based IPython notebook. Notebooks embody the idea of literate programming interleaving text, math rendering, executable code, plots and other forms of medias. It's also a great way to embrace reproducibility ! Download this notebook and see for yourself.
Although IPython has the python suffix in its name, its modular design allows for a great deal of interoperability with other languages. For example, you can call bash through the so-called cell magics. For example:
%pwd
u'/Users/pierrelucbacon/workspace/notebooks'
In order to capture the output of a bash cell into Python variables, you could also do:
%%bash --out output --err error
echo $RANDOM
print output
22358
You can obtain the list of supported cell magics through %lsmagic
.
%lsmagic
Available line magics: %alias %alias_magic %autocall %automagic %autosave %bookmark %cat %cd %clear %colors %config %connect_info %cp %debug %dhist %dirs %doctest_mode %ed %edit %env %gui %hist %history %install_default_config %install_ext %install_profiles %killbgscripts %ldir %less %lf %lk %ll %load %load_ext %loadpy %logoff %logon %logstart %logstate %logstop %ls %lsmagic %lx %macro %magic %man %matplotlib %mkdir %more %mv %notebook %page %pastebin %pdb %pdef %pdoc %pfile %pinfo %pinfo2 %popd %pprint %precision %profile %prun %psearch %psource %pushd %pwd %pycat %pylab %qtconsole %quickref %recall %rehashx %reload_ext %rep %rerun %reset %reset_selective %rm %rmdir %run %save %sc %store %sx %system %tb %time %timeit %unalias %unload_ext %who %who_ls %whos %xdel %xmode Available cell magics: %%! %%HTML %%SVG %%bash %%capture %%debug %%file %%html %%javascript %%latex %%perl %%prun %%pypy %%python %%python2 %%python3 %%ruby %%script %%sh %%svg %%sx %%system %%time %%timeit %%writefile Automagic is ON, % prefix IS NOT needed for line magics.
Another useful cell magic is also %timeit
, giving you an estimate of the time spent executing a given function. For example:
%timeit np.linalg.eigvals(np.random.rand(50,50))
1000 loops, best of 3: 1.12 ms per loop
You can learn more about various cell magics in this notebook
If you have R installed, IPython is able to execute a cell in R and return the result to Python. This can be useful if you find that a particular library was written only for R while your project uses mainly Python.
%%bash
pip install rpy2
%load_ext rpy2.ipython
x = np.linspace(0,10,50)
y = 0.9*x + np.random.randn(50)
plt.plot(x,y,'.')
[<matplotlib.lines.Line2D at 0x10b355dd0>]
%Rpush x y
%R lm(y~x)$coef
<FloatVector - Python:0x10b17dfc8 / R:0x102e54240> [0.412314, 0.830338]
You can learn more about the rmagic in this notebook.
The pymatbridge module and cell magic allows you to spawn matlab in the background and feed it commands from IPython. This extension is not part of the core IPython and you will also need to install it manually.
%load_ext pymatbridge
Starting MATLAB on http://localhost:53704 visit http://localhost:53704/exit.m to shut down same .....MATLAB started and connected!
%%matlab
x = linspace(0.01,6*pi,100);
plot(sin(x))
Numpy is a Python library which provides matrix manipulation capabilities and other standard numerical algorithms. You can load it under the np
namespace with:
import numpy as np
You can create a vector from an existing tuple:
x = np.array([1,2,3,4])
print x
print x.shape
[1 2 3 4] (4,)
Similarly, you can define a numpy matrix as:
X = np.array([[1,2], [3,4]])
print X
print X.shape
[[1 2] [3 4]] (2, 2)
x = np.linspace(0, 10, 25)
print x
[ 0. 0.41666667 0.83333333 1.25 1.66666667 2.08333333 2.5 2.91666667 3.33333333 3.75 4.16666667 4.58333333 5. 5.41666667 5.83333333 6.25 6.66666667 7.08333333 7.5 7.91666667 8.33333333 8.75 9.16666667 9.58333333 10. ]
The standards matrices can be generated with:
np.zeros((5,5))
array([[ 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0.], [ 0., 0., 0., 0., 0.]])
np.ones((5,5))
array([[ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.], [ 1., 1., 1., 1., 1.]])
np.eye(5)
array([[ 1., 0., 0., 0., 0.], [ 0., 1., 0., 0., 0.], [ 0., 0., 1., 0., 0.], [ 0., 0., 0., 1., 0.], [ 0., 0., 0., 0., 1.]])
As opposed to Matlab, Numpy is 0-indexed. Indexing can be performed efficiently through slicing.
X = np.random.randn(4,4)
print X
[[-0.3391406 0.62073092 -0.2864898 1.05243832] [ 0.92636541 0.78939298 0.00748023 0.8718698 ] [ 0.01819851 -0.15747068 -0.07151959 0.58040635] [-1.76144683 0.1659408 -0.9622996 -0.65804837]]
Selecting all rows except the first and all columns:
X[1:,:]
array([[ 0.92636541, 0.78939298, 0.00748023, 0.8718698 ], [ 0.01819851, -0.15747068, -0.07151959, 0.58040635], [-1.76144683, 0.1659408 , -0.9622996 , -0.65804837]])
Select columns 2 and 0 and all rows:
X[:,[2,0]]
array([[-0.2864898 , -0.3391406 ], [ 0.00748023, 0.92636541], [-0.07151959, 0.01819851], [-0.9622996 , -1.76144683]])
-1
(minus one) stands for the last element in a given dimension
X[-1,:]
array([-1.76144683, 0.1659408 , -0.9622996 , -0.65804837])
A powerful indexing feature can also be achieved through logical indexing. You can create a boolean array from an ndarray:
X < 0.5
array([[ True, False, True, False], [False, False, True, False], [ True, True, True, False], [ True, True, True, True]], dtype=bool)
The boolean ndarray can then be used for indexing:
X[X < 0.5]
array([-0.3391406 , -0.2864898 , 0.00748023, 0.01819851, -0.15747068, -0.07151959, -1.76144683, 0.1659408 , -0.9622996 , -0.65804837])
If you know Matlab already, make sure to look at the cheatsheet Numpy for matlab users.
You can import and export data from a variety of formats. For example, you can save a numpy array to a text format using np.savetxt.
X = np.random.normal(size=(5,5))
np.savetxt('data.txt', X)
%cat data.txt
8.156678852559566817e-01 -3.131124534131186632e-01 8.337257398597310853e-01 9.214720052339393508e-01 1.085875976924086217e+00 -5.035982362246711475e-01 -1.786739101987857847e-01 -8.520783372461552263e-01 2.078405838941234929e-01 1.096214780850903514e+00 -6.567926252898417250e-01 -7.454343453128293717e-01 -1.143679949305732579e+00 3.589055086963070518e-01 1.382764023257532227e+00 4.030407399756600895e-01 -4.347724112286717180e-01 -3.709554624484467933e-02 -1.868833136961473906e+00 1.845541086337965653e-01 -9.785313966063215185e-02 8.348666702977869392e-01 -1.781600107757391305e+00 -2.223242216704917340e-01 1.715788698981550509e+00
X = np.loadtxt('data.txt')
Unless you want your data files to be human readable in text form, it is usually preferable to use http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html rather than np.savetxt
. The array will then be saved in a more compact Numpy (.npy
) format.
If you can export Matlab matrices in versions v4 (Level 1.0), v6 or v7 to 7.2, then you would be able to use scipy.io.loadmat to load your data in numpy.
%%matlab
x = magic(4);
save('test.mat', 'x')
from scipy.io import loadmat
print loadmat('test.mat')
{'x': array([[16, 2, 3, 13], [ 5, 11, 10, 8], [ 9, 7, 6, 12], [ 4, 14, 15, 1]], dtype=uint8), '__version__': '1.0', '__header__': 'MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Mon Sep 22 23:21:18 2014', '__globals__': []}
The corresponding scipy.io.savemat is also available.
Scikit-learn offers a variety of classification, regression and clustering algorithms under a unified interface. If you decided to install Anaconda, it should already be available. Standard datasets are shipped with Scikit. They can be quite useful for debugging your algorithm. We will load the iris dataset with:
from sklearn import linear_model, datasets
iris = datasets.load_iris()
X = iris.data[:,:]
Y = iris.target
Scikit-learn provides different cross-validation techniques. In the following StratifiedKFold
is an iterator which outputs pairs of indices for the train and test instances. You then use these indices to slice your dataset.
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import mean_squared_error
skf = StratifiedKFold(Y, 3)
logreg = linear_model.LogisticRegression(C=1e5)
for train, test in skf:
ypred = logreg.fit(X[train], Y[train]).predict(X[test])
print mean_squared_error(Y[test], ypred)
0.0196078431373 0.117647058824 0.0
The above loop can also be written more succinctly using cross_val_score. By specifying n_jobs=k
, you can compute the cross-validation loop in parallel in k
processes. If k=-1
, all CPUs are used.
from sklearn.cross_validation import cross_val_score
cross_val_score(logreg, X, Y, cv=skf, scoring='mean_squared_error', n_jobs=-1)
array([-0.01960784, -0.11764706, -0. ])
Using cross-validated scores, you can perform model selection efficiently through GridSearchCV. In the following, we will load the digits dataset and performe grid search over the parameters of an SVM classifier.
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
digits = datasets.load_digits()
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
GridSearchCV
takes an estimator (regression, classification or clustering), a dictionary of parameters to try for that particular estimator, a cross-validation iterator and a scoring function. Once again, the parameter search can be performed in parallel with n_jobs=-1
. The fit
function of GridSearchCV
then scores every possible combination of the tuned_parameters
and saves the best one in the best_estimator_
attribute.
skf = StratifiedKFold(Y, 5)
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=skf, n_jobs=-1)
clf.fit(X_train, y_train)
print clf.best_estimator_
print "Grid scores on train set:"
for params, mean_score, scores in clf.grid_scores_:
print("%0.3f (+/-%0.03f) for %r"%(mean_score, scores.std() / 2, params))
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) Grid scores on train set: 0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001} 0.767 (+/-0.018) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001} 0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001} 0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001} 0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001} 0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001} 0.933 (+/-0.018) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001} 0.940 (+/-0.012) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001} 0.940 (+/-0.012) for {'kernel': 'linear', 'C': 1} 0.940 (+/-0.012) for {'kernel': 'linear', 'C': 10} 0.940 (+/-0.012) for {'kernel': 'linear', 'C': 100} 0.940 (+/-0.012) for {'kernel': 'linear', 'C': 1000}
The classification performance of the best estimator can then be reported with classification_report.
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
precision recall f1-score support 0 1.00 1.00 1.00 89 1 0.96 1.00 0.98 90 2 1.00 0.99 0.99 92 3 0.99 0.99 0.99 93 4 1.00 1.00 1.00 76 5 0.97 0.97 0.97 108 6 0.99 0.99 0.99 89 7 0.99 1.00 0.99 78 8 0.99 0.96 0.97 92 9 0.99 0.98 0.98 92 avg / total 0.99 0.99 0.99 899
It might happen that the set of parameters over which you would hope to run GridSearchCV
would be too large. We could then decide to put a prior or the parameters and perform a stochastic search. RandomizedSearchCV supports any of the standard distribution from scipy.stats as priors. In this example, we will set an exponential priors over the C
and gamma
parameters.
import scipy.stats
from sklearn.grid_search import RandomizedSearchCV
tuned_parameters = {'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
'kernel': ['rbf'], 'class_weight':['auto', None]}
clf = RandomizedSearchCV(SVC(C=1), tuned_parameters, n_iter=10, n_jobs=-1)
clf.fit(X_train, y_train)
print clf.best_estimator_
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
SVC(C=77.5702758756, cache_size=200, class_weight='auto', coef0=0.0, degree=3, gamma=0.00989451451095, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False) precision recall f1-score support 0 1.00 0.78 0.87 89 1 1.00 0.63 0.78 90 2 1.00 0.57 0.72 92 3 1.00 0.72 0.84 93 4 0.19 1.00 0.32 76 5 1.00 0.27 0.42 108 6 1.00 0.78 0.87 89 7 1.00 0.91 0.95 78 8 1.00 0.29 0.45 92 9 1.00 0.68 0.81 92 avg / total 0.93 0.65 0.70 899
The confusion matrix is a useful visual tool to asess the classification performance. In the ideal case, you would see only entries on the diagonal. In this case, every label would have been predicted perfectly.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
plt.matshow(cm)
<matplotlib.image.AxesImage at 0x10dfdb410>
Wee can compute the Receiver Operating Characteristic (ROC) curve to assess the quality of a binary classifier. In the following, we will use a synthetic binary classification task from the Hastie textbook. It consists of 12000 instances, each with 10 Gaussian features. We will make use of the probabilistic interpretation of logistic regression in order to produce the ROC curve.
from sklearn.datasets import make_hastie_10_2
from sklearn.linear_model import LogisticRegression
X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
tuned_parameters = [{'penalty': ['l1', 'l2'],
'C': [1, 10, 100, 1000]}]
logreg = LogisticRegression()
clf = GridSearchCV(logreg, tuned_parameters)
clf.fit(X_train, y_train)
clf.best_score_
0.49083333333333334
The roc_curve function takes as inputs the true labels for the binary classification as well as the probability estimate of the dominant class. We use np.max
with axis=1
to compute the max across the columns (the second axis).
from sklearn.metrics import roc_curve
y_pred = clf.predict_proba(X_test)
y_score = np.max(y_pred, axis=1)
fpr, tpr, th = roc_curve(y_test, y_score)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fpr, tpr)
[<matplotlib.lines.Line2D at 0x10df08f10>]
This is a pretty bad ROC curve.
sklearn.learning_curve.learning_curve is another one-liner which might be useful for illustrating the learning behavior of a given a model (for a fixed choice of parameters). Building upon our last example, we will plot the learning curve of the best model that we found through grid search.
from sklearn.learning_curve import learning_curve
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(digits.data.shape[0], n_iter=10,
test_size=0.2, random_state=0)
train_sizes, train_scores, test_scores = learning_curve(clf.best_estimator_, X, y, cv=cv,
train_sizes=np.linspace(.1, 1.0, 5))
We can use the fill_between plotting function of Matplotlib to show the standard deviation in the score estimates.
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.grid()
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
<matplotlib.legend.Legend at 0x10fa93650>
Scikit-learn also offers convenient preprocessing functions for the most common cases. Let say that you have categorial data of the following kind:
measurements = [
{'month': 'January', 'temperature': 33.},
{'month': 'March', 'temperature': 12.},
{'month': 'May', 'temperature': 18.},
{'month': 'January', 'temperature': 12.},
]
DictVectorizer can be used to transform lists of dictionaries. Categorial variables are transformed using one-hot encoding.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.], [ 0., 1., 0., 12.], [ 0., 0., 1., 18.], [ 1., 0., 0., 12.]])
If you need to discretize a continuous variable into a smaller set of discrete values, a possible approach is to use K-means. The fit_predict
functions computes the clusters and returns the closest cluster centroid for every input point.
from sklearn.cluster import KMeans
clst = KMeans(n_clusters=4)
centroids = clst.fit_predict(iris.data).reshape(iris.data.shape[0], 1)
print iris.data[-10:]
print centroids[-10:]
[[ 6.7 3.1 5.6 2.4] [ 6.9 3.1 5.1 2.3] [ 5.8 2.7 5.1 1.9] [ 6.8 3.2 5.9 2.3] [ 6.7 3.3 5.7 2.5] [ 6.7 3. 5.2 2.3] [ 6.3 2.5 5. 1.9] [ 6.5 3. 5.2 2. ] [ 6.2 3.4 5.4 2.3] [ 5.9 3. 5.1 1.8]] [[2] [2] [0] [2] [2] [2] [0] [0] [2] [0]]
We can further transform this representation into one-hot vectors if needed.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit_transform(centroids).toarray()[-10:]
array([[ 0., 0., 1., 0.], [ 0., 0., 1., 0.], [ 1., 0., 0., 0.], [ 0., 0., 1., 0.], [ 0., 0., 1., 0.], [ 0., 0., 1., 0.], [ 1., 0., 0., 0.], [ 1., 0., 0., 0.], [ 0., 0., 1., 0.], [ 1., 0., 0., 0.]])
Certain algorithms require the data to be distributed more or less like the normal distribution. The process of transforming the data so that it has unit variance and zero mean is called standardization. We can apply standardization in one line with Scikit-learn:
from sklearn.preprocessing import scale
mu1 = np.array([0.5, 1.5])
sigma1 = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.random.multivariate_normal(mu1, sigma1, 1000)
plt.plot(X[:,0], X[:,1], '.')
Xs = scale(X)
plt.plot(Xs[:,0], Xs[:,1], '.')
[<matplotlib.lines.Line2D at 0x10fded290>]
It is sometimes necessary to transform the data so that the covariance is the identity, a decorrelation procedure called whitening. For example, given some points generated from a multivariate distribution:
mu1 = np.array([0.0, 0.0])
sigma1 = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.random.multivariate_normal(mu1, sigma1, 1000)
plt.plot(X[:,0], X[:,1], '.')
[<matplotlib.lines.Line2D at 0x10ff1c190>]
We can sphere or whiten then data through PCA:
from sklearn.decomposition import PCA
pca = PCA(whiten=True)
Xwhiten = pca.fit_transform(X)
plt.plot(Xwhiten[:,0], Xwhiten[:,1], '.')
[<matplotlib.lines.Line2D at 0x110083050>]
When working with text data, you often have to estimate empirical probabilities. CountVectorizer allows you to write a one-liner for this task. We will use the 20 Newsgroup dataset to illustate this usecase.
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups()
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(twenty_train.data)
vectorizer.vocabulary_.get(u'algorithm')
27366
By default, CountVectorizer
only extracts frequencies for single tokens. When looking at pair or triples of tokens, we then talk about n-grams models. CountVectorizer
takes a ngram_range
argument specifying the minimum and maximum order of the n-grams to extract.
The scikit-learn documentation has a dedicated page on text data. Make sure to take a look !
Oh, and here's a fun exercise ! Using CountVectorizer
, estimate the empirical probabilies of some text corpus and sample sequences of words starting from an initial word. This would give you a primitive Markov Chain approach to text generation.