As an example, we will use the Iris flower data set, introduced by Sir Ronald Fisher.
from sklearn import datasets
iris = datasets.load_iris()
print(iris.DESCR)
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A. Fisher.

This is perhaps the best known database to be found in the pattern
recognition literature. Fisher's paper is a classic in the field and is
referenced frequently to this day. (See Duda & Hart, for example.) The data
set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
from IPython.display import Image
Image(filename='Iris_setosa.jpg') # from Wikipedia
Image(filename='Iris_versicolor.jpg')
Image(filename='Iris_virginica.jpg')
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Image(filename="Petal-sepal.jpg")
iris.data[:3]
array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2]])
iris.target[:3]
array([0, 0, 0])
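Since the targets are integer class codes, indexing target_names with them recovers the species labels; a quick check:
iris.target_names[iris.target[:3]]
array(['setosa', 'setosa', 'setosa'], dtype='<U10')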
import pylab  # matplotlib's pylab interface
colors = ('r', '#66ff66', '#4444ff')
X = iris.data
Y = iris.target
i, j = 2, 3  # petal length vs petal width
pylab.figure(figsize=(10, 10))
pylab.xlabel(iris.feature_names[i], fontsize=20)
pylab.ylabel(iris.feature_names[j], fontsize=20)
pylab.scatter(X[:, i], X[:, j], c=[colors[y] for y in Y], s=50)
[Figure: scatter plot of petal length vs. petal width, points coloured by species.]
import numpy as np
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 0], axis=0)
array([ 5.006, 3.418, 1.464, 0.244])
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 1], axis=0)
array([ 5.936, 2.77 , 4.26 , 1.326])
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 2], axis=0)
array([ 6.588, 2.974, 5.552, 2.026])
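The same per-class means fall out of NumPy boolean indexing in one line per class; a minimal equivalent of the three calls above (using the np alias imported there):
for c, name in enumerate(iris.target_names):
    # select the rows of one class and average each column
    print(name, iris.data[iris.target == c].mean(axis=0))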
In this example we use the k-nearest neighbours classifier: a point is assigned the class that wins a vote among its k nearest training points. Many other popular classifiers are available in scikit-learn behind the same fit/predict interface.
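For intuition, here is a from-scratch sketch of that voting rule (illustrative only; knn_predict is a hypothetical helper, not how scikit-learn implements it):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distances from the query point x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # labels of the k nearest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # majority vote among the neighbours
    return Counter(nearest).most_common(1)[0][0]
On the training subsample used below it should agree with scikit-learn's predictions.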
from sklearn import neighbors
knc = neighbors.KNeighborsClassifier(n_neighbors=5)
step = 5
knc.fit(X[::step], Y[::step])  # train on every 5th sample: 30 of the 150 points
KNeighborsClassifier()
knc.predict([[5.93, 2.77, 4.23, 1.3]])  # note the nested list: one row per sample
array([1])
knc.predict_proba([[5.93, 2.77, 4.23, 1.3]])
array([[ 0., 1., 0.]])
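These probabilities are simply neighbour vote fractions. As a sanity check, kneighbors (part of the KNeighborsClassifier API) returns the distances and indices of the consulted training points; the indices refer to the fitted subsample, so:
dist, ind = knc.kneighbors([[5.93, 2.77, 4.23, 1.3]])
Y[::step][ind]  # expected: array([[1, 1, 1, 1, 1]]), matching the probability of 1.0 for class 1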
Y_pred = knc.predict(X)
i, j = 2, 3
X_ok = np.array([x for k, x in enumerate(X) if Y[k] == Y_pred[k]])   # correctly classified
Y_ok = [y for k, y in enumerate(Y) if y == Y_pred[k]]
X_bad = np.array([x for k, x in enumerate(X) if Y[k] != Y_pred[k]])  # misclassified
pylab.figure(figsize=(10, 10))
pylab.xlabel(iris.feature_names[i], fontsize=20)
pylab.ylabel(iris.feature_names[j], fontsize=20)
pylab.scatter(X_ok[:, i], X_ok[:, j], c=[colors[y] for y in Y_ok], s=50)
pylab.scatter(X[::step, i], X[::step, j], c="w", marker="s", s=15)
pylab.scatter(X_bad[:, i], X_bad[:, j], marker="x", s=50)
[Figure: the same scatter, with training points overlaid as white squares and misclassified points marked with crosses.]
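To put a number on the crosses in the plot, count the disagreements directly (a quick check; both expressions count the same misclassified points):
(Y != Y_pred).sum(), len(X_bad)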
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer
cross_val = KFold(n_splits=10, shuffle=True)
def avg_miss(Y_true, Y_pred):
    # fraction of misclassified samples in a fold
    return (Y_true != Y_pred).sum() / len(Y_true)
cross_val_score(knc, X, Y,
                cv=cross_val, scoring=make_scorer(avg_miss))
array([ 0.06666667, 0.06666667, 0. , 0.06666667, 0.06666667, 0. , 0. , 0.06666667, 0. , 0. ])
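Averaging the per-fold rates gives one summary figure; for the folds shown above it works out to about 0.033, i.e. roughly 3% misclassified (KFold with shuffle=True redraws the folds on each run, so exact values vary):
scores = cross_val_score(knc, X, Y,
                         cv=cross_val, scoring=make_scorer(avg_miss))
scores.mean()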
Also: we are gathering programmers who want to help upgrade academia, science, and gifted education.