Me: Chris Hausler

Today - Pandas and Scikit-Learn

And a lot of firsts

first MPUG meeting... Hi
first presentation using IPython Notebook

Python Data Analysis Library (pandas)

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Creates something similar to R DataFrames... but better

I think it's great, but I'm still a bit clumsy with it... also the doco is still a little hit and miss

Some imports

In [1]:

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib 
%matplotlib inline
pd.__version__

Out[2]:

'0.13.1'

`pandas` has two main data structures: `Series` and `DataFrame`

Series - Like a one-dimensional array but better

In [3]:

values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals

Out[3]:

0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64

Each value is now associated with an index. The index itself is an object of class Index and can be manipulated directly.

In [4]:

vals.index

Out[4]:

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [5]:

vals.values

Out[5]:

array([5, 3, 4, 8, 2, 9])

In [6]:

vals * 2.5

Out[6]:

0    12.5
1     7.5
2    10.0
3    20.0
4     5.0
5    22.5
dtype: float64

We can give the index named labels

In [7]:

vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2

Out[7]:

tom         5
sally       3
jeff        4
george      8
pablo       2
florence    9
dtype: int64

And use these to get the data we want

In [8]:

vals2[['florence','tom']]

Out[8]:

florence    9
tom         5
dtype: int64

In [9]:

vals2[['florence','tom','kate']]

Out[9]:

florence     9
tom          5
kate       NaN
dtype: float64
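
On recent pandas releases, indexing a Series with a list that contains a missing label (such as 'kate' above) raises a KeyError instead of returning NaN. A minimal sketch of the equivalent using `reindex`, assuming a current pandas version; the names `people` and `picked` and the data are illustrative:

```python
import pandas as pd

people = pd.Series([5, 3, 4], index=['tom', 'sally', 'jeff'])

# reindex returns the requested labels in the requested order and
# fills labels that do not exist in the Series ('kate') with NaN
picked = people.reindex(['jeff', 'tom', 'kate'])
print(picked)
```

Because NaN is a float, the result is upcast from int64 to float64, just as in the output above.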

Dealing with missing values

In [10]:

vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3

Out[10]:

tom          5
sally        3
pablo        2
florence     9
ricky      NaN
katrin     NaN
dtype: float64

Get rid of them

In [11]:

vals3.dropna()

Out[11]:

tom         5
sally       3
pablo       2
florence    9
dtype: float64

Fill them with a value

In [12]:

vals3.fillna(0)

Out[12]:

tom         5
sally       3
pablo       2
florence    9
ricky       0
katrin      0
dtype: float64

Fill them with a calculated value

In [13]:

vals3.fillna(vals3.mean())

Out[13]:

tom         5.00
sally       3.00
pablo       2.00
florence    9.00
ricky       4.75
katrin      4.75
dtype: float64

Use a function like forward fill

In [14]:

vals3.fillna(method='ffill')

Out[14]:

tom         5
sally       3
pablo       2
florence    9
ricky       9
katrin      9
dtype: float64
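
The strategies above, side by side on a small made-up Series. This sketch assumes a recent pandas release, where `Series.ffill()` is the preferred spelling of `fillna(method='ffill')`; the names and values are illustrative:

```python
import numpy as np
import pandas as pd

scores = pd.Series([5.0, np.nan, 2.0, np.nan],
                   index=['tom', 'ricky', 'pablo', 'katrin'])

dropped = scores.dropna()                    # remove missing entries
zero_filled = scores.fillna(0)               # fill with a constant
mean_filled = scores.fillna(scores.mean())   # mean of the observed values
forward = scores.ffill()                     # carry the last seen value forward
print(forward)
```

Which one is appropriate depends on the data: forward fill only makes sense when the row order is meaningful, as in a time series.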

A handy way to get a picture of our data

In [15]:

vals3.describe()

Out[15]:

count    4.000000
mean     4.750000
std      3.095696
min      2.000000
25%      2.750000
50%      4.000000
75%      6.000000
max      9.000000
dtype: float64

DataFrame - Like a 2D array... with bells and whistles

In [16]:

vals.index=pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3=vals3[['tom','sally','pablo','florence','billy','katrin']]

In [17]:

# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat

Out[17]:

	new	orig
billy	NaN	NaN
florence	9	8
katrin	NaN	9
pablo	2	4
ricky	NaN	2
sally	3	3
tom	5	5

7 rows × 2 columns

Check for nulls

In [18]:

dat.isnull()

Out[18]:

	new	orig
billy	True	True
florence	False	False
katrin	True	False
pablo	False	False
ricky	True	False
sally	False	False
tom	False	False

7 rows × 2 columns

Drop rows with nulls

In [19]:

dat.dropna()

Out[19]:

	new	orig
florence	9	8
pablo	2	4
sally	3	3
tom	5	5

4 rows × 2 columns
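
Building a DataFrame from a dict of Series aligns them on the union of their index labels, which is where the NaN rows above come from. A small sketch with made-up data (the variable names are illustrative):

```python
import pandas as pd

orig = pd.Series([5, 3], index=['tom', 'sally'])
new = pd.Series([9.0, 5.0], index=['florence', 'tom'])

# Rows are the union of both indexes; labels missing from one
# Series become NaN in that column
frame = pd.DataFrame({'orig': orig, 'new': new})

complete = frame.dropna()          # keep only rows with no missing values
n_missing = frame.isnull().sum()   # count of missing values per column
print(frame)
```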

Time series with pandas DataFrames - a winning combination

Data from Google Trends: what correlates (positively and negatively) with the search term "hipster"?

Read hipster correlations from a CSV file

pandas can read and write many file formats, including

csv
json
pickle
the clipboard
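
A round trip through CSV and JSON, kept in memory with `io.StringIO` so that no files are needed. The column names and values are made up for illustration:

```python
import io
import pandas as pd

table = pd.DataFrame({'term': ['hipster', 'techno'], 'score': [1.5, -1.2]})

csv_text = table.to_csv(index=False)            # write CSV to a string
from_csv = pd.read_csv(io.StringIO(csv_text))   # and read it back

json_text = table.to_json(orient='records')     # a list of row objects
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)
```

The same reader and writer functions accept file paths, which is how `read_csv` is used in the cells that follow.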

In [20]:

hipster = pd.read_csv('hipster.csv')
hipster[:10]

Out[20]:

	Date	hipster	modcloth	gumtree perth
0	2004-01-04	-0.976	-0.817	-0.844
1	2004-01-11	-0.816	-0.817	-0.844
2	2004-01-18	-0.837	-0.817	-0.844
3	2004-01-25	-0.976	-0.817	-0.844
4	2004-02-01	-0.722	-0.817	-0.844
5	2004-02-08	-0.795	-0.817	-0.844
6	2004-02-15	-0.723	-0.817	-0.844
7	2004-02-22	-0.713	-0.817	-0.844
8	2004-02-29	-0.786	-0.817	-0.844
9	2004-03-07	-0.675	-0.817	-0.844

10 rows × 4 columns

Set the index to a datetime

In [21]:

hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]

Out[21]:

	hipster	modcloth	gumtree perth
2004-01-04	-0.976	-0.817	-0.844
2004-01-11	-0.816	-0.817	-0.844
2004-01-18	-0.837	-0.817	-0.844
2004-01-25	-0.976	-0.817	-0.844
2004-02-01	-0.722	-0.817	-0.844
2004-02-08	-0.795	-0.817	-0.844
2004-02-15	-0.723	-0.817	-0.844
2004-02-22	-0.713	-0.817	-0.844
2004-02-29	-0.786	-0.817	-0.844
2004-03-07	-0.675	-0.817	-0.844

10 rows × 3 columns
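
`read_csv` can also build the datetime index while parsing. A sketch that uses an inline CSV string in place of `hipster.csv` (the two rows are copied from the output above; `weekly` is an illustrative name):

```python
import io
import pandas as pd

csv_text = "Date,hipster\n2004-01-04,-0.976\n2004-01-11,-0.816\n"

# parse_dates converts the Date column to datetimes and
# index_col makes it the index, giving a DatetimeIndex
weekly = pd.read_csv(io.StringIO(csv_text),
                     parse_dates=['Date'], index_col='Date')
print(weekly.index)
```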

Now load the anti-Hipster data

In [22]:

not_hipster = pd.read_csv('negative-hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))

In [23]:

not_hipster[:10]

Out[23]:

	yellow pages	windows installer	techno
2004-01-04	1.341	0.668	0.871
2004-01-11	1.239	1.000	1.122
2004-01-18	1.022	0.768	1.053
2004-01-25	0.923	0.943	0.807
2004-02-01	0.904	0.799	0.612
2004-02-08	0.786	0.613	0.614
2004-02-15	0.729	0.956	0.391
2004-02-22	0.537	0.667	1.124
2004-02-29	0.534	1.415	1.078
2004-03-07	0.229	0.220	1.918

10 rows × 3 columns

Check the values of one column

In [24]:

hipster.hipster.head()

Out[24]:

2004-01-04   -0.976
2004-01-11   -0.816
2004-01-18   -0.837
2004-01-25   -0.976
2004-02-01   -0.722
Name: hipster, dtype: float64

Check another, but get the values as a numpy.ndarray

In [25]:

hipster['gumtree perth'].values[:20]

Out[25]:

array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844,
       -0.844, -0.844, -0.844, -0.844])

View the data types; the columns don't need to be homogeneous

In [26]:

hipster.dtypes

Out[26]:

hipster          float64
modcloth         float64
gumtree perth    float64
dtype: object

Joins on indexes are easy!

In [27]:

trend = hipster.join(not_hipster, how='inner')
trend.head()

Out[27]:

	hipster	modcloth	gumtree perth	yellow pages	windows installer	techno
2004-01-04	-0.976	-0.817	-0.844	1.341	0.668	0.871
2004-01-11	-0.816	-0.817	-0.844	1.239	1.000	1.122
2004-01-18	-0.837	-0.817	-0.844	1.022	0.768	1.053
2004-01-25	-0.976	-0.817	-0.844	0.923	0.943	0.807
2004-02-01	-0.722	-0.817	-0.844	0.904	0.799	0.612

5 rows × 6 columns
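
An inner join keeps only the index values present in both frames, while an outer join keeps all of them. A sketch with two tiny made-up frames indexed by date:

```python
import pandas as pd

left = pd.DataFrame(
    {'hipster': [-0.9, -0.8, -0.7]},
    index=pd.to_datetime(['2004-01-04', '2004-01-11', '2004-01-18']))
right = pd.DataFrame(
    {'techno': [1.1, 1.0]},
    index=pd.to_datetime(['2004-01-11', '2004-01-18']))

inner = left.join(right, how='inner')   # only dates found in both frames
outer = left.join(right, how='outer')   # all dates, NaN where one side is missing
print(outer)
```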

We can check the column names and values

In [28]:

trend.columns

Out[28]:

Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'windows installer', u'techno'], dtype='object')

In [29]:

trend.values

Out[29]:

array([[-0.976, -0.817, -0.844,  1.341,  0.668,  0.871],
       [-0.816, -0.817, -0.844,  1.239,  1.   ,  1.122],
       [-0.837, -0.817, -0.844,  1.022,  0.768,  1.053],
       ..., 
       [ 1.142,  1.175,  1.394, -1.69 , -1.77 , -1.836],
       [ 1.187,  1.221,  1.403, -1.706, -1.752, -1.796],
       [ 1.514,  1.216,  1.365, -1.72 , -1.701, -1.883]])

Filtering on date ranges is simple

In [30]:

trend['2012-01-01':].head()

Out[30]:

	hipster	modcloth	gumtree perth	yellow pages	windows installer	techno
2012-01-01	1.411	1.192	1.774	-1.077	-1.134	-1.285
2012-01-08	1.513	1.111	1.579	-0.995	-1.183	-1.189
2012-01-15	1.523	1.427	1.613	-1.027	-1.161	-1.337
2012-01-22	1.600	1.490	1.514	-1.140	-1.177	-1.345
2012-01-29	1.459	1.561	1.511	-1.046	-1.224	-1.233

5 rows × 6 columns

In [31]:

trend['2012-01-01': '2013-01-01'].tail(3)

Out[31]:

	hipster	modcloth	gumtree perth	yellow pages	windows installer	techno
2012-12-16	1.645	1.175	1.407	-1.433	-1.515	-1.687
2012-12-23	1.591	1.695	1.625	-1.698	-1.655	-1.504
2012-12-30	1.596	1.515	1.868	-1.515	-1.598	-1.674

3 rows × 6 columns

We can also grab a single date, or a subset of columns

In [32]:

trend.ix['2012-01-01', ['hipster', 'modcloth']]

Out[32]:

hipster     1.411
modcloth    1.192
Name: 2012-01-01 00:00:00, dtype: float64
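
Note that `.ix` was later deprecated and removed from pandas; label-based selection is written with `.loc` in current releases. A sketch on made-up weekly data (the frame `demo` and its values are illustrative):

```python
import pandas as pd

dates = pd.date_range('2011-12-18', periods=4, freq='7D')
demo = pd.DataFrame({'hipster': [1.2, 1.3, 1.4, 1.5],
                     'modcloth': [1.0, 1.1, 1.2, 1.1],
                     'techno': [-1.0, -1.1, -1.3, -1.2]}, index=dates)

# One date and a subset of columns
row = demo.loc[pd.Timestamp('2012-01-01'), ['hipster', 'modcloth']]

# A date range; with labels, both ends of the slice are included
january = demo.loc['2012-01-01':'2012-01-31']
print(january)
```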

Or do some boolean filtering

In [33]:

trend[trend.techno < 0].head()

Out[33]:

	hipster	modcloth	gumtree perth	yellow pages	windows installer	techno
2004-04-11	-0.510	-0.817	-0.844	0.521	0.301	-0.270
2006-01-29	-0.838	-0.817	-0.844	1.421	1.309	-0.081
2006-06-25	-0.799	-0.817	-0.833	1.142	1.458	-0.070
2010-01-24	-0.454	-0.183	-0.107	-0.017	0.053	-0.010
2010-01-31	-0.381	-0.276	-0.142	0.187	0.116	-0.044

5 rows × 6 columns

Plotting is built in, and handles dates more easily than plain matplotlib

In [34]:

_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)

We can also do it for a single column

In [35]:

_ = trend.hipster.cumsum().plot()

Or split the columns out to subplots

In [36]:

axs = trend.plot(subplots=True, figsize=(10, 10))

Resampling data is also straightforward.

In [37]:

# resample by month
trend.resample('M', how='mean').head()

Out[37]:

	hipster	modcloth	gumtree perth	yellow pages	windows installer	techno
2004-01-31	-0.90125	-0.817	-0.844	1.13125	0.84475	0.96325
2004-02-29	-0.74780	-0.817	-0.844	0.69800	0.89000	0.76380
2004-03-31	-0.78950	-0.817	-0.844	0.35650	0.73125	1.09175
2004-04-30	-0.70400	-0.817	-0.844	0.48125	0.89125	0.41950
2004-05-31	-0.81820	-0.817	-0.844	0.34780	0.62040	0.72860

5 rows × 6 columns
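
In later pandas releases the `how=` argument was removed and the aggregation is chained as a method instead. A sketch on synthetic weekly data, assuming a recent pandas; it uses the `'MS'` (month start) alias, so each bin is labelled with the first day of the month rather than the last:

```python
import pandas as pd

idx = pd.date_range('2004-01-04', periods=8, freq='7D')
weekly = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], index=idx)

# Group the weekly values by calendar month and average each group
monthly = weekly.resample('MS').mean()
print(monthly)
```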

And here by year; you can also resample by business day, week, month, quarter and a bunch of other frequencies

In [38]:

# resample by year
_ = trend.resample('A', how='mean').plot(figsize=(10, 10))

Other fancy plots include a scatter matrix with a kernel density estimate (KDE) on the diagonal

In [39]:

# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')

Titanic: Machine Learning from Disaster (kaggle.com)

Load the data, explore it and learn from it

In [40]:

df = pd.read_csv('train.csv', header=0)

In [41]:

df.head()

Out[41]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500	NaN	S

5 rows × 12 columns

Let's look at the data types here (this time they're heterogeneous)

In [42]:

df.dtypes

Out[42]:

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We can also get a more verbose summary

In [43]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)

DataFrames can be grouped, as in SQL (it sucked to be a young male on the Titanic)

In [44]:

df_grouped = df.groupby(['Pclass', 'Sex'])

In [45]:

df_grouped[['Age', 'Survived']].mean()

Out[45]:

		Age	Survived
Pclass	Sex
1	female	34.611765	0.968085
1	male	41.281386	0.368852
2	female	28.722973	0.921053
2	male	30.740707	0.157407
3	female	21.750000	0.500000
3	male	26.507589	0.135447

6 rows × 2 columns
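
The grouping above corresponds roughly to `GROUP BY Pclass, Sex` with an `AVG` aggregate in SQL. A sketch on a handful of made-up rows (not the real passenger data):

```python
import pandas as pd

passengers = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the survival rate within each group
rates = passengers.groupby(['Pclass', 'Sex'])['Survived'].mean()
print(rates)
```

The result is a Series with a two-level index, so a single group can be looked up with a tuple such as `rates[(3, 'male')]`.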

Histograms are straightforward

In [46]:

ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')

Out[46]:

<matplotlib.text.Text at 0x7576ed0>

So are boxplots

In [47]:

bp = df.boxplot(column='Age', by='Pclass', grid=False)
for i in set(df.Pclass):
    y = df.Age[df.Pclass==i].dropna()
    # Add some random "jitter" to the x-axis
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)

If we want to do some learning on this data, let's convert gender to a binary numeric column

In [48]:

df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()

Out[48]:

	Sex	isFemale
0	male	0
1	female	1
2	female	1
3	female	1
4	male	0

5 rows × 2 columns

Find non-numeric columns so we can drop them later

In [49]:

drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols

Out[49]:

Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')
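
An alternative sketch for the same goal: `select_dtypes` keeps the numeric columns directly instead of listing the ones to drop. The small frame here is made up for illustration:

```python
import pandas as pd

sample = pd.DataFrame({'Name': ['Braund', 'Cumings'],
                       'Age': [22.0, 38.0],
                       'Pclass': [3, 1]})

# Keep integer and float columns, drop everything else
numeric = sample.select_dtypes(include='number')
print(numeric.columns.tolist())
```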

In [50]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 13 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
isFemale       891 non-null int64
dtypes: float64(2), int64(6), object(5)

Set up our data to learn from

In [51]:

X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1) 
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()

   Pclass  Age  SibSp  Parch     Fare  isFemale
0       3   22      1      0   7.2500         0
1       1   38      1      0  71.2833         1
2       3   26      0      0   7.9250         1
3       1   35      1      0  53.1000         1
4       3   35      0      0   8.0500         0

[5 rows x 6 columns]

Have a quick look at the class distribution

In [52]:

y.groupby(y.values).count()

Out[52]:

0    549
1    342
dtype: int64

And fill in the missing ages with the median age

In [53]:

X['Age'] = X.Age.fillna(X.Age.median())

scikit-learn

Machine Learning in Python

Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

What is Machine Learning?

Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data (thanks, Wikipedia).

Prediction with scikit-learn is easy - who will survive?

In [54]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc

In [55]:

# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'

Accuracy Score: 0.80
Not too bad

Cross-validation is a fairer performance estimate

In [56]:

from sklearn.cross_validation import KFold

In [57]:

cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
    clf = LogisticRegression()
    clf.fit(X.ix[train], y.ix[train])
    preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)

Accuracy Score: 0.79

And cross-validation can be done more easily

In [58]:

# scikit-learn can actually take care of this for us
from sklearn.cross_validation import cross_val_score

# here
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# to here

print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

[ 0.81564246  0.76966292  0.75842697  0.83707865  0.78651685]
Accuracy: 0.79 (+/- 0.06)
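
In later scikit-learn releases the `sklearn.cross_validation` module was replaced by `sklearn.model_selection`. A sketch with the newer import path, assuming a current scikit-learn; it runs on synthetic data from `make_classification` rather than the Titanic set, and the names `X_demo`, `y_demo` and `folds` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

# Shuffled 5-fold split; random_state makes the folds repeatable
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         cv=folds, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```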

Dealing with categorical data

In [59]:

df.Embarked.head()

Out[59]:

0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

In [60]:

set(df.Embarked.fillna('O'))

Out[60]:

{'C', 'O', 'Q', 'S'}

Use the LabelEncoder

In [61]:

from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_

Out[61]:

array(['C', 'O', 'Q', 'S'], dtype=object)

In [62]:

X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()

Out[62]:

0    3
1    0
2    3
3    3
4    3
Name: Embarked, dtype: int64
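
`LabelEncoder` assigns integer codes, which a linear model will read as an ordering (C < O < Q < S). One common alternative, not used in this talk, is one-hot encoding with `pd.get_dummies`. A sketch on a made-up column:

```python
import pandas as pd

ports = pd.Series(['S', 'C', None, 'Q'], name='Embarked').fillna('O')

# One indicator column per category, with no implied order
onehot = pd.get_dummies(ports, prefix='Embarked')
print(onehot.columns.tolist())
```

Each row has exactly one indicator set, so the columns could be joined onto `X` in place of the single integer-coded column.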

Tuning classifier parameters

In [63]:

for C in [0.001, 0.01, 0.1, 1, 10, 100]:
    clf = LogisticRegression(C=C, penalty='l1')
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')    
    print("C: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
          .format(C, scores.mean(), scores.std() * 2))

C: 0.001	Accuracy: 0.67 (+/- 0.03)
C: 0.010	Accuracy: 0.67 (+/- 0.03)
C: 0.100	Accuracy: 0.79 (+/- 0.05)
C: 1.000	Accuracy: 0.80 (+/- 0.06)
C: 10.000	Accuracy: 0.79 (+/- 0.04)
C: 100.000	Accuracy: 0.79 (+/- 0.04)
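
The same sweep over the regularisation strength `C` can be handed to `GridSearchCV`. A sketch assuming a current scikit-learn and synthetic data; unlike the loop above it uses the default L2 penalty, and `max_iter=1000` is an arbitrary choice to give the solver room to converge:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid,
                      cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)   # cross-validates every value of C
print(search.best_params_, round(search.best_score_, 2))
```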

Comparing classifiers is easy

In [64]:

# normalise the data
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
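
One caveat: scaling all of `X` before cross-validating lets the scaler see the rows that later serve as test folds. A sketch of how a pipeline refits the scaler inside every training fold, using synthetic data and a current scikit-learn (the names `X_demo`, `y_demo` and `model` are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

# The scaler is fitted on the training part of each fold only
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```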

In [65]:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
         "Random Forest", "AdaBoost", "Naive Bayes", "LDA",
         "QDA", "Logistic Regression"]
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GaussianNB(),
    LDA(),
    QDA(),
    LogisticRegression(class_weight='auto')]

In [66]:

# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    res.append(scores.mean())

In [67]:

import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
        xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)

Models can be pickled

In [69]:

# models can be saved
import pickle
s = pickle.dumps(clf)
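
A sketch of the full round trip, assuming a current scikit-learn and synthetic data: the pickled bytes are loaded back and the restored model makes the same predictions as the original. The names `fitted`, `blob` and `restored` are illustrative:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, n_features=4,
                                     random_state=0)
fitted = LogisticRegression().fit(X_demo, y_demo)

blob = pickle.dumps(fitted)     # serialise the fitted model to bytes
restored = pickle.loads(blob)   # rebuild it from those bytes

same = bool((restored.predict(X_demo) == fitted.predict(X_demo)).all())
print(same)
```

Only unpickle data from a source you trust, and ideally with the same scikit-learn version that produced it.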

    

And there is a whole lot more scikit-learn can do...

supervised learning
model evaluation
unsupervised learning
feature selection
feature extraction

by Andreas Mueller
