pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Creates somthing similar to R DataFrames.. but better
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
import numpy as np
import pandas as pd
import pylab as plt
import matplotlib
%matplotlib inline
pd.__version__
'0.13.1'
pandas
has two main data structures: Series
and DataFrame
¶values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals
0 5 1 3 2 4 3 8 4 2 5 9 dtype: int64
Each value is now associated with an index. The index itself is an object of class Index
and can be manipulated directly.
vals.index
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
vals.values
array([5, 3, 4, 8, 2, 9])
vals * 2.5
0 12.5 1 7.5 2 10.0 3 20.0 4 5.0 5 22.5 dtype: float64
We can give named indexes
vals2 = pd.Series(values, index=['tom','sally','jeff','george','pablo','florence'])
vals2
tom 5 sally 3 jeff 4 george 8 pablo 2 florence 9 dtype: int64
And use these to get the data we want
vals2[['florence','tom']]
florence 9 tom 5 dtype: int64
vals2[['florence','tom','kate']]
florence 9 tom 5 kate NaN dtype: float64
Dealing with missing values
vals3 = vals2[['tom','sally','pablo','florence','ricky','katrin']]
vals3
tom 5 sally 3 pablo 2 florence 9 ricky NaN katrin NaN dtype: float64
Get rid of them
vals3.dropna()
tom 5 sally 3 pablo 2 florence 9 dtype: float64
Fill them with a value
vals3.fillna(0)
tom 5 sally 3 pablo 2 florence 9 ricky 0 katrin 0 dtype: float64
Fill them with a calculated value
vals3.fillna(vals3.mean())
tom 5.00 sally 3.00 pablo 2.00 florence 9.00 ricky 4.75 katrin 4.75 dtype: float64
Use a function like forward fill
vals3.fillna(method='ffill')
tom 5 sally 3 pablo 2 florence 9 ricky 9 katrin 9 dtype: float64
A handy way to get a picture of our data
vals3.describe()
count 4.000000 mean 4.750000 std 3.095696 min 2.000000 25% 2.750000 50% 4.000000 75% 6.000000 max 9.000000 dtype: float64
vals.index=pd.Index(['tom','sally','pablo','florence','ricky','katrin'])
vals3=vals3[['tom','sally','pablo','florence','billy','katrin']]
# create a dataframe
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat
new | orig | |
---|---|---|
billy | NaN | NaN |
florence | 9 | 8 |
katrin | NaN | 9 |
pablo | 2 | 4 |
ricky | NaN | 2 |
sally | 3 | 3 |
tom | 5 | 5 |
7 rows × 2 columns
Check for nulls
dat.isnull()
new | orig | |
---|---|---|
billy | True | True |
florence | False | False |
katrin | True | False |
pablo | False | False |
ricky | True | False |
sally | False | False |
tom | False | False |
7 rows × 2 columns
Drop rows with nulls
dat.dropna()
new | orig | |
---|---|---|
florence | 9 | 8 |
pablo | 2 | 4 |
sally | 3 | 3 |
tom | 5 | 5 |
4 rows × 2 columns
Data from google trends.. what correlates (+ve & -ve) with the search term Hipster
Pandas supports many file formats for read and write including
hipster = pd.read_csv('hipster.csv')
hipster[:10]
Date | hipster | modcloth | gumtree perth | |
---|---|---|---|---|
0 | 2004-01-04 | -0.976 | -0.817 | -0.844 |
1 | 2004-01-11 | -0.816 | -0.817 | -0.844 |
2 | 2004-01-18 | -0.837 | -0.817 | -0.844 |
3 | 2004-01-25 | -0.976 | -0.817 | -0.844 |
4 | 2004-02-01 | -0.722 | -0.817 | -0.844 |
5 | 2004-02-08 | -0.795 | -0.817 | -0.844 |
6 | 2004-02-15 | -0.723 | -0.817 | -0.844 |
7 | 2004-02-22 | -0.713 | -0.817 | -0.844 |
8 | 2004-02-29 | -0.786 | -0.817 | -0.844 |
9 | 2004-03-07 | -0.675 | -0.817 | -0.844 |
10 rows × 4 columns
Set the index to a datetime
hipster = hipster.set_index(pd.DatetimeIndex(hipster.pop('Date')))
hipster[:10]
hipster | modcloth | gumtree perth | |
---|---|---|---|
2004-01-04 | -0.976 | -0.817 | -0.844 |
2004-01-11 | -0.816 | -0.817 | -0.844 |
2004-01-18 | -0.837 | -0.817 | -0.844 |
2004-01-25 | -0.976 | -0.817 | -0.844 |
2004-02-01 | -0.722 | -0.817 | -0.844 |
2004-02-08 | -0.795 | -0.817 | -0.844 |
2004-02-15 | -0.723 | -0.817 | -0.844 |
2004-02-22 | -0.713 | -0.817 | -0.844 |
2004-02-29 | -0.786 | -0.817 | -0.844 |
2004-03-07 | -0.675 | -0.817 | -0.844 |
10 rows × 3 columns
Now load the anti-Hipster data
not_hipster = pd.read_csv('negative-hipster.csv')
not_hipster = not_hipster.set_index(pd.DatetimeIndex(not_hipster.pop('Date')))
not_hipster[:10]
yellow pages | windows installer | techno | |
---|---|---|---|
2004-01-04 | 1.341 | 0.668 | 0.871 |
2004-01-11 | 1.239 | 1.000 | 1.122 |
2004-01-18 | 1.022 | 0.768 | 1.053 |
2004-01-25 | 0.923 | 0.943 | 0.807 |
2004-02-01 | 0.904 | 0.799 | 0.612 |
2004-02-08 | 0.786 | 0.613 | 0.614 |
2004-02-15 | 0.729 | 0.956 | 0.391 |
2004-02-22 | 0.537 | 0.667 | 1.124 |
2004-02-29 | 0.534 | 1.415 | 1.078 |
2004-03-07 | 0.229 | 0.220 | 1.918 |
10 rows × 3 columns
Check the values of one column
hipster.hipster.head()
2004-01-04 -0.976 2004-01-11 -0.816 2004-01-18 -0.837 2004-01-25 -0.976 2004-02-01 -0.722 Name: hipster, dtype: float64
Check another, but get them as an numpy.ndarray
hipster['gumtree perth'].values[:20]
array([-0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844, -0.844])
View the data types, they don't need to be homogenous
hipster.dtypes
hipster float64 modcloth float64 gumtree perth float64 dtype: object
Joins on indexes are easy!
trend = hipster.join(not_hipster, how='inner')
trend.head()
hipster | modcloth | gumtree perth | yellow pages | windows installer | techno | |
---|---|---|---|---|---|---|
2004-01-04 | -0.976 | -0.817 | -0.844 | 1.341 | 0.668 | 0.871 |
2004-01-11 | -0.816 | -0.817 | -0.844 | 1.239 | 1.000 | 1.122 |
2004-01-18 | -0.837 | -0.817 | -0.844 | 1.022 | 0.768 | 1.053 |
2004-01-25 | -0.976 | -0.817 | -0.844 | 0.923 | 0.943 | 0.807 |
2004-02-01 | -0.722 | -0.817 | -0.844 | 0.904 | 0.799 | 0.612 |
5 rows × 6 columns
We can check the column names and values
trend.columns
Index([u'hipster', u'modcloth', u'gumtree perth', u'yellow pages', u'windows installer', u'techno'], dtype='object')
trend.values
array([[-0.976, -0.817, -0.844, 1.341, 0.668, 0.871], [-0.816, -0.817, -0.844, 1.239, 1. , 1.122], [-0.837, -0.817, -0.844, 1.022, 0.768, 1.053], ..., [ 1.142, 1.175, 1.394, -1.69 , -1.77 , -1.836], [ 1.187, 1.221, 1.403, -1.706, -1.752, -1.796], [ 1.514, 1.216, 1.365, -1.72 , -1.701, -1.883]])
Filtering on date ranges is simple
trend['2012-01-01':].head()
hipster | modcloth | gumtree perth | yellow pages | windows installer | techno | |
---|---|---|---|---|---|---|
2012-01-01 | 1.411 | 1.192 | 1.774 | -1.077 | -1.134 | -1.285 |
2012-01-08 | 1.513 | 1.111 | 1.579 | -0.995 | -1.183 | -1.189 |
2012-01-15 | 1.523 | 1.427 | 1.613 | -1.027 | -1.161 | -1.337 |
2012-01-22 | 1.600 | 1.490 | 1.514 | -1.140 | -1.177 | -1.345 |
2012-01-29 | 1.459 | 1.561 | 1.511 | -1.046 | -1.224 | -1.233 |
5 rows × 6 columns
trend['2012-01-01': '2013-01-01'].tail(3)
hipster | modcloth | gumtree perth | yellow pages | windows installer | techno | |
---|---|---|---|---|---|---|
2012-12-16 | 1.645 | 1.175 | 1.407 | -1.433 | -1.515 | -1.687 |
2012-12-23 | 1.591 | 1.695 | 1.625 | -1.698 | -1.655 | -1.504 |
2012-12-30 | 1.596 | 1.515 | 1.868 | -1.515 | -1.598 | -1.674 |
3 rows × 6 columns
We can also grab a single date, or a subset of columns
trend.ix['2012-01-01', ['hipster', 'modcloth']]
hipster 1.411 modcloth 1.192 Name: 2012-01-01 00:00:00, dtype: float64
Or do some boolean filtering
trend[trend.techno < 0].head()
hipster | modcloth | gumtree perth | yellow pages | windows installer | techno | |
---|---|---|---|---|---|---|
2004-04-11 | -0.510 | -0.817 | -0.844 | 0.521 | 0.301 | -0.270 |
2006-01-29 | -0.838 | -0.817 | -0.844 | 1.421 | 1.309 | -0.081 |
2006-06-25 | -0.799 | -0.817 | -0.833 | 1.142 | 1.458 | -0.070 |
2010-01-24 | -0.454 | -0.183 | -0.107 | -0.017 | 0.053 | -0.010 |
2010-01-31 | -0.381 | -0.276 | -0.142 | 0.187 | 0.116 | -0.044 |
5 rows × 6 columns
Plotting is built in and easier for dates than matplotlib
_ = trend.plot(figsize=(10, 6))
_ = plt.legend(loc='best', ncol=2)
We can also do it for a single column
_ = trend.hipster.cumsum().plot()
Or split the columns out to subplots
axs = trend.plot(subplots=True, figsize=(10, 10))
Resampling data is also straight forward.
# resample by month
trend.resample('M', how='mean').head()
hipster | modcloth | gumtree perth | yellow pages | windows installer | techno | |
---|---|---|---|---|---|---|
2004-01-31 | -0.90125 | -0.817 | -0.844 | 1.13125 | 0.84475 | 0.96325 |
2004-02-29 | -0.74780 | -0.817 | -0.844 | 0.69800 | 0.89000 | 0.76380 |
2004-03-31 | -0.78950 | -0.817 | -0.844 | 0.35650 | 0.73125 | 1.09175 |
2004-04-30 | -0.70400 | -0.817 | -0.844 | 0.48125 | 0.89125 | 0.41950 |
2004-05-31 | -0.81820 | -0.817 | -0.844 | 0.34780 | 0.62040 | 0.72860 |
5 rows × 6 columns
and Here by year, but one can do business day, week, month, quarter annual and a bunch of others
# resample by year
_ = trend.resample('A', how='mean').plot(figsize=(10, 10))
Other fancy plots include a scatter matrix including a kernel density estimation (KDE)
# look at the relations
_ = pd.scatter_matrix(trend, figsize=(12,8), diagonal='kde')
Load the data, explore it and learn from it
df = pd.read_csv('train.csv', header=0)
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 rows × 12 columns
Lets look at the data types here (this time they're heterogeneous)
df.dtypes
PassengerId int64 Survived int64 Pclass int64 Name object Sex object Age float64 SibSp int64 Parch int64 Ticket object Fare float64 Cabin object Embarked object dtype: object
We can also get a more verbose summary
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5)
DataFrames can be grouped, like in SQL (it sucked to be a young male on the titanic)
df_grouped = df.groupby(['Pclass', 'Sex'])
df_grouped[['Age', 'Survived']].mean()
Age | Survived | ||
---|---|---|---|
Pclass | Sex | ||
1 | female | 34.611765 | 0.968085 |
male | 41.281386 | 0.368852 | |
2 | female | 28.722973 | 0.921053 |
male | 30.740707 | 0.157407 | |
3 | female | 21.750000 | 0.500000 |
male | 26.507589 | 0.135447 |
6 rows × 2 columns
Histograms are straightforward
ax = df['Age'].dropna().hist(bins=20, range=(0,100), alpha = .5)
ax.set_xlabel('Age')
ax.set_ylabel('Passenger Count')
<matplotlib.text.Text at 0x7576ed0>
So are boxplots
bp = df.boxplot(column='Age', by='Pclass', grid=False)
for i in set(df.Pclass):
y = df.Age[df.Pclass==i].dropna()
# Add some random "jitter" to the x-axis
x = np.random.normal(i, 0.04, size=len(y))
plt.plot(x, y, 'r.', alpha=0.2)
If we want to do some learning on this data.. lets convert gender to a binary numeric
df['isFemale'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
df[['Sex','isFemale']].head()
Sex | isFemale | |
---|---|---|
0 | male | 0 |
1 | female | 1 |
2 | female | 1 |
3 | female | 1 |
4 | male | 0 |
5 rows × 2 columns
Find non-numeric columns so we can drop them later
drop_cols = df.columns[df.dtypes.map(lambda x: x=='object')]
drop_cols
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 891 entries, 0 to 890 Data columns (total 13 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object isFemale 891 non-null int64 dtypes: float64(2), int64(6), object(5)
Setup our data to learn from
X = pd.DataFrame(df[[c for c in df.columns if c != 'Survived']])
X = X.drop(drop_cols, axis=1)
X = X.drop('PassengerId', axis=1)
y = df.Survived
print X.head()
Pclass Age SibSp Parch Fare isFemale 0 3 22 1 0 7.2500 0 1 1 38 1 0 71.2833 1 2 3 26 0 0 7.9250 1 3 1 35 1 0 53.1000 1 4 3 35 0 0 8.0500 0 [5 rows x 6 columns]
Have a quick look at the class distribution
y.groupby(y.values).count()
0 549 1 342 dtype: int64
and fill in some NaNs for age
X['Age'] = X.Age.fillna(X.Age.median())
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data thanks wikipedia
Prediction with scikit-learn is easy - who will survive?
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as acc
# create our classifier
clf = LogisticRegression()
# fit it to the data
clf.fit(X, y)
# and predict
preds = clf.predict(X)
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
print 'Not too bad'
Accuracy Score: 0.80 Not too bad
from sklearn.cross_validation import KFold
cv = KFold(n=len(y), n_folds=5, shuffle=True)
preds = np.zeros_like(y)
for train, test in cv:
clf = LogisticRegression()
clf.fit(X.ix[train], y.ix[train])
preds[test] = clf.predict(X.ix[test])
res_acc = acc(y, preds)
print 'Accuracy Score: {:.2f}'.format(res_acc)
Accuracy Score: 0.79
And cross-validation can be done more easily
# scikits can actually take care of this for us
from sklearn.cross_validation import cross_val_score
# here
clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
# to here
print scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
[ 0.81564246 0.76966292 0.75842697 0.83707865 0.78651685] Accuracy: 0.79 (+/- 0.06)
df.Embarked.head()
0 S 1 C 2 S 3 S 4 S Name: Embarked, dtype: object
set(df.Embarked.fillna('O'))
{'C', 'O', 'Q', 'S'}
Use the LabelEncoder
from sklearn import preprocessing
df.Embarked = df.Embarked.fillna('O')
le = preprocessing.LabelEncoder()
le.fit(df.Embarked.values)
le.classes_
array(['C', 'O', 'Q', 'S'], dtype=object)
X['Embarked'] = le.transform(df.Embarked.values)
X.Embarked.head()
0 3 1 0 2 3 3 3 4 3 Name: Embarked, dtype: int64
for C in [0.001, 0.01, 0.1, 1, 10, 100]:
clf = LogisticRegression(C=C, penalty='l1')
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print("n_estimators: {:3.3f}\tAccuracy: {:.2f} (+/- {:.2f})"
.format(C, scores.mean(), scores.std() * 2))
n_estimators: 0.001 Accuracy: 0.67 (+/- 0.03) n_estimators: 0.010 Accuracy: 0.67 (+/- 0.03) n_estimators: 0.100 Accuracy: 0.79 (+/- 0.05) n_estimators: 1.000 Accuracy: 0.80 (+/- 0.06) n_estimators: 10.000 Accuracy: 0.79 (+/- 0.04) n_estimators: 100.000 Accuracy: 0.79 (+/- 0.04)
# normalise the data
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.lda import LDA
from sklearn.qda import QDA
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Decision Tree",
"Random Forest", "AdaBoost", "Naive Bayes", "LDA",
"QDA", "Logistic Regression"]
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear", C=0.025),
SVC(gamma=2, C=1),
DecisionTreeClassifier(),
RandomForestClassifier(),
AdaBoostClassifier(),
GaussianNB(),
LDA(),
QDA(),
LogisticRegression(class_weight='auto')]
# fit each classifier and find the mean performance
res = []
for name, clf in zip(names, classifiers):
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
res.append(scores.mean())
import prettyplotlib as ppl
res = np.array(res)
names = np.array(names)
idx = np.argsort(res)[::-1]
fig, ax = plt.subplots(1, figsize=(14, 6))
ppl.bar(ax, np.arange(len(res)), res[idx], annotate=True,
xticklabels=names[idx], grid='y')
plt.xticks(rotation=30)
_ = ax.set_ylim(res.min() * 0.95, res.max() * 1.05)
Models can be pickled
# models can be saved
import pickle
s = pickle.dumps(clf)
by Andreas Mueller