Python has a scientific computing ecosystem that has existed since the mid-90s.
This ecosystem was originally meant to emulate Matlab, and is geared toward numerical data.
Python also has a mature natural language processing library, NLTK (Natural Language Toolkit), for symbolic and statistical natural language processing.
This ecosystem provided tools for data munging and analysis, but didn't necessarily make them easy to use. Missing pieces included:

- A container for heterogeneous data types (a la data.frame in R) that can be easily manipulated
- Easy handling of missing data
- Easy data munging capabilities
- Exploratory data analysis and summaries
- Statistical modeling and machine learning
pandas (the Python data analysis library) was first released in 2008. The current version is 0.12, released in July 2013.
> R makes users forget how fast a computer really is.
>
> — John Myles White, SPDC, October 2013
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
import numpy as np
import pandas as pd
pd.__version__
'0.12.0'
pandas has two main data structures: Series and DataFrame.
values = [5,3,4,8,2,9]
vals = pd.Series(values)
vals
0    5
1    3
2    4
3    8
4    2
5    9
dtype: int64
Each value is now associated with an index. The index itself is an object of class Index
and can be manipulated directly.
vals.index
Int64Index([0, 1, 2, 3, 4, 5], dtype=int64)
vals.values
array([5, 3, 4, 8, 2, 9])
vals2 = pd.Series(values, index=['a','b','c','d','e','f'])
vals2
a    5
b    3
c    4
d    8
e    2
f    9
dtype: int64
vals2[['b','d']]
b    3
d    8
dtype: int64
vals2[['e','f','g']]
e     2
f     9
g   NaN
dtype: float64
vals3 = vals2[['a','b','c','f','g','h']]
vals3
a     5
b     3
c     4
f     9
g   NaN
h   NaN
dtype: float64
vals3.isnull()
a    False
b    False
c    False
f    False
g     True
h     True
dtype: bool
vals3.dropna()
a    5
b    3
c    4
f    9
dtype: float64
vals3.fillna(0)
a    5
b    3
c    4
f    9
g    0
h    0
dtype: float64
vals3.fillna(vals3.mean())
a    5.00
b    3.00
c    4.00
f    9.00
g    5.25
h    5.25
dtype: float64
vals3.fillna(method='ffill')
a    5
b    3
c    4
f    9
g    9
h    9
dtype: float64
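Besides forward fill, pandas also supports backward fill, which pulls the next valid value backward instead. A minimal sketch on a made-up Series (not `vals3`), using the `ffill`/`bfill` method names from more recent pandas versions:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 3, np.nan, 9, np.nan], index=list('abcdf'))
print(s.ffill())  # 'c' takes the last valid value looking back (3)
print(s.bfill())  # 'c' takes the next valid value looking forward (9)
```

Note that a trailing NaN (label 'f' here) stays NaN under `bfill`, since there is no later valid value to pull from.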
vals3.describe()
count    4.000000
mean     5.250000
std      2.629956
min      3.000000
25%      3.750000
50%      4.500000
75%      6.000000
max      9.000000
dtype: float64
vals.index=pd.Index(['a','b','c','d','e','f'])
vals3=vals3[['a','c','d','e','z']]
dat = pd.DataFrame({'orig':vals,'new':vals3})
dat
  | new | orig
---|---|---
a | 5 | 5 |
b | NaN | 3 |
c | 4 | 4 |
d | NaN | 8 |
e | NaN | 2 |
f | NaN | 9 |
z | NaN | NaN |
dat.fillna(1)
  | new | orig
---|---|---
a | 5 | 5 |
b | 1 | 3 |
c | 4 | 4 |
d | 1 | 8 |
e | 1 | 2 |
f | 1 | 9 |
z | 1 | 1 |
cars = pd.read_csv('mtcars.csv')
cars[:10]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
5 | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
6 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
8 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
9 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
cars.cyl
0     6
1     6
2     4
3     6
4     8
5     6
6     8
7     4
8     4
9     6
10    6
11    8
12    8
13    8
14    8
15    8
16    8
17    4
18    4
19    4
20    4
21    8
22    8
23    8
24    8
25    4
26    4
27    4
28    8
29    6
30    8
31    4
Name: cyl, dtype: int64
cars.ix[[5,7]]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
5 | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.00 | 1 | 0 | 4 | 2 |
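`.ix` mixes label-based and position-based lookup; later pandas versions removed it in favor of the explicit `.loc` (labels) and `.iloc` (integer positions). A sketch on a toy frame, assuming a recent pandas:

```python
import pandas as pd

# index labels deliberately differ from positions to show the distinction
df = pd.DataFrame({'mpg': [21.0, 22.8, 18.7]}, index=[0, 2, 4])
print(df.loc[[2, 4]])    # select by index label
print(df.iloc[[0, 1]])   # select by integer position
```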
cars['kmpg']=cars['mpg']*1.6
cars[:4]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | kmpg
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | 33.60 |
1 | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | 33.60 |
2 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 36.48 |
3 | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 34.24 |
del cars['kmpg']
cars[:4]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
cars.mpg[:10]
0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
5    18.1
6    14.3
7    24.4
8    22.8
9    19.2
Name: mpg, dtype: float64
cars3=cars.stack()
cars3[:20]
0  mpg      21.000
   cyl       6.000
   disp    160.000
   hp      110.000
   drat      3.900
   wt        2.620
   qsec     16.460
   vs        0.000
   am        1.000
   gear      4.000
   carb      4.000
1  mpg      21.000
   cyl       6.000
   disp    160.000
   hp      110.000
   drat      3.900
   wt        2.875
   qsec     17.020
   vs        0.000
   am        1.000
dtype: float64
cars3.unstack()[:5]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
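`stack` pivots the columns into an inner index level, producing a Series with a (row, column) MultiIndex; `unstack` inverts it. A tiny round-trip sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
s = df.stack()                 # MultiIndex: (row label, column label) -> value
print(s[(0, 'b')])             # value at row 0, column 'b': 3
assert df.equals(s.unstack())  # the round trip restores the original frame
```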
grouped=cars.groupby(['cyl','gear','carb'])
grouped['mpg'].mean()
cyl  gear  carb
4    3     1       21.50
     4     1       29.10
           2       24.75
     5     2       28.20
6    3     1       19.75
     4     4       19.75
     5     6       19.70
8    3     2       17.15
           3       16.30
           4       12.62
     5     4       15.80
           8       15.00
Name: mpg, dtype: float64
stats = ['count','mean','median']
grouped.agg(stats)[['mpg','disp']]
cyl | gear | carb | mpg count | mpg mean | mpg median | disp count | disp mean | disp median
---|---|---|---|---|---|---|---|---
4 | 3 | 1 | 1 | 21.50 | 21.50 | 1 | 120.10 | 120.10
4 | 4 | 1 | 4 | 29.10 | 29.85 | 4 | 84.20 | 78.85
4 | 4 | 2 | 4 | 24.75 | 23.60 | 4 | 121.05 | 130.90
4 | 5 | 2 | 2 | 28.20 | 28.20 | 2 | 107.70 | 107.70
6 | 3 | 1 | 2 | 19.75 | 19.75 | 2 | 241.50 | 241.50
6 | 4 | 4 | 4 | 19.75 | 20.10 | 4 | 163.80 | 163.80
6 | 5 | 6 | 1 | 19.70 | 19.70 | 1 | 145.00 | 145.00
8 | 3 | 2 | 4 | 17.15 | 17.10 | 4 | 345.50 | 339.00
8 | 3 | 3 | 3 | 16.30 | 16.40 | 3 | 275.80 | 275.80
8 | 3 | 4 | 5 | 12.62 | 13.30 | 5 | 416.40 | 440.00
8 | 5 | 4 | 1 | 15.80 | 15.80 | 1 | 351.00 | 351.00
8 | 5 | 8 | 1 | 15.00 | 15.00 | 1 | 301.00 | 301.00
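The same group-then-aggregate pattern in a self-contained form (toy data, not mtcars):

```python
import pandas as pd

df = pd.DataFrame({'cyl': [4, 4, 6], 'mpg': [22.8, 21.5, 21.0]})
out = df.groupby('cyl')['mpg'].agg(['count', 'mean'])
print(out.loc[4, 'mean'])   # (22.8 + 21.5) / 2 = 22.15
```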
grouped.first()
cyl | gear | carb | mpg | disp | hp | drat | wt | qsec | vs | am
---|---|---|---|---|---|---|---|---|---|---
4 | 3 | 1 | 21.5 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0
4 | 4 | 1 | 22.8 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1
4 | 4 | 2 | 24.4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0
4 | 5 | 2 | 26.0 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1
6 | 3 | 1 | 21.4 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0
6 | 4 | 4 | 21.0 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1
6 | 5 | 6 | 19.7 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1
8 | 3 | 2 | 18.7 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0
8 | 3 | 3 | 16.4 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0
8 | 3 | 4 | 14.3 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0
8 | 5 | 4 | 15.8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1
8 | 5 | 8 | 15.0 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1
cars[cars.cyl==4]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
8 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
17 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
18 | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
19 | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
20 | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
25 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
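Boolean conditions can be combined with `&` and `|` (each condition parenthesized, since these are element-wise operators, not Python's `and`/`or`). A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({'cyl': [4, 6, 4, 8], 'mpg': [22.8, 21.0, 30.4, 14.3]})
sub = df[(df.cyl == 4) & (df.mpg > 25)]   # 4-cylinder cars above 25 mpg
print(len(sub))   # 1
```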
tips = pd.read_csv('tips.csv')
tips[:5]
  | total_bill | tip | sex | smoker | day | time | size
---|---|---|---|---|---|---|---
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
tips['tip_pct'] = tips['tip']/tips['total_bill']*100
groupedtips = tips.groupby(['sex','smoker'])
groupedtips['tip_pct'].agg('mean')
sex     smoker
Female  No        15.692097
        Yes       18.215035
Male    No        16.066872
        Yes       15.277118
Name: tip_pct, dtype: float64
result=tips.groupby('smoker')['tip_pct'].describe()
result.unstack('smoker')
smoker | No | Yes |
---|---|---|
count | 151.000000 | 93.000000 |
mean | 15.932846 | 16.319604 |
std | 3.990977 | 8.511920 |
min | 5.679667 | 3.563814 |
25% | 13.690561 | 10.677083 |
50% | 15.562472 | 15.384615 |
75% | 18.501403 | 19.505852 |
max | 29.198966 | 71.034483 |
states = ['Ohio','New York','Vermont','Oregon','Washington','Nevada']
group_key=['East']*3 + ['West']*3
data = pd.Series(np.random.randn(6), index=states)
data[['Vermont','Washington']]=np.nan
data
Ohio          0.479346
New York     -0.037588
Vermont            NaN
Oregon        0.943553
Washington         NaN
Nevada       -1.184667
dtype: float64
data.groupby(group_key).mean()
East    0.220879
West   -0.120557
dtype: float64
fill_mean=lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio          0.479346
New York     -0.037588
Vermont       0.220879
Oregon        0.943553
Washington   -0.120557
Nevada       -1.184667
dtype: float64
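The same group-wise imputation can also be written with `transform`, which returns one value per row and avoids the lambda. A sketch on toy data, assuming a recent pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan], index=list('abcd'))
key = ['East', 'East', 'West', 'West']
# transform('mean') broadcasts each group's mean back to the group's rows
filled = s.fillna(s.groupby(key).transform('mean'))
# 'b' gets the East mean (1.0), 'd' gets the West mean (3.0)
print(filled)
```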
tips.pivot_table(rows=['sex','smoker'])
sex | smoker | size | tip | tip_pct | total_bill
---|---|---|---|---|---
Female | No | 2.592593 | 2.773519 | 15.692097 | 18.105185
Female | Yes | 2.242424 | 2.931515 | 18.215035 | 17.977879
Male | No | 2.711340 | 3.113402 | 16.066872 | 19.791237
Male | Yes | 2.500000 | 3.051167 | 15.277118 | 22.284500
tips.pivot_table(['tip_pct','size'],rows=['sex','day'], cols='smoker')
sex | day | tip_pct No | tip_pct Yes | size No | size Yes
---|---|---|---|---|---
Female | Fri | 16.529591 | 20.912912 | 2.500000 | 2.000000
Female | Sat | 14.799347 | 16.381673 | 2.307692 | 2.200000
Female | Sun | 16.570992 | 23.707473 | 3.071429 | 2.500000
Female | Thur | 15.597147 | 16.307257 | 2.480000 | 2.428571
Male | Fri | 13.800498 | 14.473025 | 2.000000 | 2.125000
Male | Sat | 16.213217 | 13.906681 | 2.656250 | 2.629630
Male | Sun | 15.829067 | 17.396381 | 2.883721 | 2.600000
Male | Thur | 16.570635 | 16.441676 | 2.500000 | 2.300000
statsmodels provides classical statistical modeling. Methods provided include linear regression (OLS, WLS), generalized linear models, robust regression, and time series models.
import statsmodels.api as sm
import statsmodels.formula.api as smf
cars[:5]
  | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb
---|---|---|---|---|---|---|---|---|---|---|---
0 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
mod1 = smf.ols('mpg~disp+hp+C(cyl)-1', data=cars) # change to category
mod1.fit().summary()
Dep. Variable: | mpg | R-squared: | 0.984 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.981 |
Method: | Least Squares | F-statistic: | 331.4 |
Date: | Wed, 06 Nov 2013 | Prob (F-statistic): | 2.44e-23 |
Time: | 00:57:11 | Log-Likelihood: | -76.620 |
No. Observations: | 32 | AIC: | 163.2 |
Df Residuals: | 27 | BIC: | 170.6 |
Df Model: | 5 |
  | coef | std err | t | P>\|t\| | [95.0% Conf. Int.]
---|---|---|---|---|---
C(cyl)[4] | 31.1477 | 1.767 | 17.626 | 0.000 | 27.522 34.774 |
C(cyl)[6] | 27.1005 | 2.703 | 10.025 | 0.000 | 21.554 32.647 |
C(cyl)[8] | 28.7158 | 4.600 | 6.243 | 0.000 | 19.278 38.154 |
disp | -0.0260 | 0.010 | -2.499 | 0.019 | -0.047 -0.005 |
hp | -0.0211 | 0.014 | -1.490 | 0.148 | -0.050 0.008 |
Omnibus: | 1.696 | Durbin-Watson: | 1.653 |
---|---|---|---|
Prob(Omnibus): | 0.428 | Jarque-Bera (JB): | 1.579 |
Skew: | 0.469 | Prob(JB): | 0.454 |
Kurtosis: | 2.449 | Cond. No. | 3.25e+03 |
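Under the hood this is ordinary least squares, which can be reproduced with NumPy alone by solving the least-squares problem on a hand-built design matrix. A sketch on made-up data (not mtcars):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
X = np.column_stack([np.ones_like(x), x])      # intercept column + predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # beta[0]=intercept, beta[1]=slope
print(round(beta[1], 2))   # 1.94
```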
import sklearn as learn
from sklearn.ensemble import RandomForestRegressor
X = cars.values[:,1:]
y = cars.values[:,0]
rf = RandomForestRegressor(n_estimators=100)
rf = rf.fit(X, y)
ypred=rf.predict(X)
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.plot(y,ypred,'.')
[<matplotlib.lines.Line2D at 0x108769950>]
import seaborn as sns
sns.set(palette="Purples_r")
mpl.rc("figure", figsize=(5, 5))
d = pd.DataFrame({'y':y,'ypred':ypred})
sns.lmplot('y','ypred',d)
sns.lmplot("total_bill","tip",tips, col="sex",color="time")