Statistical Computing Languages, RSS

Python
```python
# rerun Fernando's script to load in css
%run talktools
```
Python has the style of a scripting language (cf. `perl`, `bash`).
```python
# Open my web page and look for links (modified by code on Guido's blog)
import re
import urllib.request

regex = re.compile(r'href="([^"]+\.html)"')

def matcher(url):
    "Return the URL references found in a given url."
    data = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    return regex.findall(data)

hits = matcher("http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/index.html")
for hit in hits:
    print(hit)
```
/people/N.Lawrence/inaugural.html /people/N.Lawrence/lectureNotes.html /people/N.Lawrence/research.html http://ml.dcs.shef.ac.uk/sitran/talks.html /people/N.Lawrence/software.html /people/N.Lawrence/miscellany.html /people/N.Lawrence/notes.html ./thematic08.html kappenball.html inaugural.html http://ml.dcs.shef.ac.uk/gpss/past_meetings.html ./inaugural.html http://maths.dept.shef.ac.uk/maths/staff_info_452.html http://staffwww.dcs.sheffield.ac.uk/people/N.Fusi/index.html http://staffwww.dcs.sheffield.ac.uk/people/A.Damianou/index.html ./3PhaseData.html ./inaugural.html ./com4509_6509_2012.html ../A.Cowling/campus_only/teach/com3310.html http://ttic.uchicago.edu/~rurtasun/tutorials/GP_tutorial.html ./mlss2012.html http://www.kyb.mpg.de/nc/employee/details/bs.html ./matweave.html ./thematic.html ./licsb.html ./com4509_6509_2011.html campus_only/lectureNotes.html ./interspeech_tutorial.html http://www.sheffield.ac.uk/dcs/postgrad/resdegrees/newphdhome.html ./icml_tutorial.html ./thematic08.html ./thematic.html http://www.cs.ucl.ac.uk/people/D.Barber.html http://www.well.ox.ac.uk/~mtitsias/publications.html ./chi.html
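Since the page above may move or change, the regex itself can be exercised offline on an inline snippet (the HTML string below is invented for illustration):

```python
import re

# the same pattern as above: capture href targets ending in .html
regex = re.compile(r'href="([^"]+\.html)"')

# a made-up HTML fragment, just for illustration
snippet = ('<a href="./notes.html">notes</a> '
           '<a href="research.html">research</a> '
           '<a href="style.css">css</a>')

hits = regex.findall(snippet)
print(hits)  # ['./notes.html', 'research.html']
```

Only the `.html` targets match; the stylesheet link is ignored.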
It also has object orientation (cf. `S`, `MATLAB`, `java`).

```python
import pandas as pd
import numpy as np

class my_data(pd.DataFrame):
    def pronounce(self):
        print("Extending existing objects is easy, this encourages code reuse and good programming practice.")

a = my_data(np.zeros((3, 3)))
a.pronounce()
a.describe()
```
Extending existing objects is easy, this encourages code reuse and good programming practice.
|       | 0 | 1 | 2 |
|-------|---|---|---|
| count | 3 | 3 | 3 |
| mean  | 0 | 0 | 0 |
| std   | 0 | 0 | 0 |
| min   | 0 | 0 | 0 |
| 25%   | 0 | 0 | 0 |
| 50%   | 0 | 0 | 0 |
| 75%   | 0 | 0 | 0 |
| max   | 0 | 0 | 0 |
The scientific Python stack gives facilities comparable to `MATLAB`:

- `pandas`: data analysis in Python (cf. `R`). Proving very popular for data analysis.
- The `scipy` stack captures a range of scientific computing facilities (FFT, linear algebra etc.).
- `scikit-learn` encodes models from a machine learning perspective.
- `statsmodels` encodes models from a statistics perspective.

Among its competitors (`MATLAB`, `R`, `Julia` ...), Python is permissively licensed, and many groups are moving their `MATLAB` code base to Python (unlike `R` and `octave`, which are GPL licensed).

We use `pandas` to load in the data.

```python
import pandas as pd  # first we import the pandas library

df = pd.read_csv('../R/data/regression.csv')  # now we read in the data
# gender is stored as male/female; we convert it to 0/1
df['Gender'] = df['Sex'].map({'Female': 0, 'Male': 1}).astype(int)
df.describe()
```
|       | OI | Age | Gender |
|-------|----|-----|--------|
| count | 101.00000 | 101.000000 | 101.000000 |
| mean  | 6.18495 | 48.940594 | 0.198020 |
| std   | 3.17320 | 14.413759 | 0.400495 |
| min   | 0.98000 | -10.000000 | 0.000000 |
| 25%   | 3.75000 | 39.000000 | 0.000000 |
| 50%   | 5.65000 | 52.000000 | 0.000000 |
| 75%   | 7.90000 | 60.000000 | 0.000000 |
| max   | 16.50000 | 73.000000 | 1.000000 |
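The `map`-then-`astype` recoding idiom used above can be checked on a toy frame (the data here is invented):

```python
import pandas as pd

# a tiny invented frame with the same Sex column layout
toy = pd.DataFrame({'Sex': ['Female', 'Male', 'Female']})
toy['Gender'] = toy['Sex'].map({'Female': 0, 'Male': 1}).astype(int)
print(toy['Gender'].tolist())  # [0, 1, 0]
```

Any value missing from the mapping dictionary would become `NaN`, which is why the explicit `astype(int)` is a useful sanity check.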
We can plot the data with the `matplotlib` library.

```python
# import a plotting library: matplotlib
import matplotlib.pyplot as plt
# ask for plots to appear in the notebook
%matplotlib inline

fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(df.Age, df.OI)
ax.set_xlabel('Age')
ax.set_ylabel('OI')
```
One age is recorded as negative, which is clearly erroneous. `pandas` allows for removal by indexing according to 'truth values'.

```python
df = df[df['Age'] > 0]

fig, ax = plt.subplots(figsize=(10, 7))
for symbol, sex in zip(['bo', 'ro'], ['Male', 'Female']):
    ax.semilogy(df.Age[df.Sex == sex], df.OI[df.Sex == sex],
                symbol, linewidth=3, markersize=10)
ax.set_xlabel('Age').set_fontsize(18)
ax.set_ylabel('OI').set_fontsize(18)
ax.set_title('Plot of age against OI on a log scale').set_fontsize(20)
```
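The truth-value indexing used to drop the bad row behaves like a filter; a minimal sketch on invented data:

```python
import pandas as pd

# invented data with one clearly erroneous negative age
toy = pd.DataFrame({'Age': [25, -10, 52], 'OI': [3.2, 5.1, 7.4]})
clean = toy[toy['Age'] > 0]  # keep only rows with a plausible age
print(len(clean))  # 2
```

The boolean Series `toy['Age'] > 0` selects rows, so the erroneous row is dropped while the original index of the survivors is preserved.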
```python
import numpy as np
import statsmodels.api as sm

df['Eins'] = 1  # column of ones for the intercept
Y = np.log(df.OI)
Y.name = 'np.log(OI)'
X = df[['Age', 'Gender', 'Eins']]
result = sm.OLS(Y, X).fit()
result.summary()

# import statsmodels.graphics.regressionplots as rp
# rp.plot_fit(result, 1)
```
| Dep. Variable: | np.log(OI) | R-squared: | 0.262 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.247 |
| Method: | Least Squares | F-statistic: | 17.23 |
| Date: | Sun, 23 Nov 2014 | Prob (F-statistic): | 3.96e-07 |
| Time: | 12:01:08 | Log-Likelihood: | -61.671 |
| No. Observations: | 100 | AIC: | 129.3 |
| Df Residuals: | 97 | BIC: | 137.2 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Age | 0.0162 | 0.004 | 4.603 | 0.000 | 0.009 0.023 |
| Gender | 0.3189 | 0.116 | 2.757 | 0.007 | 0.089 0.548 |
| Eins | 0.8292 | 0.178 | 4.666 | 0.000 | 0.477 1.182 |
Omnibus: | 3.016 | Durbin-Watson: | 2.007 |
---|---|---|---|
Prob(Omnibus): | 0.221 | Jarque-Bera (JB): | 2.772 |
Skew: | -0.408 | Prob(JB): | 0.250 |
Kurtosis: | 2.975 | Cond. No. | 200. |
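Note the role of the `Eins` column: `sm.OLS` does not add an intercept by itself, so the column of ones stands in for it. A numpy-only sketch (with invented, noise-free data) of why that works:

```python
import numpy as np

# invented noise-free data: intercept 2.0, slope 0.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x

# appending a column of ones (the role of 'Eins') lets least
# squares estimate the intercept alongside the slope
X = np.column_stack([x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [0.5, 2.0]
```

Without the column of ones the fitted line would be forced through the origin.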
`statsmodels` seems designed for those with experience of statistics and `R`.
```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit regression model (using the natural log for the dependent variable)
results = smf.ols('np.log(OI) ~ Age + Sex', data=df).fit()
# Inspect the results
results.summary()
```
| Dep. Variable: | np.log(OI) | R-squared: | 0.262 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.247 |
| Method: | Least Squares | F-statistic: | 17.23 |
| Date: | Sun, 23 Nov 2014 | Prob (F-statistic): | 3.96e-07 |
| Time: | 12:01:16 | Log-Likelihood: | -61.671 |
| No. Observations: | 100 | AIC: | 129.3 |
| Df Residuals: | 97 | BIC: | 137.2 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.8292 | 0.178 | 4.666 | 0.000 | 0.477 1.182 |
| Sex[T.Male] | 0.3189 | 0.116 | 2.757 | 0.007 | 0.089 0.548 |
| Age | 0.0162 | 0.004 | 4.603 | 0.000 | 0.009 0.023 |
Omnibus: | 3.016 | Durbin-Watson: | 2.007 |
---|---|---|---|
Prob(Omnibus): | 0.221 | Jarque-Bera (JB): | 2.772 |
Skew: | -0.408 | Prob(JB): | 0.250 |
Kurtosis: | 2.975 | Cond. No. | 200. |
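The `Sex[T.Male]` row shows the formula interface applying treatment coding: the categorical `Sex` becomes a 0/1 dummy with `Female` as the baseline level, which is why its coefficient matches the hand-built `Gender` column. The same encoding can be sketched with `pandas` (invented data):

```python
import pandas as pd

# invented data; 'Female' is the baseline level (alphabetically first)
toy = pd.DataFrame({'Sex': ['Female', 'Male', 'Male']})
dummies = pd.get_dummies(toy['Sex'], drop_first=True).astype(int)
print(dummies['Male'].tolist())  # [0, 1, 1]
```

`drop_first=True` discards the baseline column, mirroring the formula interface's choice of reference level.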
The formula interface also works directly with `R` datasets.

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Load data
dat = sm.datasets.get_rdataset("Guerry", "HistData").data

# Fit regression model (using the natural log of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
# Inspect the results
results.summary()
```
| Dep. Variable: | Lottery | R-squared: | 0.348 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.333 |
| Method: | Least Squares | F-statistic: | 22.20 |
| Date: | Sun, 23 Nov 2014 | Prob (F-statistic): | 1.90e-08 |
| Time: | 12:01:22 | Log-Likelihood: | -379.82 |
| No. Observations: | 86 | AIC: | 765.6 |
| Df Residuals: | 83 | BIC: | 773.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 246.4341 | 35.233 | 6.995 | 0.000 | 176.358 316.510 |
| Literacy | -0.4889 | 0.128 | -3.832 | 0.000 | -0.743 -0.235 |
| np.log(Pop1831) | -31.3114 | 5.977 | -5.239 | 0.000 | -43.199 -19.424 |
Omnibus: | 3.713 | Durbin-Watson: | 2.019 |
---|---|---|---|
Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.394 |
Skew: | -0.487 | Prob(JB): | 0.183 |
Kurtosis: | 3.003 | Cond. No. | 702. |
For those from a machine learning background, `scikit-learn` would be more familiar.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
X, y = df[['Age', 'Gender']], np.log(df['OI']).values[:, None]
model.fit(X, y)
score = '{0:.3f}'.format(model.score(X, y))
print(model.coef_)
```
[[ 0.01620833 0.31889832]]
Or we can solve the normal equations directly for $\boldsymbol{\beta}$.

```python
import numpy as np

X = df[['Age', 'Gender', 'Eins']]  # include the column of ones for the intercept
beta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
beta = pd.DataFrame(beta, index=X.columns)
beta
```
| | 0 |
|---|---|
| Age | 0.016208 |
| Gender | 0.318898 |
| Eins | 0.829203 |
Although, as Darren Wilkinson pointed out to me at the meeting, this is more correctly done with a QR decomposition (see this page for a description).
$$\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}$$

$$(\mathbf{Q}\mathbf{R})^\top (\mathbf{Q}\mathbf{R})\boldsymbol{\beta} = (\mathbf{Q}\mathbf{R})^\top \mathbf{y}$$

$$\mathbf{R}^\top (\mathbf{Q}^\top \mathbf{Q}) \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{y}$$

$$\mathbf{R}^\top \mathbf{R} \boldsymbol{\beta} = \mathbf{R}^\top \mathbf{Q}^\top \mathbf{y}$$

$$\mathbf{R} \boldsymbol{\beta} = \mathbf{Q}^\top \mathbf{y}$$

So if we let $\mathbf{z} = \mathbf{Q}^\top \mathbf{y}$,

$$\mathbf{R} \boldsymbol{\beta} = \mathbf{z}$$
```python
import numpy as np
from scipy.linalg import solve_triangular

Q, R = np.linalg.qr(X)
beta = solve_triangular(R, np.dot(Q.T, y))  # back-substitution on upper-triangular R
beta = pd.DataFrame(beta, index=X.columns)
beta
```
| | 0 |
|---|---|
| Age | 0.016208 |
| Gender | 0.318898 |
| Eins | 0.829203 |
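On a well-conditioned problem the two routes agree to numerical precision; the QR route simply avoids forming $\mathbf{X}^\top \mathbf{X}$, whose condition number is the square of that of $\mathbf{X}$. A quick check on invented random data:

```python
import numpy as np
from scipy.linalg import solve_triangular

# invented random regression problem
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = rng.standard_normal((100, 1))

# normal equations route
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# QR route: R beta = Q^T y
Q, R = np.linalg.qr(X)
beta_qr = solve_triangular(R, Q.T @ y)

print(np.allclose(beta_ne, beta_qr))  # True
```

For nearly collinear regressors the normal-equations route loses roughly twice as many digits, which is the practical argument for QR.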