import numpy as np
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
np.random.seed(9876789)
Artificial data:
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)
Our model needs an intercept so we add a column of 1s:
X = sm.add_constant(X)
y = np.dot(X, beta) + e
Inspect the data (note that the column of ones was appended as the last column, matching the last entry of beta):
print X[:5, :]
print y[:5]
Fit and summary:
model = sm.OLS(y, X)
results = model.fit()
print results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.968
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     1479.
Date:                Sun, 26 Aug 2012   Prob (F-statistic):           2.19e-73
Time:                        20:52:52   Log-Likelihood:                -146.51
No. Observations:                 100   AIC:                             299.0
Df Residuals:                      97   BIC:                             306.8
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.8598      0.145      5.948      0.000         0.573     1.147
x2             0.1103      0.014      7.883      0.000         0.082     0.138
const         10.3423      0.313     33.072      0.000         9.722    10.963
==============================================================================
Omnibus:                        2.042   Durbin-Watson:                   2.274
Prob(Omnibus):                  0.360   Jarque-Bera (JB):                1.875
Skew:                           0.234   Prob(JB):                        0.392
Kurtosis:                       2.519   Cond. No.                         144.
==============================================================================
Quantities of interest can be extracted directly from the fitted model. Type dir(results)
for a full list. Here are some examples:
print 'Parameters: ', results.params
print 'R2: ', results.rsquared
Parameters:  [  0.85975052   0.11025357  10.34233516]
R2:  0.968242518536
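Other commonly used quantities are available the same way, for example the standard errors, the 95% confidence intervals of the parameters, and the adjusted R-squared:
print 'Standard errors: ', results.bse
print 'Confidence intervals: ', results.conf_int()
print 'Adjusted R2: ', results.rsquared_adj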
We simulate artificial data with a non-linear relationship between x and y:
nsample = 50
sig = 0.5
x = np.linspace(0, 20, nsample)
X = np.c_[x, np.sin(x), (x-5)**2, np.ones(nsample)]
beta = [0.5, 0.5, -0.02, 5.]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Fit and summary:
res = sm.OLS(y, X).fit()
print res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.928
Method:                 Least Squares   F-statistic:                     211.8
Date:                Sun, 26 Aug 2012   Prob (F-statistic):           6.30e-27
Time:                        20:52:52   Log-Likelihood:                -34.438
No. Observations:                  50   AIC:                             76.88
Df Residuals:                      46   BIC:                             84.52
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.4687      0.026     17.751      0.000         0.416     0.522
x2             0.4836      0.104      4.659      0.000         0.275     0.693
x3            -0.0174      0.002     -7.507      0.000        -0.022    -0.013
const          5.2058      0.171     30.405      0.000         4.861     5.550
==============================================================================
Omnibus:                        0.655   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.721   Jarque-Bera (JB):                0.360
Skew:                           0.207   Prob(JB):                        0.835
Kurtosis:                       3.026   Cond. No.                         221.
==============================================================================
Extract other quantities of interest:
print 'Parameters: ', res.params
print 'Standard errors: ', res.bse
print 'Predicted values: ', res.predict()
Parameters:  [ 0.46872448  0.48360119 -0.01740479  5.20584496]
Standard errors:  [ 0.02640602  0.10380518  0.00231847  0.17121765]
Predicted values:  [  4.77072516   5.22213464   5.63620761   5.98658823   6.25643234
   6.44117491   6.54928009   6.60085051   6.62432454   6.6518039    6.71377946
   6.83412169   7.02615877   7.29048685   7.61487206   7.97626054   8.34456611
   8.68761335   8.97642389   9.18997755   9.31866582   9.36587056   9.34740836
   9.28893189   9.22171529   9.17751587   9.1833565    9.25708583   9.40444579
   9.61812821   9.87897556  10.15912843  10.42660281  10.65054491  10.8063004
  10.87946503  10.86825119  10.78378163  10.64826203  10.49133265  10.34519853
  10.23933827  10.19566084  10.22490593  10.32487947  10.48081414  10.66779556
  10.85485568  11.01006072  11.10575781]
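The predict method also accepts a new exogenous matrix; it just has to have the same columns, in the same order, as the design matrix used for the fit (here: x, sin(x), (x-5)**2, constant). A small sketch with a few arbitrarily chosen new x values:
x_new = np.linspace(0, 25, 5)
X_new = np.c_[x_new, np.sin(x_new), (x_new - 5)**2, np.ones(5)]
print res.predict(X_new)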
Draw a plot to compare the true relationship to the OLS predictions. Prediction intervals around the fitted values are computed with the wls_prediction_std function.
plt.figure()
plt.plot(x, y, 'o', x, y_true, 'b-')
prstd, iv_l, iv_u = wls_prediction_std(res)
plt.plot(x, res.fittedvalues, 'r--.')
plt.plot(x, iv_u, 'r--')
plt.plot(x, iv_l, 'r--')
plt.title('blue: true, red: OLS')
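For reference, wls_prediction_std returns the standard deviation of the prediction together with the lower and upper interval bounds; with the default alpha of 0.05 the bounds are just the fitted values plus or minus the corresponding Student-t quantile times that standard deviation. A quick sanity check under that assumption:
from scipy import stats
tppf = stats.t.ppf(1 - 0.05 / 2., res.df_resid)
print np.allclose(iv_u, res.fittedvalues + tppf * prstd)
print np.allclose(iv_l, res.fittedvalues - tppf * prstd)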
We generate some artificial data. There are 3 groups which will be modelled using dummy variables. Group 0 is the omitted/benchmark category.
nsample = 50
groups = np.zeros(nsample, int)
groups[20:40] = 1
groups[40:] = 2
dummy = (groups[:,None] == np.unique(groups)).astype(float)
x = np.linspace(0, 20, nsample)
X = np.c_[x, dummy[:,1:], np.ones(nsample)]
beta = [1., 3, -3, 10]
y_true = np.dot(X, beta)
e = np.random.normal(size=nsample)
y = y_true + e
Inspect the data:
print X[:5,:]
print y[:5]
print groups
print dummy[:5,:]
[[ 0.          0.          0.          1.        ]
 [ 0.40816327  0.          0.          1.        ]
 [ 0.81632653  0.          0.          1.        ]
 [ 1.2244898   0.          0.          1.        ]
 [ 1.63265306  0.          0.          1.        ]]
[  9.28223335  10.50481865  11.84389206  10.38508408  12.37941998]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 2 2 2 2 2 2 2 2 2 2]
[[ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]
 [ 1.  0.  0.]]
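The dummy matrix is produced by a broadcasting trick: comparing the column vector groups[:, None] against the row of unique group labels gives one 0/1 indicator column per group. A minimal standalone illustration (the small array g is only for demonstration):
g = np.array([0, 1, 2, 1])
print (g[:, None] == np.unique(g)).astype(float)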
Fit and summary:
res2 = sm.OLS(y, X).fit()
print res2.summary()
Draw a plot to compare the true relationship to OLS predictions:
prstd, iv_l, iv_u = wls_prediction_std(res2)
plt.figure()
plt.plot(x, y, 'o', x, y_true, 'b-')
plt.plot(x, res2.fittedvalues, 'r--.')
plt.plot(x, iv_u, 'r--')
plt.plot(x, iv_l, 'r--')
plt.title('blue: true, red: OLS')
Next we test the joint hypothesis that both group coefficients are zero, i.e. R * beta = 0, where each row of the restriction matrix R selects one of the dummy coefficients:
R = [[0, 1, 0, 0], [0, 0, 1, 0]]
print np.array(R)
print res2.f_test(R)
[[0 1 0 0]
 [0 0 1 0]]
<F test: F=array([[ 145.49268198]]), p=[[ 0.]], df_denom=46, df_num=2>
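The same statistic can be reproduced by hand: fit the restricted model that drops the two dummy columns and compare the sums of squared residuals, F = ((SSR_restricted - SSR_unrestricted) / q) / (SSR_unrestricted / df_resid) with q = 2 restrictions. A sketch using attributes of the fitted results:
res_restricted = sm.OLS(y, np.c_[x, np.ones(nsample)]).fit()
q = 2
F = ((res_restricted.ssr - res2.ssr) / q) / (res2.ssr / res2.df_resid)
print F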
If we generate artificial data with smaller group effects, the F test can no longer reject the null hypothesis:
beta = [1., 0.3, -0.0, 10]
y_true = np.dot(X, beta)
y = y_true + np.random.normal(size=nsample)
res3 = sm.OLS(y, X).fit()
print res3.f_test(R)
<F test: F=array([[ 1.22491119]]), p=[[ 0.30318644]], df_denom=46, df_num=2>
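The pieces of the test are also available individually; assuming your statsmodels version exposes them as attributes of the returned test object:
ftest = res3.f_test(R)
print 'F statistic: ', ftest.fvalue
print 'p-value: ', ftest.pvalue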
The Longley dataset is well known to have high multicollinearity, that is, the exogenous predictors are highly correlated. This is problematic because it can affect the stability of our coefficient estimates as we make minor changes to model specification.
from statsmodels.datasets.longley import load
y = load().endog
X = load().exog
X = sm.tools.add_constant(X)
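The dataset object also carries the variable names, which is handy because the summary below only labels the regressors x1 through x6:
data = load()
print 'endog: ', data.endog_name
print 'exog: ', data.exog_name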
Fit and summary:
ols_model = sm.OLS(y, X)
ols_results = ols_model.fit()
print ols_results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.992
Method:                 Least Squares   F-statistic:                     330.3
Date:                Sun, 26 Aug 2012   Prob (F-statistic):           4.98e-10
Time:                        20:52:53   Log-Likelihood:                -109.62
No. Observations:                  16   AIC:                             233.2
Df Residuals:                       9   BIC:                             238.6
Df Model:                           6
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            15.0619     84.915      0.177      0.863      -177.029   207.153
x2            -0.0358      0.033     -1.070      0.313        -0.112     0.040
x3            -2.0202      0.488     -4.136      0.003        -3.125    -0.915
x4            -1.0332      0.214     -4.822      0.001        -1.518    -0.549
x5            -0.0511      0.226     -0.226      0.826        -0.563     0.460
x6          1829.1515    455.478      4.016      0.003       798.788  2859.515
const      -3.482e+06    8.9e+05     -3.911      0.004      -5.5e+06 -1.47e+06
==============================================================================
Omnibus:                        0.749   Durbin-Watson:                   2.559
Prob(Omnibus):                  0.688   Jarque-Bera (JB):                0.684
Skew:                           0.420   Prob(JB):                        0.710
Kurtosis:                       2.434   Cond. No.                     4.86e+09
==============================================================================

The condition number is large, 4.86e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
One way to assess multicollinearity is to compute the condition number. Values over 20 are worrisome (see Greene 4.9). The first step is to normalize the independent variables to have unit length:
norm_x = np.ones_like(X)
for i in range(int(ols_model.df_model)):
    norm_x[:, i] = X[:, i] / np.linalg.norm(X[:, i])
norm_xtx = np.dot(norm_x.T,norm_x)
Then, we take the square root of the ratio of the largest to the smallest eigenvalues.
eigs = np.linalg.eigvals(norm_xtx)
condition_number = np.sqrt(eigs.max() / eigs.min())
print condition_number
56240.8699497
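As a cross-check, NumPy's built-in condition number of the normalized design matrix should agree: the 2-norm condition number is the ratio of the largest to the smallest singular value, i.e. the square root of the eigenvalue ratio computed above.
print np.linalg.cond(norm_x)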
Greene also points out that dropping a single observation can have a dramatic effect on the coefficient estimates:
ols_results2 = sm.OLS(y[:-1], X[:-1,:]).fit()
print "Percentage change %4.2f%%\n"*7 % tuple([i for i in ols_results.params/ols_results2.params*100 - 100])
Percentage change -173.43%
Percentage change 31.04%
Percentage change 3.48%
Percentage change 7.83%
Percentage change -199.54%
Percentage change 15.39%
Percentage change 15.40%
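statsmodels also provides formal influence diagnostics for this kind of leave-one-out effect; assuming your version exposes results.get_influence(), the DFBETAS row for the dropped (last) observation shows how much each coefficient moves, scaled by its standard error:
infl = ols_results.get_influence()
print infl.dfbetas[-1]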