These notes accompany the Data Analyst Nanodegree webcast on multicollinearity in linear regression, given by Charlie and Stephen on Tuesday 16 June 2015.
Run the code locally (or on your own dataset) to investigate multicollinearity yourself. It's a good idea to keep track of which features you include, the R^2 value you obtain, and any multicollinearity issues you observe.
#Import the useful data science packages!
import numpy as np
import pandas as pd
import statsmodels.api as sm
#Load the data
prosper = pd.read_csv('/Users/charlie/Downloads/prosperLoanData.csv')
# You can download the dataset from https://docs.google.com/document/d/1w7KhqotVi5eoKE3I_AZHbsxdr-NmcWsLTIiZrpxWx4w/pub
# and then run all of this locally.
prosper.columns
# Normalisation function used to ensure that each numerical variable has mean = 0
# and standard deviation = 1. Does the same as the function in Lesson 3 of Intro to DS.
def normalise(data):
    mean = data.mean()
    stdev = data.std()
    return (data - mean)/stdev
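As a quick sanity check on the function above, a toy Series (made-up numbers, not the Prosper data) should come out with mean 0 and standard deviation 1 — note that pandas' .std() uses the sample estimator (ddof=1):

```python
import pandas as pd

def normalise(data):
    return (data - data.mean()) / data.std()

z = normalise(pd.Series([2.0, 4.0, 6.0]))
print(z.tolist())         # [-1.0, 0.0, 1.0]
print(z.mean(), z.std())  # 0.0 1.0
```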
# Choose some of the many columns from the dataset. We're going to attempt to predict the
# 'LoanOriginalAmount' from some of the other data.
prosper = prosper[['CreditScoreRangeLower','StatedMonthlyIncome', \
'IsBorrowerHomeowner', 'CreditScoreRangeUpper',\
'EmploymentStatus','Term','BorrowerRate','LenderYield',\
'LoanOriginalAmount']]
# Select just the numerical variables; we'll normalise these, and we'll create
# dummy variables from the categorical variables.
numerical_variables = ['CreditScoreRangeLower','StatedMonthlyIncome',\
'Term','CreditScoreRangeUpper','BorrowerRate',\
'LenderYield','LoanOriginalAmount']
# Just remove the missing data and any duplicates, for simplicity!
prosper.dropna(inplace = True)
prosper.drop_duplicates(inplace = True)
# Choose the numerical variables from prosper and remove the target to create the features
features = prosper[numerical_variables].drop(['LoanOriginalAmount'],axis = 1)
#normalising numerical features improves the performance of fitting algorithms
# (don't normalise the dummy variables though, that's generally a bad idea!)
features = normalise(features)
#create dataframes of homeowner and employment dummies
home_dum = pd.get_dummies(prosper.IsBorrowerHomeowner,prefix="homeowner")
job_dum = pd.get_dummies(prosper.EmploymentStatus,prefix = "job")
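pd.get_dummies expands a categorical column into one indicator column per level. A small standalone sketch (made-up status values standing in for prosper.EmploymentStatus), which also shows why we'll later drop one dummy from each full set:

```python
import pandas as pd

# Toy status column (made-up values, not the Prosper data)
status = pd.Series(['Employed', 'Part-time', 'Employed', 'Retired'])
dums = pd.get_dummies(status, prefix='job')
print(dums.columns.tolist())  # ['job_Employed', 'job_Part-time', 'job_Retired']

# With a constant in the model, the full dummy set is perfectly collinear
# (every row of dummies sums to exactly 1), so we drop one level to avoid
# the "dummy variable trap":
dums = dums.drop('job_Employed', axis=1)
```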
Interact with the following cell to adjust your model. Uncomment/comment rows to include or exclude features, or use
features.drop(['ColumnName'], axis=1, inplace=True)
to drop a single column (replace 'ColumnName' with the name of the column you want removed).
# uncomment to add a constant column
#features = sm.add_constant(features)
# uncomment these to add the dummy variables to the features
#features = features.join(job_dum)
#features = features.join(home_dum)
# uncomment these to drop a single dummy variable from each full set
# (but only if you've previously added them!)
#features.drop(['job_Employed'],axis=1,inplace=True)
#features.drop(['homeowner_True'],axis = 1,inplace=True)
# set the target values to fit the linear regression model
values = prosper.LoanOriginalAmount
# Watch out for strongly correlated features!
features.corr()
# create, fit and summarise the model
# check out the coefficients and the condition number to look for multicollinearity
# A good resource for understanding all of this summary output can be found in the excellent
# online statistics textbook here: http://work.thaslwanter.at/Stats/html/statsModels.html#linear-regression-analysis-with-python
sm.OLS(values,features).fit().summary()
Dep. Variable: | LoanOriginalAmount | R-squared: | 0.298 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.298 |
Method: | Least Squares | F-statistic: | 3824. |
Date: | Tue, 16 Jun 2015 | Prob (F-statistic): | 0.00 |
Time: | 18:03:43 | Log-Likelihood: | -1.0772e+06 |
No. Observations: | 107893 | AIC: | 2.154e+06 |
Df Residuals: | 107880 | BIC: | 2.155e+06 |
Df Model: | 12 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
const | 9746.5002 | 27.336 | 356.541 | 0.000 | 9692.922 9800.079 |
CreditScoreRangeLower | 644.4733 | 20.641 | 31.224 | 0.000 | 604.018 684.928 |
StatedMonthlyIncome | 813.8288 | 16.268 | 50.025 | 0.000 | 781.943 845.715 |
Term | 1690.2776 | 16.863 | 100.236 | 0.000 | 1657.227 1723.329 |
LenderYield | -1630.9804 | 18.663 | -87.392 | 0.000 | -1667.559 -1594.401 |
job_Full-time | -2155.2679 | 41.243 | -52.258 | 0.000 | -2236.103 -2074.433 |
job_Not available | -1902.1674 | 81.420 | -23.362 | 0.000 | -2061.750 -1742.585 |
job_Not employed | -2156.2083 | 198.716 | -10.851 | 0.000 | -2545.689 -1766.727 |
job_Other | -1702.3283 | 88.973 | -19.133 | 0.000 | -1876.714 -1527.943 |
job_Part-time | -3515.9629 | 161.795 | -21.731 | 0.000 | -3833.079 -3198.847 |
job_Retired | -3358.6644 | 187.640 | -17.899 | 0.000 | -3726.436 -2990.892 |
job_Self-employed | -720.5137 | 71.666 | -10.054 | 0.000 | -860.977 -580.050 |
homeowner_False | -1080.1679 | 33.811 | -31.947 | 0.000 | -1146.437 -1013.898 |
Omnibus: | 29859.330 | Durbin-Watson: | 2.007 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1384223.160 |
Skew: | 0.572 | Prob(JB): | 0.00 |
Kurtosis: | 20.510 | Cond. No. | 15.7 |
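The Cond. No. at the bottom right (15.7 here) is the condition number of the design matrix; large values are one of the warning signs statsmodels surfaces, because strongly correlated columns make the matrix nearly singular. A standalone sketch (synthetic matrix, not the Prosper data) of how a near-duplicate column inflates it:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))  # two independent columns
print(np.linalg.cond(X))       # small: well-conditioned

# Append a near-copy of the first column: the condition number explodes,
# which is exactly the symptom to watch for in the OLS summary.
X_dup = np.column_stack([X, X[:, 0] + 1e-6 * rng.normal(size=200)])
print(np.linalg.cond(X_dup))
```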