As demonstrated in the Data Analyst Nanodegree Webcast on Multicollinearity in Linear Regression by Charlie and Stephen on Tuesday 16th June 2015.

Run the code locally (or on your own dataset) to investigate multicollinearity yourself. It's a good idea to keep track of the features you include, the R^2 value, and any multicollinearity issues you observe.

In [1]:
#Import the useful data science packages!
import numpy as np
import pandas as pd
import statsmodels.api as sm

In [2]:
# Load the data -- run all this locally.
# (The filename below is an assumption; point it at wherever you've
# saved the Prosper loan CSV.)
prosper = pd.read_csv('prosperLoanData.csv')

In [ ]:
prosper.columns

In [53]:
# Normalisation function used to ensure that each numerical variable has mean = 0
# and standard deviation = 1. Does the same as the function in Lesson 3 of Intro to DS.
def normalise(data):
    mean = data.mean()
    stdev = data.std()
    return (data - mean) / stdev
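A quick way to convince yourself the function behaves as intended is to run it on a tiny made-up frame (not the Prosper data) and check the column means and standard deviations:

```python
import pandas as pd

def normalise(data):
    mean = data.mean()
    stdev = data.std()
    return (data - mean) / stdev

# small made-up example, just to sanity-check the behaviour
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 60.0]})
out = normalise(df)
print(out.mean())  # each column mean is ~0
print(out.std())   # each column std is ~1
```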

In [54]:
# Choose some of the many columns from the dataset. We're going to attempt to predict the
# 'LoanOriginalAmount' from some of the other data.
prosper = prosper[['CreditScoreRangeLower', 'StatedMonthlyIncome',
                   'IsBorrowerHomeowner', 'CreditScoreRangeUpper',
                   'EmploymentStatus', 'Term', 'BorrowerRate', 'LenderYield',
                   'LoanOriginalAmount']]

In [65]:
# Select just the numerical variables, we'll normalise these and we'll be creating dummy variables
# from the categorical variables.
numerical_variables = ['CreditScoreRangeLower', 'StatedMonthlyIncome',
                       'Term', 'CreditScoreRangeUpper', 'BorrowerRate',
                       'LenderYield', 'LoanOriginalAmount']

In [77]:
#just remove the missing data and any duplication for simplicity!
prosper.dropna(inplace = True)
prosper.drop_duplicates(inplace = True)

#choose the numerical variables from prosper, remove the target to create features
features = prosper[numerical_variables].drop(['LoanOriginalAmount'],axis = 1)
#normalising numerical features improves the performance of fitting algorithms
# (don't normalise the dummy variables though, that's generally a bad idea!)
features = normalise(features)

#create dataframes of homeowner and employment dummies
home_dum = pd.get_dummies(prosper.IsBorrowerHomeowner,prefix="homeowner")
job_dum = pd.get_dummies(prosper.EmploymentStatus,prefix = "job")
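To see what `get_dummies` produces (and why we'll later drop one dummy from each full set), here's a small made-up example, not the real EmploymentStatus column:

```python
import pandas as pd

# tiny made-up status column
status = pd.Series(['Employed', 'Self-employed', 'Employed', 'Retired'])
dums = pd.get_dummies(status, prefix='job')
print(dums.columns.tolist())
# a full dummy set always sums to 1 across each row, so keeping every
# dummy alongside a constant column gives perfect multicollinearity
# (the "dummy variable trap")
print(dums.sum(axis=1).tolist())
```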


Interact with the following cell to adjust your model. Uncomment/comment rows to include or exclude them, or use

features.drop(['ColumnName'], axis=1, inplace=True)

to drop a column.

In [ ]:
# uncomment to add a constant (intercept) column
#features = sm.add_constant(features)

# uncomment these to add the dummy variables to the features
#features = features.join(job_dum)
#features = features.join(home_dum)

# uncomment these to drop a single dummy variable from each full set
# (but only if you've previously added them!)
#features.drop(['job_Employed'],axis=1,inplace=True)
#features.drop(['homeowner_True'],axis = 1,inplace=True)

# set the target values to fit the linear regression model
values = prosper.LoanOriginalAmount

In [ ]:
# Watch out for strongly correlated features!
features.corr()
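Beyond eyeballing the correlation matrix, variance inflation factors (VIFs) are another standard multicollinearity check. Here's a hedged sketch using statsmodels' `variance_inflation_factor` on made-up data (the column names are illustrative, not from the Prosper set):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
x1 = rng.randn(200)
X = pd.DataFrame({'x1': x1,
                  'x2': x1 + 0.01 * rng.randn(200),  # nearly a copy of x1
                  'x3': rng.randn(200)})              # independent column
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# x1 and x2 get huge VIFs because each is almost perfectly predicted by
# the other; x3 stays near 1
print(dict(zip(X.columns, vifs)))
```

A rough rule of thumb is that VIFs above 5-10 are worth investigating.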

In [78]:
# create, fit and summarise the model
# check out the coefficients and the condition number to look for multicollinearity

# A good resource for understanding all of this summary output can be found in the excellent
# online statistics textbook here: http://work.thaslwanter.at/Stats/html/statsModels.html#linear-regression-analysis-with-python
sm.OLS(values,features).fit().summary()

Out[78]:
OLS Regression Results
Dep. Variable:    LoanOriginalAmount   R-squared:           0.298
Model:            OLS                  Adj. R-squared:      0.298
Method:           Least Squares        F-statistic:         3824.
Date:             Tue, 16 Jun 2015     Prob (F-statistic):  0.00
Time:             18:03:43             Log-Likelihood:      -1.0772e+06
No. Observations: 107893               AIC:                 2.154e+06
Df Residuals:     107880               BIC:                 2.155e+06
Df Model:         12                   Covariance Type:     nonrobust

Coefficients (the variable labels were lost in this transcript; 13 terms, all with p < 0.001):
      coef       std err    t         P>|t|   [95.0% Conf. Int.]
   9746.5002     27.336    356.541    0.000    9692.922   9800.079
    644.4733     20.641     31.224    0.000     604.018    684.928
    813.8288     16.268     50.025    0.000     781.943    845.715
   1690.2776     16.863    100.236    0.000    1657.227   1723.329
  -1630.9804     18.663    -87.392    0.000   -1667.559  -1594.401
  -2155.2679     41.243    -52.258    0.000   -2236.103  -2074.433
  -1902.1674     81.420    -23.362    0.000   -2061.750  -1742.585
  -2156.2083    198.716    -10.851    0.000   -2545.689  -1766.727
  -1702.3283     88.973    -19.133    0.000   -1876.714  -1527.943
  -3515.9629    161.795    -21.731    0.000   -3833.079  -3198.847
  -3358.6644    187.640    -17.899    0.000   -3726.436  -2990.892
   -720.5137     71.666    -10.054    0.000    -860.977   -580.050
  -1080.1679     33.811    -31.947    0.000   -1146.437  -1013.898

Omnibus:        29859.3   Durbin-Watson:     2.007
Prob(Omnibus):  0.000     Jarque-Bera (JB):  1.38422e+06
Skew:           0.572     Prob(JB):          0.00
Kurtosis:       20.51     Cond. No.          15.7
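The condition number at the bottom of the summary is essentially the ratio of the largest to smallest singular value of the design matrix; near-duplicate columns blow it up. A small sketch on made-up data (names are illustrative) shows the effect:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.randn(100)
b = rng.randn(100)

well = np.column_stack([a, b])             # two independent columns
ill = np.column_stack([a, a + 1e-3 * b])   # two nearly identical columns

print(np.linalg.cond(well))  # small: columns carry distinct information
print(np.linalg.cond(ill))   # huge: classic multicollinearity warning sign
```

This is why, in the Prosper model above, including both `BorrowerRate` and `LenderYield` (or both credit-score range bounds) is worth scrutinising.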
In [ ]: