These notes accompany the Data Analyst Nanodegree webcast on multicollinearity in linear regression, given by Charlie and Stephen on Tuesday 16 June 2015.
Run the code locally (or on your own dataset) to investigate multicollinearity yourself. It's a good idea to keep track of which features you include, the R^2 value you obtain, and any multicollinearity issues you observe.
#Import the useful data science packages!
import numpy as np
import pandas as pd
import statsmodels.api as sm
#Load the data
prosper = pd.read_csv('/Users/charlie/Downloads/prosperLoanData.csv')
# You can download the dataset from https://docs.google.com/document/d/1w7KhqotVi5eoKE3I_AZHbsxdr-NmcWsLTIiZrpxWx4w/pub
# and then run all of this locally.
prosper.columns
# Normalisation function used to ensure that each numerical variable has mean = 0
# and standard deviation = 1. Does the same as the function in Lesson 3 of Intro to DS.
def normalise(data):
    mean = data.mean()
    stdev = data.std()
    return (data - mean)/stdev
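As a quick sanity check on the function above, a toy Series (made-up numbers, not the Prosper data) should come out with mean 0 and standard deviation 1 — note that pandas' .std() uses the sample estimator (ddof=1):

```python
import pandas as pd

def normalise(data):
    return (data - data.mean()) / data.std()

z = normalise(pd.Series([2.0, 4.0, 6.0]))
print(z.tolist())         # [-1.0, 0.0, 1.0]
print(z.mean(), z.std())  # 0.0 1.0
```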
# Choose some of the many columns from the dataset. We're going to attempt to predict the
# 'LoanOriginalAmount' from some of the other data.
prosper = prosper[['CreditScoreRangeLower','StatedMonthlyIncome', \
'IsBorrowerHomeowner', 'CreditScoreRangeUpper',\
'EmploymentStatus','Term','BorrowerRate','LenderYield',\
'LoanOriginalAmount']]
# Select just the numerical variables; we'll normalise these, and we'll create
# dummy variables from the categorical variables.
numerical_variables = ['CreditScoreRangeLower','StatedMonthlyIncome',\
'Term','CreditScoreRangeUpper','BorrowerRate',\
'LenderYield','LoanOriginalAmount']
# Just remove the missing data and any duplicates, for simplicity!
prosper.dropna(inplace = True)
prosper.drop_duplicates(inplace = True)
# Choose the numerical variables from prosper and remove the target to create the features
features = prosper[numerical_variables].drop(['LoanOriginalAmount'],axis = 1)
#normalising numerical features improves the performance of fitting algorithms
# (don't normalise the dummy variables though, that's generally a bad idea!)
features = normalise(features)
#create dataframes of homeowner and employment dummies
home_dum = pd.get_dummies(prosper.IsBorrowerHomeowner,prefix="homeowner")
job_dum = pd.get_dummies(prosper.EmploymentStatus,prefix = "job")
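pd.get_dummies expands a categorical column into one indicator column per level. A small standalone sketch (made-up status values standing in for prosper.EmploymentStatus), which also shows why we'll later drop one dummy from each full set:

```python
import pandas as pd

# Toy status column (made-up values, not the Prosper data)
status = pd.Series(['Employed', 'Part-time', 'Employed', 'Retired'])
dums = pd.get_dummies(status, prefix='job')
print(dums.columns.tolist())  # ['job_Employed', 'job_Part-time', 'job_Retired']

# With a constant in the model, the full dummy set is perfectly collinear
# (every row of dummies sums to exactly 1), so we drop one level to avoid
# the "dummy variable trap":
dums = dums.drop('job_Employed', axis=1)
```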
Interact with the following cell to adjust your model. Uncomment/comment rows to include or exclude features, or use
features.drop(['ColumnName'], axis=1, inplace=True)
to drop a single column (replace 'ColumnName' with the name of the column you want removed).
# uncomment to add a constant column
#features = sm.add_constant(features)
# uncomment these to add the dummy variables to the features
#features = features.join(job_dum)
#features = features.join(home_dum)
# uncomment these to drop a single dummy variable from each full set
# (but only if you've previously added them!)
#features.drop(['job_Employed'],axis=1,inplace=True)
#features.drop(['homeowner_True'],axis = 1,inplace=True)
# set the target values to fit the linear regression model
values = prosper.LoanOriginalAmount
# Watch out for strongly correlated features!
features.corr()
# create, fit and summarise the model
# check out the coefficients and the condition number to look for multicollinearity
# A good resource for understanding all of this summary output can be found in the excellent
# online statistics textbook here: http://work.thaslwanter.at/Stats/html/statsModels.html#linear-regression-analysis-with-python
sm.OLS(values,features).fit().summary()
Dep. Variable: | LoanOriginalAmount | R-squared: | 0.298 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.298 |
Method: | Least Squares | F-statistic: | 3824. |
Date: | Tue, 16 Jun 2015 | Prob (F-statistic): | 0.00 |
Time: | 18:03:43 | Log-Likelihood: | -1.0772e+06 |
No. Observations: | 107893 | AIC: | 2.154e+06 |
Df Residuals: | 107880 | BIC: | 2.155e+06 |
Df Model: | 12 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
const | 9746.5002 | 27.336 | 356.541 | 0.000 | 9692.922 9800.079 |
CreditScoreRangeLower | 644.4733 | 20.641 | 31.224 | 0.000 | 604.018 684.928 |
StatedMonthlyIncome | 813.8288 | 16.268 | 50.025 | 0.000 | 781.943 845.715 |
Term | 1690.2776 | 16.863 | 100.236 | 0.000 | 1657.227 1723.329 |
LenderYield | -1630.9804 | 18.663 | -87.392 | 0.000 | -1667.559 -1594.401 |
job_Full-time | -2155.2679 | 41.243 | -52.258 | 0.000 | -2236.103 -2074.433 |
job_Not available | -1902.1674 | 81.420 | -23.362 | 0.000 | -2061.750 -1742.585 |
job_Not employed | -2156.2083 | 198.716 | -10.851 | 0.000 | -2545.689 -1766.727 |
job_Other | -1702.3283 | 88.973 | -19.133 | 0.000 | -1876.714 -1527.943 |
job_Part-time | -3515.9629 | 161.795 | -21.731 | 0.000 | -3833.079 -3198.847 |
job_Retired | -3358.6644 | 187.640 | -17.899 | 0.000 | -3726.436 -2990.892 |
job_Self-employed | -720.5137 | 71.666 | -10.054 | 0.000 | -860.977 -580.050 |
homeowner_False | -1080.1679 | 33.811 | -31.947 | 0.000 | -1146.437 -1013.898 |
Omnibus: | 29859.330 | Durbin-Watson: | 2.007 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1384223.160 |
Skew: | 0.572 | Prob(JB): | 0.00 |
Kurtosis: | 20.510 | Cond. No. | 15.7 |
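The Cond. No. at the bottom right (15.7 here) is the condition number of the design matrix; large values are one of the warning signs statsmodels surfaces, because strongly correlated columns make the matrix nearly singular. A standalone sketch (synthetic matrix, not the Prosper data) of how a near-duplicate column inflates it:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))  # two independent columns
print(np.linalg.cond(X))       # small: well-conditioned

# Append a near-copy of the first column: the condition number explodes,
# which is exactly the symptom to watch for in the OLS summary.
X_dup = np.column_stack([X, X[:, 0] + 1e-6 * rng.normal(size=200)])
print(np.linalg.cond(X_dup))
```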