Data Analysis Assignment¶

In [71]:

import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.graphics.gofplots import qqplot

Load data and perform some cleanings. Of the 2,500 observations, we identified two that contain missing values, which were simply removed, because we expect that such few occurrences will not affect the analysis results significantly.

In [93]:

loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')

loansData = loansData.reset_index(drop=True)

loansData.columns = map(lambda x: x.replace('.', '_').lower(), loansData.columns)

loansData['amount_requested'] = loansData['amount_requested'].astype(float)
loansData['interest_rate'] = loansData['interest_rate'].apply(lambda x: float(x.strip('%')))
loansData['debt_to_income_ratio'] = loansData['debt_to_income_ratio'].apply(lambda x: float(x.strip('%')) / 100)
loansData['loan_length'] = loansData['loan_length'].apply(lambda x: float(x.strip('months').strip()))

loansData = loansData.dropna()

In [94]:

loansData['interest_rate'].describe()

Out[94]:

count    2498.000000
mean       13.070368
std         4.178007
min         5.420000
25%        10.160000
50%        13.110000
75%        15.800000
max        24.890000

The interest rates are evenly distributed, with near-bell-shape distribution centered at around 13%. Based on this observation we identified that no transformation is required on the interest rate.

In [97]:

loansData.interest_rate.plot(kind='kde');

We first check the correlation between all numerical variables in the data. We exclude the amount of granted loan from further analysis because it is highly correlated with the amount of requested loan, which, common sense says, might affect the interest rate of the loan. Our correlation table shows that it is indeed the case. We further identified other variables that could be useful for further analysis based on their relative magnitudes of correlation with respect to the interest rate.

In [95]:

loansData.corr()

Out[95]:

	amount_requested	amount_funded_by_investors	interest_rate	loan_length	debt_to_income_ratio	monthly_income	open_credit_lines	revolving_credit_balance	inquiries_in_the_last_6_months
amount_requested	1.000000	0.969841	0.331831	0.412298	0.081288	0.391179	0.195942	0.293365	-0.029561
amount_funded_by_investors	0.969841	1.000000	0.336830	0.406218	0.093679	0.374377	0.185826	0.261892	-0.067757
interest_rate	0.331831	0.336830	1.000000	0.423506	0.172203	0.012921	0.090307	0.061109	0.164646
loan_length	0.412298	0.406218	0.423506	1.000000	0.024991	0.074545	0.040890	0.055436	0.023842
debt_to_income_ratio	0.081288	0.093679	0.172203	0.024991	1.000000	-0.162345	0.370850	0.189221	0.011979
monthly_income	0.391179	0.374377	0.012921	0.074545	-0.162345	1.000000	0.171396	0.359684	0.033948
open_credit_lines	0.195942	0.185826	0.090307	0.040890	0.370850	0.171396	1.000000	0.290085	0.110736
revolving_credit_balance	0.293365	0.261892	0.061109	0.055436	0.189221	0.359684	0.290085	1.000000	0.012186
inquiries_in_the_last_6_months	-0.029561	-0.067757	0.164646	0.023842	0.011979	0.033948	0.110736	0.012186	1.000000

Some exploratory analyses were needed to determine the significance of the categorical variables. We identified that the loan length has a stronger effect on the interest rate compared to other categorical variables. We can verify this by looking at boxplots of the interest rate grouped by the loan lengths. A scatter plot between the FICO range and interest rate, grouped by loan lengths, also shows some clustering, although there are plenty of overlaps.

In [96]:

loansData.boxplot(column='interest_rate');

In [98]:

props = loansData.loan_length.value_counts().astype(float)/len(loansData.loan_length)
loansData.boxplot(column='interest_rate', by='loan_length', widths=props);

In [99]:

grouped = loansData.groupby('loan_length')
fr = pd.Categorical.from_array(loansData.fico_range)

cols = ['Yellow', 'MediumOrchid']

i = 0
for length, group in grouped:
    scatter(pd.Categorical.from_array(group.fico_range).labels, group.interest_rate, c=cols[i].strip().strip('\n'), label=str(int(length)) + ' months', s=30)
    i += 1
    
a = gca()
a.set_xticks(np.unique(fr.labels))
a.set_xticklabels(fr.levels, rotation='30')
a.set_xlabel('FICO range', fontsize=12)
a.set_ylabel('Interest rate', fontsize=12)
a.legend()

f = gcf()
f.set_size_inches(12,7)
f.suptitle('Interest rate as a function of FICO range and loan length', fontsize=15)
f.savefig('interest.png');

We fit a regression model relating interest rate to loan length and FICO range. Although we could still improve the fit by including the numerical variables, we choose not to do it to make a simple and tractable model. We also don’t consider interaction between loan length and FICO range.

In [100]:

lm = ols('interest_rate ~ C(loan_length) + fico_range', loansData).fit()
lm.summary()

Out[100]:

OLS Regression Results
Dep. Variable:	interest_rate	R-squared:	0.731
Model:	OLS	Adj. R-squared:	0.727
Method:	Least Squares	F-statistic:	175.7
Date:	Mon, 18 Feb 2013	Prob (F-statistic):	0.00
Time:	01:16:09	Log-Likelihood:	-5476.3
No. Observations:	2498	AIC:	1.103e+04
Df Residuals:	2459	BIC:	1.126e+04
Df Model:	38

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	15.2120	0.977	15.574	0.000	13.297 17.127
C(loan_length)[T.60.0]	4.3797	0.106	41.193	0.000	4.171 4.588
fico_range[T.645-649]	-0.3287	1.595	-0.206	0.837	-3.456 2.799
fico_range[T.650-654]	-0.0820	2.393	-0.034	0.973	-4.774 4.610
fico_range[T.655-659]	-0.2820	1.465	-0.192	0.847	-3.155 2.591
fico_range[T.660-664]	2.1943	0.996	2.202	0.028	0.240 4.148
fico_range[T.665-669]	1.3299	0.994	1.338	0.181	-0.619 3.279
fico_range[T.670-674]	0.3449	0.991	0.348	0.728	-1.599 2.288
fico_range[T.675-679]	-0.2280	0.992	-0.230	0.818	-2.172 1.716
fico_range[T.680-684]	-1.0059	0.992	-1.014	0.311	-2.952 0.940
fico_range[T.685-689]	-1.4203	0.995	-1.428	0.153	-3.371 0.530
fico_range[T.690-694]	-1.5749	0.994	-1.584	0.113	-3.525 0.375
fico_range[T.695-699]	-2.1815	0.993	-2.197	0.028	-4.129 -0.234
fico_range[T.700-704]	-2.9584	0.996	-2.971	0.003	-4.911 -1.006
fico_range[T.705-709]	-3.5981	0.995	-3.616	0.000	-5.549 -1.647
fico_range[T.710-714]	-3.7565	0.999	-3.762	0.000	-5.715 -1.798
fico_range[T.715-719]	-4.9249	1.003	-4.910	0.000	-6.892 -2.958
fico_range[T.720-724]	-5.1367	0.998	-5.146	0.000	-7.094 -3.179
fico_range[T.725-729]	-5.4909	1.003	-5.476	0.000	-7.457 -3.525
fico_range[T.730-734]	-6.2809	1.003	-6.264	0.000	-8.247 -4.315
fico_range[T.735-739]	-6.5318	1.014	-6.442	0.000	-8.520 -4.544
fico_range[T.740-744]	-6.6940	1.022	-6.549	0.000	-8.698 -4.690
fico_range[T.745-749]	-6.8514	1.022	-6.706	0.000	-8.855 -4.848
fico_range[T.750-754]	-7.4623	1.016	-7.344	0.000	-9.455 -5.470
fico_range[T.755-759]	-7.4537	1.029	-7.244	0.000	-9.471 -5.436
fico_range[T.760-764]	-7.3461	1.029	-7.141	0.000	-9.363 -5.329
fico_range[T.765-769]	-8.0370	1.043	-7.709	0.000	-10.081 -5.993
fico_range[T.770-774]	-8.7167	1.111	-7.844	0.000	-10.896 -6.538
fico_range[T.775-779]	-7.4633	1.082	-6.895	0.000	-9.586 -5.341
fico_range[T.780-784]	-8.7177	1.061	-8.218	0.000	-10.798 -6.638
fico_range[T.785-789]	-8.0972	1.098	-7.372	0.000	-10.251 -5.943
fico_range[T.790-794]	-8.3085	1.092	-7.607	0.000	-10.450 -6.167
fico_range[T.795-799]	-8.8419	1.150	-7.686	0.000	-11.098 -6.586
fico_range[T.800-804]	-8.6511	1.163	-7.439	0.000	-10.931 -6.371
fico_range[T.805-809]	-8.5820	1.178	-7.284	0.000	-10.892 -6.272
fico_range[T.810-814]	-6.9183	1.245	-5.556	0.000	-9.360 -4.477
fico_range[T.815-819]	-9.0170	1.323	-6.817	0.000	-11.611 -6.423
fico_range[T.820-824]	-7.3120	2.393	-3.056	0.002	-12.004 -2.620
fico_range[T.830-834]	-7.5920	2.393	-3.173	0.002	-12.284 -2.900

Omnibus:	130.177	Durbin-Watson:	1.922
Prob(Omnibus):	0.000	Jarque-Bera (JB):	181.842
Skew:	0.477	Prob(JB):	3.26e-40
Kurtosis:	3.916	Cond. No.	146.

In [101]:

lm.pvalues

Out[101]:

Intercept                  3.159062e-52
C(loan_length)[T.60.0]    1.557353e-282
fico_range[T.645-649]      8.367674e-01
fico_range[T.650-654]      9.726628e-01
fico_range[T.655-659]      8.473899e-01
fico_range[T.660-664]      2.775211e-02
fico_range[T.665-669]      1.809146e-01
fico_range[T.670-674]      7.278396e-01
fico_range[T.675-679]      8.181856e-01
fico_range[T.680-684]      3.109028e-01
fico_range[T.685-689]      1.534355e-01
fico_range[T.690-694]      1.133903e-01
fico_range[T.695-699]      2.811550e-02
fico_range[T.700-704]      2.992288e-03
fico_range[T.705-709]      3.057081e-04
fico_range[T.710-714]      1.727181e-04
fico_range[T.715-719]      9.680745e-07
fico_range[T.720-724]      2.875148e-07
fico_range[T.725-729]      4.785347e-08
fico_range[T.730-734]      4.420753e-10
fico_range[T.735-739]      1.411781e-10
fico_range[T.740-744]      7.028897e-11
fico_range[T.745-749]      2.473532e-11
fico_range[T.750-754]      2.813765e-13
fico_range[T.755-759]      5.797948e-13
fico_range[T.760-764]      1.211201e-12
fico_range[T.765-769]      1.823367e-14
fico_range[T.770-774]      6.428102e-15
fico_range[T.775-779]      6.805091e-12
fico_range[T.780-784]      3.294661e-16
fico_range[T.785-789]      2.278067e-13
fico_range[T.790-794]      3.962034e-14
fico_range[T.795-799]      2.181115e-14
fico_range[T.800-804]      1.391379e-13
fico_range[T.805-809]      4.343024e-13
fico_range[T.810-814]      3.054587e-08
fico_range[T.815-819]      1.163922e-11
fico_range[T.820-824]      2.266550e-03
fico_range[T.830-834]      1.526738e-03

Finally we check that the residual error is normally distributed, i.e. no non-random patterns of variation in the residuals.

In [102]:

qqplot(lm.resid, line='s');

In [ ]: