import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.graphics.gofplots import qqplot
Load data and perform some cleanings. Of the 2,500 observations, we identified two that contain missing values, which were simply removed, because we expect that such few occurrences will not affect the analysis results significantly.
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData = loansData.reset_index(drop=True)
loansData.columns = map(lambda x: x.replace('.', '_').lower(), loansData.columns)
loansData['amount_requested'] = loansData['amount_requested'].astype(float)
loansData['interest_rate'] = loansData['interest_rate'].apply(lambda x: float(x.strip('%')))
loansData['debt_to_income_ratio'] = loansData['debt_to_income_ratio'].apply(lambda x: float(x.strip('%')) / 100)
loansData['loan_length'] = loansData['loan_length'].apply(lambda x: float(x.strip('months').strip()))
loansData = loansData.dropna()
loansData['interest_rate'].describe()
count 2498.000000 mean 13.070368 std 4.178007 min 5.420000 25% 10.160000 50% 13.110000 75% 15.800000 max 24.890000
The interest rates are evenly distributed, with near-bell-shape distribution centered at around 13%. Based on this observation we identified that no transformation is required on the interest rate.
loansData.interest_rate.plot(kind='kde');
We first check the correlation between all numerical variables in the data. We exclude the amount of granted loan from further analysis because it is highly correlated with the amount of requested loan, which, common sense says, might affect the interest rate of the loan. Our correlation table shows that it is indeed the case. We further identified other variables that could be useful for further analysis based on their relative magnitudes of correlation with respect to the interest rate.
loansData.corr()
amount_requested | amount_funded_by_investors | interest_rate | loan_length | debt_to_income_ratio | monthly_income | open_credit_lines | revolving_credit_balance | inquiries_in_the_last_6_months | |
---|---|---|---|---|---|---|---|---|---|
amount_requested | 1.000000 | 0.969841 | 0.331831 | 0.412298 | 0.081288 | 0.391179 | 0.195942 | 0.293365 | -0.029561 |
amount_funded_by_investors | 0.969841 | 1.000000 | 0.336830 | 0.406218 | 0.093679 | 0.374377 | 0.185826 | 0.261892 | -0.067757 |
interest_rate | 0.331831 | 0.336830 | 1.000000 | 0.423506 | 0.172203 | 0.012921 | 0.090307 | 0.061109 | 0.164646 |
loan_length | 0.412298 | 0.406218 | 0.423506 | 1.000000 | 0.024991 | 0.074545 | 0.040890 | 0.055436 | 0.023842 |
debt_to_income_ratio | 0.081288 | 0.093679 | 0.172203 | 0.024991 | 1.000000 | -0.162345 | 0.370850 | 0.189221 | 0.011979 |
monthly_income | 0.391179 | 0.374377 | 0.012921 | 0.074545 | -0.162345 | 1.000000 | 0.171396 | 0.359684 | 0.033948 |
open_credit_lines | 0.195942 | 0.185826 | 0.090307 | 0.040890 | 0.370850 | 0.171396 | 1.000000 | 0.290085 | 0.110736 |
revolving_credit_balance | 0.293365 | 0.261892 | 0.061109 | 0.055436 | 0.189221 | 0.359684 | 0.290085 | 1.000000 | 0.012186 |
inquiries_in_the_last_6_months | -0.029561 | -0.067757 | 0.164646 | 0.023842 | 0.011979 | 0.033948 | 0.110736 | 0.012186 | 1.000000 |
Some exploratory analyses were needed to determine the significance of the categorical variables. We identified that the loan length has a stronger effect on the interest rate compared to other categorical variables. We can verify this by looking at boxplots of the interest rate grouped by the loan lengths. A scatter plot between the FICO range and interest rate, grouped by loan lengths, also shows some clustering, although there are plenty of overlaps.
loansData.boxplot(column='interest_rate');
props = loansData.loan_length.value_counts().astype(float)/len(loansData.loan_length)
loansData.boxplot(column='interest_rate', by='loan_length', widths=props);
grouped = loansData.groupby('loan_length')
fr = pd.Categorical.from_array(loansData.fico_range)
cols = ['Yellow', 'MediumOrchid']
i = 0
for length, group in grouped:
scatter(pd.Categorical.from_array(group.fico_range).labels, group.interest_rate, c=cols[i].strip().strip('\n'), label=str(int(length)) + ' months', s=30)
i += 1
a = gca()
a.set_xticks(np.unique(fr.labels))
a.set_xticklabels(fr.levels, rotation='30')
a.set_xlabel('FICO range', fontsize=12)
a.set_ylabel('Interest rate', fontsize=12)
a.legend()
f = gcf()
f.set_size_inches(12,7)
f.suptitle('Interest rate as a function of FICO range and loan length', fontsize=15)
f.savefig('interest.png');
We fit a regression model relating interest rate to loan length and FICO range. Although we could still improve the fit by including the numerical variables, we choose not to do it to make a simple and tractable model. We also don’t consider interaction between loan length and FICO range.
lm = ols('interest_rate ~ C(loan_length) + fico_range', loansData).fit()
lm.summary()
Dep. Variable: | interest_rate | R-squared: | 0.731 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.727 |
Method: | Least Squares | F-statistic: | 175.7 |
Date: | Mon, 18 Feb 2013 | Prob (F-statistic): | 0.00 |
Time: | 01:16:09 | Log-Likelihood: | -5476.3 |
No. Observations: | 2498 | AIC: | 1.103e+04 |
Df Residuals: | 2459 | BIC: | 1.126e+04 |
Df Model: | 38 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 15.2120 | 0.977 | 15.574 | 0.000 | 13.297 17.127 |
C(loan_length)[T.60.0] | 4.3797 | 0.106 | 41.193 | 0.000 | 4.171 4.588 |
fico_range[T.645-649] | -0.3287 | 1.595 | -0.206 | 0.837 | -3.456 2.799 |
fico_range[T.650-654] | -0.0820 | 2.393 | -0.034 | 0.973 | -4.774 4.610 |
fico_range[T.655-659] | -0.2820 | 1.465 | -0.192 | 0.847 | -3.155 2.591 |
fico_range[T.660-664] | 2.1943 | 0.996 | 2.202 | 0.028 | 0.240 4.148 |
fico_range[T.665-669] | 1.3299 | 0.994 | 1.338 | 0.181 | -0.619 3.279 |
fico_range[T.670-674] | 0.3449 | 0.991 | 0.348 | 0.728 | -1.599 2.288 |
fico_range[T.675-679] | -0.2280 | 0.992 | -0.230 | 0.818 | -2.172 1.716 |
fico_range[T.680-684] | -1.0059 | 0.992 | -1.014 | 0.311 | -2.952 0.940 |
fico_range[T.685-689] | -1.4203 | 0.995 | -1.428 | 0.153 | -3.371 0.530 |
fico_range[T.690-694] | -1.5749 | 0.994 | -1.584 | 0.113 | -3.525 0.375 |
fico_range[T.695-699] | -2.1815 | 0.993 | -2.197 | 0.028 | -4.129 -0.234 |
fico_range[T.700-704] | -2.9584 | 0.996 | -2.971 | 0.003 | -4.911 -1.006 |
fico_range[T.705-709] | -3.5981 | 0.995 | -3.616 | 0.000 | -5.549 -1.647 |
fico_range[T.710-714] | -3.7565 | 0.999 | -3.762 | 0.000 | -5.715 -1.798 |
fico_range[T.715-719] | -4.9249 | 1.003 | -4.910 | 0.000 | -6.892 -2.958 |
fico_range[T.720-724] | -5.1367 | 0.998 | -5.146 | 0.000 | -7.094 -3.179 |
fico_range[T.725-729] | -5.4909 | 1.003 | -5.476 | 0.000 | -7.457 -3.525 |
fico_range[T.730-734] | -6.2809 | 1.003 | -6.264 | 0.000 | -8.247 -4.315 |
fico_range[T.735-739] | -6.5318 | 1.014 | -6.442 | 0.000 | -8.520 -4.544 |
fico_range[T.740-744] | -6.6940 | 1.022 | -6.549 | 0.000 | -8.698 -4.690 |
fico_range[T.745-749] | -6.8514 | 1.022 | -6.706 | 0.000 | -8.855 -4.848 |
fico_range[T.750-754] | -7.4623 | 1.016 | -7.344 | 0.000 | -9.455 -5.470 |
fico_range[T.755-759] | -7.4537 | 1.029 | -7.244 | 0.000 | -9.471 -5.436 |
fico_range[T.760-764] | -7.3461 | 1.029 | -7.141 | 0.000 | -9.363 -5.329 |
fico_range[T.765-769] | -8.0370 | 1.043 | -7.709 | 0.000 | -10.081 -5.993 |
fico_range[T.770-774] | -8.7167 | 1.111 | -7.844 | 0.000 | -10.896 -6.538 |
fico_range[T.775-779] | -7.4633 | 1.082 | -6.895 | 0.000 | -9.586 -5.341 |
fico_range[T.780-784] | -8.7177 | 1.061 | -8.218 | 0.000 | -10.798 -6.638 |
fico_range[T.785-789] | -8.0972 | 1.098 | -7.372 | 0.000 | -10.251 -5.943 |
fico_range[T.790-794] | -8.3085 | 1.092 | -7.607 | 0.000 | -10.450 -6.167 |
fico_range[T.795-799] | -8.8419 | 1.150 | -7.686 | 0.000 | -11.098 -6.586 |
fico_range[T.800-804] | -8.6511 | 1.163 | -7.439 | 0.000 | -10.931 -6.371 |
fico_range[T.805-809] | -8.5820 | 1.178 | -7.284 | 0.000 | -10.892 -6.272 |
fico_range[T.810-814] | -6.9183 | 1.245 | -5.556 | 0.000 | -9.360 -4.477 |
fico_range[T.815-819] | -9.0170 | 1.323 | -6.817 | 0.000 | -11.611 -6.423 |
fico_range[T.820-824] | -7.3120 | 2.393 | -3.056 | 0.002 | -12.004 -2.620 |
fico_range[T.830-834] | -7.5920 | 2.393 | -3.173 | 0.002 | -12.284 -2.900 |
Omnibus: | 130.177 | Durbin-Watson: | 1.922 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 181.842 |
Skew: | 0.477 | Prob(JB): | 3.26e-40 |
Kurtosis: | 3.916 | Cond. No. | 146. |
lm.pvalues
Intercept 3.159062e-52 C(loan_length)[T.60.0] 1.557353e-282 fico_range[T.645-649] 8.367674e-01 fico_range[T.650-654] 9.726628e-01 fico_range[T.655-659] 8.473899e-01 fico_range[T.660-664] 2.775211e-02 fico_range[T.665-669] 1.809146e-01 fico_range[T.670-674] 7.278396e-01 fico_range[T.675-679] 8.181856e-01 fico_range[T.680-684] 3.109028e-01 fico_range[T.685-689] 1.534355e-01 fico_range[T.690-694] 1.133903e-01 fico_range[T.695-699] 2.811550e-02 fico_range[T.700-704] 2.992288e-03 fico_range[T.705-709] 3.057081e-04 fico_range[T.710-714] 1.727181e-04 fico_range[T.715-719] 9.680745e-07 fico_range[T.720-724] 2.875148e-07 fico_range[T.725-729] 4.785347e-08 fico_range[T.730-734] 4.420753e-10 fico_range[T.735-739] 1.411781e-10 fico_range[T.740-744] 7.028897e-11 fico_range[T.745-749] 2.473532e-11 fico_range[T.750-754] 2.813765e-13 fico_range[T.755-759] 5.797948e-13 fico_range[T.760-764] 1.211201e-12 fico_range[T.765-769] 1.823367e-14 fico_range[T.770-774] 6.428102e-15 fico_range[T.775-779] 6.805091e-12 fico_range[T.780-784] 3.294661e-16 fico_range[T.785-789] 2.278067e-13 fico_range[T.790-794] 3.962034e-14 fico_range[T.795-799] 2.181115e-14 fico_range[T.800-804] 1.391379e-13 fico_range[T.805-809] 4.343024e-13 fico_range[T.810-814] 3.054587e-08 fico_range[T.815-819] 1.163922e-11 fico_range[T.820-824] 2.266550e-03 fico_range[T.830-834] 1.526738e-03
Finally we check that the residual error is normally distributed, i.e. no non-random patterns of variation in the residuals.
qqplot(lm.resid, line='s');