#!/usr/bin/env python
# coding: utf-8

# We'll download the 2013-2014 data and uncompress it from inside our notebook by invoking the command line (I didn't feel like installing wget on OSX, but at least we have curl):

# In[ ]:

get_ipython().run_cell_magic('bash', '', 'curl https://resources.lendingclub.com/LoanStats3c_securev.csv.zip | tar -xf-\n')


# Let's also take a quick look at the data via bash (file size, head, line count, column count).

# In[ ]:

get_ipython().system('du -h LoanStats3c_securev.csv')


# In[ ]:

get_ipython().system('head -n 2 LoanStats3c_securev.csv')


# Examining the data we see that most of the feature names are intuitive. We can get the specifics from the provided data dictionary at https://www.lendingclub.com/info/download-data.action

# In[ ]:

get_ipython().system('wc -l < LoanStats3c_securev.csv')


# In[ ]:

get_ipython().system("head -2 LoanStats3c_securev.csv | sed 's/[^,]//g' | wc -c")


# It looks like we have 235,630 entries and 57 attributes. At a glance, some of the more notable attributes are:
#
# 1. Amount Requested
# 2. Amount Funded by Investors*
# 3. Interest Rate
# 4. Term (Loan Length)
# 5. Purpose of Loan
# 6. Debt/Income Ratio**
# 7. State
# 8. Rent or Own Home
# 9. Annual Income
# 10. 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years
# 11. FICO Low
# 12. FICO High
# 13. Last FICO Low
# 14. Last FICO High
# 15. Credit Lines Open
# 16. Revolving Balance
# 17. Inquiries in Last 6 Months
# 18. Length of Employment
#
# The last vs. current FICO scores did not exist the last time we looked at the data. I'm thinking they were added to assess directionality or momentum, although only two samples per borrower is a thin basis for that. Interestingly enough, when looking through the attributes provided for declined loans, I found a "score" attribute described as follows:
#
# >*For applications prior to November 5, 2013 the risk score is the borrower's FICO score. For applications after November 5, 2013 the risk score is the borrower's [Vantage](https://en.wikipedia.org/wiki/VantageScore) score.*
#
# I suspect they are actually using both.
#
# *Lending Club states that the amount funded by investors has no effect on the final interest rate assigned to a loan.
#
# **The Debt/Income ratio takes all of your monthly liabilities and divides the total by your gross monthly income.

# # 1. Our Data
# Let's use Pandas to manipulate our data and find an excuse to use the new `pipe` operator: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pipe.html
#
# Our data is 126MB on disk and 102.2MB in memory; not the greatest reduction in size I've seen, but we have a lot of inefficient `object` data types in the DataFrame to start.

# In[98]:

import numpy as np
import pandas as pd
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')


# In[99]:

df = pd.read_csv("LoanStats3c_securev.csv", skiprows=1)
df.info()


# In[100]:

df.head(3)


# # Keeping what we need
# Let's work through our list of attributes slice by slice and figure out what we really need. This involves dropping some useless attributes and cleaning up or modifying others. Working through the columns in batches keeps the cognitive burden low:

# In[101]:

# .ix[row slice, column slice]
df.ix[:4, :7]


# 1. We won't need `id` or `member_id` as they have no real predictive power, so we can drop them from this table.
# 2. `int_rate` was loaded as an `object` data type instead of `float` due to the '%' character. Let's strip that out and convert the column type.
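# Before we do, a quick hedged sanity check (not in the original notebook) that the '%' suffix really is the only thing standing between `int_rate` and a float column. The column name is real; the check itself, and the variable `raw_rates`, are illustrative additions.

# In[ ]:

# Hypothetical sanity check: every non-null raw int_rate string should end in '%'
raw_rates = df.int_rate.dropna()
print(raw_rates.str.endswith('%').all())
print(raw_rates.head(3).tolist())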
# In[102]:

df.drop(['id', 'member_id'], 1, inplace=True)


# In[103]:

df.int_rate = pd.Series(df.int_rate).str.replace('%', '').astype(float)


# ## Loan Amount Requested Versus the Funded Amount
# At a glance I thought it important to discover the relationship between the loan amount requested versus the funded amount, to see if Lending Club ever issues a lower amount than what is asked for and **why**. However, after taking a look at the data dictionary I discovered that the data has already been manipulated:
#
# >*"The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value."*
#
# Alas, it would have been interesting to see how often Lending Club issues a loan that differs from the requested amount. Doing a quick sanity check below, we see there are no instances in the data where the requested amount doesn't match the funded amount.

# In[104]:

print (df.loan_amnt != df.funded_amnt).value_counts()

# Instead of the old subset syntax we could use the new `query` method:
# df.query('loan_amnt != funded_amnt').ix[:, :5].head(3)


# Let's keep moving through the columns:

# In[105]:

df.ix[:5, 8:15]


# ## Employment Title
# `emp_title` might be a free-text field on the application form or a list of curated employment titles. Let's examine how many unique values exist:

# In[106]:

print df.emp_title.value_counts().head()
print df.emp_title.value_counts().tail()
df.emp_title.unique().shape


# 75,353 unique entries among 235,630 observations seems like a bit of a stretch. Taking a look at the head vs. tail of the output shows some suspiciously specific titles such as `bpd/gms coordinator`. I feel comfortable assessing that this data won't be meaningful, and any relationship we might observe could be due to confounding. A more advanced implementation might look to group all these job descriptions into categories and/or examine whether Lending Club's model looks at (income + job) versus just income, but that's out of scope for this post.
#
# Here we define a confounding variable as a variable that obscures the effects of another variable. If one 1st grade teacher used a phonics textbook in her class and a different teacher used a whole language textbook in his class, and students in the two classes were given achievement tests to see how well they read, the independent variables (teacher effectiveness and textbooks) would be confounded. There is no way to determine if differences in reading between the two classes were caused by either or both of the independent variables.
#
# Applying this example to our dataset: Registered Nurses (RNs), who have higher education requirements and receive above-average pay, might be granted A-grade loans on average. Is this due to them working as RNs, or having a 4-year degree, or their salary? Would we see the same effect for Physician Assistants, who go to school for longer and receive even more pay? What about Certified Nursing Assistants (CNAs), who are on the opposite end of the spectrum?
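# As a hedged aside (not something the original analysis does), simply lower-casing and trimming the titles collapses a fair number of duplicates, hinting that a grouped version of this field might be salvageable in a more advanced implementation. The variable `normalized_titles` is illustrative; we still drop the column in the next cell.

# In[ ]:

# Hypothetical normalization: how many unique employment titles survive
# trivial clean-up? (emp_title still exists at this point.)
normalized_titles = df.emp_title.str.lower().str.strip()
print(normalized_titles.nunique())
print(normalized_titles.value_counts().head())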
# In[107]:

df.drop(['emp_title'], 1, inplace=True)


# ## Employment Length
# Leaving this variable in might contradict our decision to drop the employment title, as it also conveys a sort of socio-economic seniority. A Computer Scientist 5 years into their career would generally have a larger salary than a Kindergarten teacher 10 years into their career. Arguably it might be powerful to combine a grouped, matched, and reduced set of employment titles with their length to create a "purchasing power" metric. Since employment length is an easy scalar, let's leave it in for now.
#
# We could leave `emp_length` as categorical data, but it shouldn't be treated as categorical or even ordinal, since the intervals between values are easy to determine. We can convert it into numerical data with a simple filter:

# In[108]:

df.emp_length.value_counts()


# In[109]:

df.replace('n/a', np.nan, inplace=True)
df.emp_length.fillna(value=0, inplace=True)
df['emp_length'].replace(to_replace='[^0-9]+', value='', inplace=True, regex=True)
df['emp_length'] = df['emp_length'].astype(int)


# ## Verification Status
# The values of `verification_status` aren't immediately clear, as `VERIFIED - income` and `VERIFIED - income source` could potentially be the same criteria. The provided data dictionary describes the field as: _"Indicates if income was verified by LC, not verified, or if the income source was verified."_ This description doesn't help, as you'd intuitively think you would have to verify the source in order to verify the income. For now we'll trust that these are indeed different methods or levels (perhaps even ordinal) of verification.

# In[110]:

df.verification_status.value_counts()


# ## Loan Statuses
# Moving on to the last column in the slice of the DataFrame we are examining: loan status is a mutable value that represents the current state of the loan. If anything, we might want to use all the independent variables and/or the interest rate to determine the probability of a given loan status at some point in time. In this post, however, we are focused on predicting the interest rate granted to an applicant at loan creation, so we don't need this variable and will drop it after examining it out of curiosity.
#
# When examining the distribution of loan issue dates, October has the highest number of loans, right ahead of the holiday season. To clean up the space, let's replace the column value with the month only and keep the column as a string/object type.

# In[111]:

print df.loan_status.value_counts()

issue_d_todate = pd.to_datetime(df.issue_d)  # (df['issue_d'].apply(lambda x: x.strftime('%Y-%m-%d')))
df.issue_d = pd.Series(df.issue_d).str.replace('-2014', '')

# We need sort_index() or else we won't get a sequential datetime order.
issue_d_todate.value_counts().sort_index().plot(kind='bar')

df.drop(['loan_status'], 1, inplace=True)


# ## Arbitrary Applicant Input
# Continuing on to examine more columns:

# In[112]:

df.ix[:5, 12:21]


# In[113]:

print df.purpose.value_counts()
print ''
print df.title.value_counts().tail()


# We'll be dropping `pymnt_plan` as it has the same current-state info that `loan_status` had above: it indicates that the loan is in jeopardy and that the borrower has been placed on a payment plan.
#
# Examining the unique counts of `purpose` shows that the values are Lending Club-selected options; however, `title` and `desc` are arbitrary free text from the applicant.

# In[114]:

df.drop(['pymnt_plan', 'url', 'desc', 'title'], 1, inplace=True)


# Moving on to examine the next slice of columns:

# In[115]:

df.ix[:5, 17:25]


# ## Delinquency
# Let's examine the distribution of delinquency across all applicants:

# In[116]:

df.delinq_2yrs.value_counts()
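# A quick hedged aside (not in the original notebook): the distribution above is heavily skewed toward zero, and the share of applicants with no recent delinquencies at all is easy to confirm directly.

# In[ ]:

# Hypothetical summary: fraction of applicants with zero vs. at least one
# delinquency in the last two years.
print((df.delinq_2yrs == 0).mean())
print((df.delinq_2yrs >= 1).mean())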
# ## Earliest Credit Line
# `earliest_cr_line` is currently an `object` dtype; we don't want to convert it to a date, but rather to a scalar describing the length of time since the first line of credit. We are asserting that, with all other variables held equal (number of lines open, income, delinquencies, verified income), the longer you've had lines of credit the better.
#
# Conversely, an actuary might assess that we could end up with a "U-shaped" distribution: at the extreme end of years you are a higher risk, since you have a higher chance of mortality and thus a lower probability of repayment. Additionally, pre-retirement debtors are more likely to list job or health reasons as the primary cause of their bankruptcy.
#
# In a more advanced implementation we'd want to account for confounding variables: in certain applicant groups, the older you are (time since first credit line), the larger your earning power and thus the better the loan you might secure. However, this increase in salary might be negligible if discretionary spending, or the need for a loan, scales in some proportion to the salary.

# In[117]:

from datetime import datetime

df.earliest_cr_line = pd.to_datetime(df.earliest_cr_line)
dttoday = datetime.now().strftime('%Y-%m-%d')
# There is a better way to do this :)
df.earliest_cr_line = df.earliest_cr_line.apply(
    lambda x: (np.timedelta64((x - pd.Timestamp(dttoday)), 'D').astype(int)) / -365)
df.earliest_cr_line


# ## FICO Ranges
# The `fico_range_low` & `fico_range_high` scores on their own aren't as useful as a range and an average, so let's create those. It's unclear whether `initial_list_status` describes a pre- or post-determined interest period, so we'll drop it as well.

# In[118]:

df['fico_range'] = df.fico_range_low.astype('str') + '-' + df.fico_range_high.astype('str')
df['meanfico'] = (df.fico_range_low + df.fico_range_high) / 2
df.drop(['fico_range_low', 'fico_range_high', 'initial_list_status'], 1, inplace=True)


# Moving on to the next set of columns:

# In[119]:

df.ix[:5, 23:32]


# ## Revolving Utilization
# Let's do the same thing we did for the interest rate column: strip the '%' and convert to float.

# In[120]:

df.revol_util = pd.Series(df.revol_util).str.replace('%', '').astype(float)


# In[121]:

df.ix[:5, 23:32]


# ## More Post-Loan Attributes
# Similar to some attributes we encountered above, we have more columns to drop that describe the loan after it was granted, which we don't care about.

# In[122]:

df.drop(['out_prncp', 'out_prncp_inv', 'total_pymnt',
         'total_pymnt_inv', 'total_rec_prncp', 'grade', 'sub_grade'], 1, inplace=True)


# In[123]:

df.ix[:5, 23:32]


# In[124]:

df.drop(['total_rec_int', 'total_rec_late_fee',
         'recoveries', 'collection_recovery_fee'], 1, inplace=True)


# In[125]:

df.ix[:5, 23:32]


# In[126]:

df.drop(['last_pymnt_d', 'last_pymnt_amnt',
         'next_pymnt_d', 'last_credit_pull_d'], 1, inplace=True)


# In[127]:

df.ix[:5, 23:32]


# ## Last FICO Ranges
# The FICO attributes we handled earlier were the current ranges; these are the most recently pulled ranges, so let's treat them the same way.

# In[128]:

df['last_fico_range'] = df.last_fico_range_low.astype('str') + '-' + df.last_fico_range_high.astype('str')
df['last_meanfico'] = (df.last_fico_range_low + df.last_fico_range_high) / 2
df.drop(['last_fico_range_high', 'last_fico_range_low', 'policy_code'], 1, inplace=True)


# ## Wrapping Up the Data Munging
#
# Okay, so now we've got our data down to 61MB from 120MB. Imagine what we could do if we used more efficient data types! Let's take one last look at it and then move on to our modeling phase.

# In[129]:

print df.columns
print df.head(1).values
df.info()


# In[130]:

df.fillna(0.0, inplace=True)
df.fillna(0, inplace=True)
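# A hedged aside before moving on (not in the original notebook): it is worth knowing which columns the blanket zero-fill actually touches. Run this ahead of the `fillna` above to see the counts; run after it, the result should be empty. `null_counts` is an illustrative variable name.

# In[ ]:

# Hypothetical check: remaining NaN counts per column.
null_counts = df.isnull().sum()
print(null_counts[null_counts > 0].sort_values(ascending=False))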
# # 2. Our Model
# Staying true to the original intent of this blog (reproducing my popular old R post), I will use a gradient boosting regression model (more on this later), even if it would be easier to use a random forest and avoid dummy encoding of my categorical data. If you walk through the broken-out steps above, you'll find that we have:
#
# 1. Addressed missing data
# 2. Converted strings to numerical representations where possible
# 3. Dropped superfluous attributes
#
# **Note:** We have to pay special attention to how we replace missing values, since we can introduce bias if the data is not missing at random.
#
# The usual next steps are to:
# 1. Eliminate zero (or near-zero) variance predictors
# 2. Eliminate highly correlated predictors
#
# Due to all the categorical variables, we might want to use Kendall's correlation matrix. If our data were larger we might want to reduce its size via more advanced feature selection methods such as clustering (choosing a single representative of each cluster), PCA, ICA, and so on. Since the purpose of this experiment is explanatory rather than predictive, we wouldn't want to use PCA and then have to explain the covariates in terms of principal components.
#
# ## Highly Correlated Data
# Let's examine our DataFrame's correlation matrix and drop highly correlated/redundant data to address multicollinearity.

# In[131]:

cor = df.corr()
cor.loc[:, :] = np.tril(cor, k=-1)  # keep only the lower triangle, below the main diagonal
cor = cor.stack()
cor[(cor > 0.55) | (cor < -0.55)]


# We will drop the non-informative columns found above. Note that we missed a few columns in our munging process, such as `funded_amnt_inv`, which is the amount funded by investors at that point in time and is determined after the interest rate is set. We also missed `installment`, which is the monthly payment owed by the borrower if the loan originates.

# In[132]:

df.drop(['zip_code', 'funded_amnt', 'funded_amnt_inv', 'installment',
         'mths_since_last_delinq', 'total_acc'], axis=1, inplace=True)


# ## Feature Space and Labels
# Before we go any further, let's verify what we already intuitively know and make a quick (it wasn't really quick) plot to examine the relationship between interest rate and FICO score. We also can't finish this post without replicating the same image from the original post from long ago. Arguably it was a lot more intuitive to do in native ggplot back then.

# In[138]:

plot_df = df.query('last_meanfico > 600 & int_rate < 28')[:3000]

sns.set(font_scale=1.2, rc={"lines.linewidth": 1.5})
g = sns.lmplot("int_rate", "last_meanfico", x_jitter=.7, y_jitter=.1,
               data=plot_df, hue='term', lowess=True, size=5, aspect=1.4,
               legend_out=False, scatter_kws={'s': 20, 'alpha': .6})
g.set(xlim=(2.5, 28), ylim=(580, 880), alpha=.5)
g.savefig('1.png', transparent=True)


# In[139]:

y = df.int_rate.values
df.drop('int_rate', axis=1, inplace=True)


# In[38]:

np.unique(y), pd.Series(y).plot(kind='hist', alpha=.7, bins=20,
                                title='Interest Rate Distribution')


# We're about to blow up the feature space by dummy-encoding our categorical variables! This will make our DataFrame roughly 8x bigger in memory, going from 60MB to 470MB.

# In[39]:

df = pd.get_dummies(df)
df.info()
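# A small hedged check (not in the original notebook) to confirm the blow-up: column count and rough in-memory size of the dummy-encoded frame. The 8x figure quoted above is the original claim; this just shows one way you could verify it.

# In[ ]:

# Hypothetical check: shape and memory footprint after dummy encoding.
print(df.shape)
print('%.0f MB' % (df.memory_usage().sum() / 1024.0 ** 2))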
# ## Gradient Boosted Regression Trees
#
# Like the original post, we will use GBTs. The trade-offs are listed below:
#
# Advantages
#
# - **Heterogeneous data** - features measured on different scales, such as employment length vs. annual income
# - **Supports different loss functions** - we can pick a loss function that suits our problem, e.g. the robust Huber loss if our data contains a lot of outliers, or a loss specific to ranking problems
# - **Automatically detects (non-linear) feature interactions**
#
# Disadvantages
#
# - **Requires careful tuning** - RFs are faster to tune; they have essentially one parameter
# - **Slow to train** - but fast to predict
# - **Cannot extrapolate** - it is not possible to predict beyond the minimum and maximum limits of the response variable in the training data.

# In[40]:

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot as plt


# In[41]:

X, y = shuffle(df.values, y, random_state=30)
X = X.astype(np.float32)


# In[42]:

offset = int(X.shape[0] * 0.75)
X_train, y_train = X[:offset], y[:offset]
X_test, y_test = X[offset:], y[offset:]


# ## Training & Tuning
# We won't spend too much time tuning in this post, but in general GBTs give us three knobs we can tune against overfitting: (1) tree structure, (2) shrinkage, and (3) stochastic gradient boosting (subsampling). In the interest of time we'll do a simple grid search amongst a hand-chosen set of hyper-parameters.
#
# One of the most effective parameters to tune when working with a large feature set is `max_features`, as it introduces a notion of randomization similar to Random Forests. Playing with `max_features` allows us to subsample our feature space before finding the best split node: a `max_features` setting of 0.20, for example, would grow each tree on 20% of the feature set. By contrast, a `subsample` of 0.20 would use 20% of the training data (with all features). `subsample` interacts with the parameter `n_estimators`: choosing `subsample` < 1.0 leads to a reduction of variance and an increase in bias.
#
# ### Loss Function
#
# - Squared loss minimizes expectation. Pick it when learning the expected return on a stock.
#
# - Logistic loss minimizes probability. Pick it when learning the probability of a click on an advertisement.
#
# - Hinge loss minimizes the 0/1 (yes-no) question. Pick it when you want a hard prediction.
#
# - Quantile loss minimizes a quantile (the median by default). Pick it when you want to predict house prices.
#
# - Ensembling multiple loss functions is also an option.
#
# Adapted from: https://github.com/JohnLangford/vowpal_wabbit/wiki/Loss-functions

# In[ ]:

from sklearn.grid_search import GridSearchCV

param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]
              }

# param_grid = {'learning_rate': [0.1],
#               'max_depth': [4],
#               'min_samples_leaf': [3],
#               'max_features': [1.0],
#               }

est = GridSearchCV(ensemble.GradientBoostingRegressor(n_estimators=100),
                   param_grid, n_jobs=4, refit=True)

est.fit(X_train, y_train)

best_params = est.best_params_


# In[43]:

get_ipython().run_cell_magic('time', '', 'est = ensemble.GradientBoostingRegressor(n_estimators=2000).fit(X_train, y_train)\n')


# In[48]:

est.score(X_test, y_test)
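# `score` on a `GradientBoostingRegressor` returns R^2. As a hedged companion metric (not in the original notebook), RMSE expresses the error in interest-rate percentage points, which is easier to reason about. `rmse` is an illustrative variable; `mean_squared_error` was already imported above.

# In[ ]:

# Hypothetical: test-set RMSE in percentage points of interest rate.
rmse = np.sqrt(mean_squared_error(y_test, est.predict(X_test)))
print('Test RMSE: %.2f percentage points' % rmse)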
# ## Examining Results
# In terms of simple model accuracy we know we're doing pretty well with the above score; 3,000 iterations gets us into the .90s, while 1,000 gets us into the .60s. Let's examine how drastically accuracy increased with each iteration and see which model features were most powerful in predicting the issued interest rate.

# In[54]:

sns.set(font_scale=1, rc={"lines.linewidth": 1.2})
Iterations = 2000

# compute test set deviance
test_score = np.zeros((Iterations,), dtype=np.float64)
for i, y_pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, y_pred)

plt.figure(figsize=(14, 6)).subplots_adjust(wspace=.3)

plt.subplot(1, 2, 1)
plt.title('Deviance over iterations')
plt.plot(np.arange(Iterations) + 1, est.train_score_, 'dodgerblue',
         label='Training Set Deviance', alpha=.6)
plt.plot(np.arange(Iterations) + 1, test_score, 'firebrick',
         label='Test Set Deviance', alpha=.6)
plt.legend(loc='upper right')
plt.xlabel('Boosting Iterations')
plt.ylabel('Deviance')

plt.subplot(1, 2, 2)
# Top ten most important features
feature_importance = est.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
indices = np.argsort(feature_importance)[-10:]
plt.barh(np.arange(10), feature_importance[indices], color='dodgerblue', alpha=.4)
plt.yticks(np.arange(10) + 0.25, np.array(df.columns)[indices])
_ = plt.xlabel('Relative importance'), plt.title('Top Ten Important Variables')


# ### Partial Dependence Plots
# When working with tree models such as GBTs or Random Forests, I like to take a look at partial dependence plots to understand the functional relations between predictors and the outcome. These plots capture the marginal effect of a given variable or variables on the target function, in this case the interest rate.
#
# In the first plot, it is interesting to note that below $25k annual income, the debt-to-income ratio has a large effect on the interest rate. The margin gets significantly larger after $75k. The steepness of `meanfico` is especially interesting; it looks like no real distinction is made for customers with a score over 760.

# In[55]:

from sklearn.ensemble.partial_dependence import plot_partial_dependence

comp_features = [('annual_inc', 'dti'), 'loan_amnt', 'meanfico', 'annual_inc',
                 'inq_last_6mths', 'revol_util', 'dti']

fig, axs = plot_partial_dependence(est, X_train, comp_features,
                                   feature_names=list(df.columns),
                                   figsize=(14, 14), n_jobs=4)


# Trying out some alternate plots from `seaborn`.

# In[56]:

get_ipython().run_cell_magic('time', '', 'sns.jointplot(y, df.meanfico.values, annot_kws=dict(stat="r"),\n              kind="kde", color="#4CB391").set_axis_labels("int_rate", "mean_fico")\n')
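# As a final hedged sketch (not part of the original notebook), we can spot-check a few individual predictions against the interest rates actually issued to held-out loans.

# In[ ]:

# Hypothetical spot check: predicted vs. actual interest rate for a handful
# of held-out loans.
for pred, actual in zip(est.predict(X_test[:5]), y_test[:5]):
    print('predicted: %5.2f%%  actual: %5.2f%%' % (pred, actual))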