Define the problem and the approach
The data comes from Kaggle, where it was used in a 2011 competition:
http://www.kaggle.com/c/GiveMeSomeCredit
Predict the probability that somebody will experience financial distress in the next two years.
import pandas as pd
df = pd.read_csv("./data/credit-training.csv")
df.head()
 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
3 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300 | 5 | 0 | 0 | 0 | 0 |
4 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
df.SeriousDlqin2yrs.mean()
0.066839999999999997
Since we have a fairly large volume of data (150K records, of which about 6.7% are flagged as seriously delinquent), we're going to build a predictive model that automatically scores each applicant. Each applicant will receive a credit score, which gives us an easy-to-interpret, human-readable form of the model's output.
If we're building a model, we need a way to know whether or not it's working. Convincing others is often the most challenging part of an analysis. Repeatable, well-documented work with clear success metrics makes all the difference.
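To make the idea of a credit score concrete, here is a minimal sketch of one common way to turn a predicted probability of distress into a score: log-odds scaling with "points to double the odds". This is an illustration, not necessarily the scoring used later in this analysis. The function name `probability_to_score` and the parameters `base_score`, `base_odds`, and `pdo` are assumptions chosen for the example.

```python
import math

def probability_to_score(p_distress, base_score=600, base_odds=50, pdo=20):
    """Map a probability of distress to a score (hypothetical scaling).

    base_score: score assigned when good:bad odds equal base_odds (assumed).
    pdo: points added each time the good:bad odds double (assumed).
    """
    # Good:bad odds; a lower probability of distress means higher odds.
    odds = (1.0 - p_distress) / p_distress
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return offset + factor * math.log(odds)

# Odds of 50:1 map to the base score; doubling the odds adds `pdo` points.
score_at_base = probability_to_score(1 / 51)
score_at_double = probability_to_score(1 / 101)
```

With these assumed settings, a lower predicted probability of distress always produces a higher score, which is the property that makes the number easy to read.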
For our classifier, we're going to use the following build methodology:
from IPython.display import Image
Image(url="https://s3.amazonaws.com/demo-datasets/traintest.png")
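Assuming the diagram shows a hold-out split (as its filename suggests), the idea can be sketched as follows: set aside part of the data, fit the model on the rest, and judge it only on the part it never saw. The helper `split_train_test`, the 25% test fraction, and the toy rows are assumptions made for illustration.

```python
import pandas as pd

def split_train_test(frame, test_fraction=0.25, seed=42):
    """Randomly hold out a fraction of rows for testing (hypothetical helper).

    Assumes the frame has a unique index, as a freshly read CSV does.
    """
    test = frame.sample(frac=test_fraction, random_state=seed)
    train = frame.drop(test.index)  # every row not drawn for the test set
    return train, test

# Toy rows standing in for the real data, so the sketch runs on its own.
toy = pd.DataFrame({
    "SeriousDlqin2yrs": [0, 1, 0, 0, 0, 1, 0, 0],
    "age": [45, 40, 38, 30, 49, 52, 61, 27],
})
train, test = split_train_test(toy)
```

On the real data the call would be `split_train_test(df)`. Fixing the seed keeps the split repeatable, which matters when results need to be reproduced for others.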