Feature selection, apply, and groupby
import pandas as pd
import numpy as np
import pylab as pl
df = pd.read_csv("./data/credit-data-trainingset.csv")
df.head()
| | serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
| 1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0.213179 | 74 | 0 | 0.375607 | 3500 | 3 | 0 | 1 | 0 | 1 |
We're going to let scikit-learn help us determine which variables are the best at predicting risk. To do this, we're going to use an algorithm called RandomForest. RandomForest randomly generates a "forest" of decision trees. As the trees are randomly generated, the algorithm takes turns leaving out each variable while fitting the model. This allows the RandomForest to calculate just how much worse a model does when each variable is left out.
from sklearn.ensemble import RandomForestClassifier
features = np.array(['revolving_utilization_of_unsecured_lines',
'age', 'number_of_time30-59_days_past_due_not_worse',
'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans',
'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
pl.barh(padding, importances[sorted_idx], align='center')
pl.yticks(padding, features[sorted_idx])
pl.xlabel("Relative Importance")
pl.title("Variable Importance")
pl.show()
So you can see that the best variable is revolving_utilization_of_unsecured_lines, while the worst is number_real_estate_loans_or_lines. There's also a dramatic drop-off after number_of_open_credit_lines_and_loans. This is where you need to use your own discretion: how many variables should you include in your model?
Feature selection and engineering will likely have the biggest impact on the success or failure of your model. Even if you're using the latest and greatest algorithm, feeding it unimportant features will give you poor results. Remember, it's math, not magic.
Feature engineering is a skill that will take time to get the hang of. Sometimes the best way is to just talk to people. Ask questions, brainstorm with others, etc. Oftentimes 2 features might not be helpful when used individually, but when combined they can be extremely powerful.
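As a purely hypothetical illustration of combining columns (income_per_dependent is not part of the original dataset, just a sketch of the idea), an engineered feature might look like this:

# rough spending power per household member; the +1 avoids dividing by zero
# for customers with no dependents
df['income_per_dependent'] = df.monthly_income / (df.number_of_dependents + 1)
df.income_per_dependent.describe()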
df['income_bins'] = pd.cut(df.monthly_income, bins=15)
pd.value_counts(df['income_bins'])
# not very helpful
(-3008.75, 200583.333]        112392
(200583.333, 401166.667]          11
(601750, 802333.333]               5
(401166.667, 601750]               4
(1404083.333, 1604666.667]         1
(802333.333, 1002916.667]          1
(2808166.667, 3008750]             1
dtype: int64
def cap_values(x, cap):
if x > cap:
return cap
else:
return x
df.monthly_income = df.monthly_income.apply(lambda x: cap_values(x, 15000))
df.monthly_income.describe()
count    112415.000000
mean       5916.167344
std        3644.715884
min           0.000000
25%        3235.000000
50%        5200.000000
75%        8000.000000
max       15000.000000
dtype: float64
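As an aside, pandas has a built-in way to do the same capping without apply. Series.clip takes an upper bound directly (older pandas versions exposed this as clip_upper instead), so treat this as a sketch:

# equivalent to the cap_values/apply approach above: cap every value at 15,000
df.monthly_income = df.monthly_income.clip(upper=15000)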
df['income_bins'] = pd.cut(df.monthly_income, bins=15, labels=False)
pd.value_counts(df.income_bins)
3     14614
4     14412
2     13192
5     12034
6     10713
1      8042
7      7839
8      5743
9      5597
14     5439
0      4465
10     4307
11     2536
12     1959
13     1523
dtype: int64
df[["income_bins", "serious_dlqin2yrs"]].groupby("income_bins").mean()
income_bins | serious_dlqin2yrs |
---|---|
0 | 0.047256 |
1 | 0.098980 |
2 | 0.092480 |
3 | 0.079239 |
4 | 0.068485 |
5 | 0.067309 |
6 | 0.057220 |
7 | 0.056767 |
8 | 0.053108 |
9 | 0.048061 |
10 | 0.042721 |
11 | 0.041009 |
12 | 0.040327 |
13 | 0.042679 |
14 | 0.048538 |
cols = ["income_bins", "serious_dlqin2yrs"]
df[cols].groupby("income_bins").mean().plot()
<matplotlib.axes.AxesSubplot at 0x10bf4cb50>
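If you also want to see how many customers fall into each bin (a rate computed from only a handful of rows is much noisier), groupby can apply several aggregations at once. A quick sketch, using nothing beyond the columns already in the frame:

# delinquency rate and customer count for each income bin
df.groupby("income_bins")['serious_dlqin2yrs'].agg(['mean', 'count'])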
cols = ['age', 'serious_dlqin2yrs']
age_means = df[cols].groupby("age").mean()
age_means.plot()
<matplotlib.axes.AxesSubplot at 0x10bf70350>
Now bin age into 14 different groups. Then make a frequency table that shows the number of customers that were/were not delinquent for each bin. HINT: You might want to have larger bins near the min/max values to account for outliers.
# bin edges at 0, then every 5 years from 20 to 75, with a wide final bin up to 120
mybins = [0] + range(20, 80, 5) + [120]
df['age_bucket'] = pd.cut(df.age, bins=mybins)
pd.value_counts(df['age_bucket'])
(45, 50]     14112
(50, 55]     13390
(55, 60]     12629
(60, 65]     12317
(40, 45]     12053
(35, 40]     10241
(65, 70]      8315
(30, 35]      8123
(75, 120]     7581
(25, 30]      5803
(70, 75]      5600
(20, 25]      2250
dtype: int64
df[["age_bucket", "serious_dlqin2yrs"]].groupby("age_bucket").mean()
age_bucket | serious_dlqin2yrs |
---|---|
(20, 25] | 0.109778 |
(25, 30] | 0.116319 |
(30, 35] | 0.108211 |
(35, 40] | 0.088956 |
(40, 45] | 0.085124 |
(45, 50] | 0.080995 |
(50, 55] | 0.072890 |
(55, 60] | 0.050598 |
(60, 65] | 0.039864 |
(65, 70] | 0.026458 |
(70, 75] | 0.026607 |
(75, 120] | 0.020314 |
df[["age_bucket", "serious_dlqin2yrs"]].groupby("age_bucket").mean().plot()
<matplotlib.axes.AxesSubplot at 0x1078df650>
# replace the interval labels with integer codes so the column can be fed to scikit-learn
labels, levels = pd.factorize(df.age_bucket)
df.age_bucket = labels
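pd.factorize returns two pieces: the integer codes and the unique levels they came from. If you want to translate a code back to its original age interval later on, a small sketch (code_to_bucket is a hypothetical helper, not part of the original notebook):

# lookup table from integer code back to the original age interval
code_to_bucket = dict(enumerate(levels))
code_to_bucket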
Next, try binning debt_ratio using its quantiles. Hint: use the quantile method for Series.
# use the 20th/40th/60th/80th/100th percentiles of debt_ratio as bin edges
# (note: values below the 20th percentile fall outside these edges and end up as NaN)
bins = []
for q in [0.2, 0.4, 0.6, 0.8, 1.0]:
    bins.append(df.debt_ratio.quantile(q))
debt_ratio_binned = pd.cut(df.debt_ratio, bins=bins)
print pd.value_counts(debt_ratio_binned)
(0.467, 3.838]       22483
(3.838, 307001]      22483
(0.134, 0.287]       22483
(0.287, 0.467]       22483
dtype: int64
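For what it's worth, pandas can also do quantile binning in a single call with pd.qcut, so the loop above isn't strictly necessary. A sketch; note that qcut keeps the bottom group (values below the 20th percentile), so its counts won't match the output above exactly:

# split debt_ratio into 5 equal-sized (quintile) bins
debt_ratio_quintiles = pd.qcut(df.debt_ratio, 5)
pd.value_counts(debt_ratio_quintiles)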
Some algorithms will work better if your data is centered around 0. The StandardScaler module in scikit-learn makes it very easy to quickly scale columns in your data frame.
from sklearn.preprocessing import StandardScaler
df['monthly_income_scaled'] = StandardScaler().fit_transform(df.monthly_income)
print df.monthly_income_scaled.describe()
print
print "Mean at 0?", round(df.monthly_income_scaled.mean(), 10)==0
pl.hist(df.monthly_income_scaled)
count    1.124150e+05
mean     1.023365e-15
std      1.000004e+00
min     -1.623225e+00
25%     -7.356346e-01
50%     -1.964956e-01
75%      5.717433e-01
max      2.492341e+00
dtype: float64

Mean at 0? True
(array([ 8669., 15715., 22969., 17668., 15316., 10149., 7454., 5079., 2939., 6457.]), array([-1.62322492, -1.21166838, -0.80011184, -0.38855529, 0.02300125, 0.4345578 , 0.84611434, 1.25767088, 1.66922743, 2.08078397, 2.49234051]), <a list of 10 Patch objects>)
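If you end up scaling more than one column, the same scaler can handle them in one call; fit_transform on a two-column frame returns a numpy array with one scaled column per input. A sketch (debt_ratio is just an arbitrary second column here):

# scale two columns at once; each column of the result has mean ~0 and std ~1
scaled = StandardScaler().fit_transform(df[['monthly_income', 'debt_ratio']])
scaled[:5]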
features = np.array(['revolving_utilization_of_unsecured_lines',
'age', 'number_of_time30-59_days_past_due_not_worse',
'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans',
'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents',
'income_bins', 'age_bucket', 'monthly_income_scaled'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
pl.barh(padding, importances[sorted_idx], align='center')
pl.yticks(padding, features[sorted_idx])
pl.xlabel("Relative Importance")
pl.title("Variable Importance")
pl.show()
best_features = features[sorted_idx][::-1]
best_features
array(['revolving_utilization_of_unsecured_lines', 'debt_ratio', 'monthly_income', 'number_of_times90_days_late', 'age', 'monthly_income_scaled', 'number_of_open_credit_lines_and_loans', 'number_of_time30-59_days_past_due_not_worse', 'number_of_time60-89_days_past_due_not_worse', 'age_bucket', 'number_of_dependents', 'number_real_estate_loans_or_lines', 'income_bins'], dtype='|S43')
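One way to start answering the "how many variables?" question from earlier is to refit on just the top few and compare performance against the full model. A sketch using the five highest-ranked features (the cutoff of five is arbitrary):

# refit using only the five most important features
top_features = best_features[:5]
clf = RandomForestClassifier()
clf.fit(df[top_features], df['serious_dlqin2yrs'])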
Partner with the person sitting next to you and see if you can come up with some new features that outperform the basic variables.
Things you might try:
- pd.get_dummies
- other transformations from pandas and scikit-learn
- the scikit-learn preprocessing module