Feature selection, apply, and groupby
import pandas as pd
import numpy as np
import pylab as pl
df = pd.read_csv("./data/credit-data-trainingset.csv")
df.head()
| | serious_dlqin2yrs | revolving_utilization_of_unsecured_lines | age | number_of_time30-59_days_past_due_not_worse | debt_ratio | monthly_income | number_of_open_credit_lines_and_loans | number_of_times90_days_late | number_real_estate_loans_or_lines | number_of_time60-89_days_past_due_not_worse | number_of_dependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120 | 13 | 0 | 6 | 0 | 2 |
| 1 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600 | 4 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042 | 2 | 1 | 0 | 0 | 0 |
| 3 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588 | 7 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0.213179 | 74 | 0 | 0.375607 | 3500 | 3 | 0 | 1 | 0 | 1 |
We're going to let scikit-learn help us determine which variables are the best at predicting risk. To do this, we're going to use an algorithm called RandomForest. RandomForest randomly generates a "forest" of decision trees. As the trees are randomly generated, the algorithm takes turns leaving out each variable while fitting the model. This allows the RandomForest to calculate just how much worse a model does when each variable is left out.
from sklearn.ensemble import RandomForestClassifier
features = np.array(['revolving_utilization_of_unsecured_lines',
'age', 'number_of_time30-59_days_past_due_not_worse',
'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans',
'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0)
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
pl.barh(padding, importances[sorted_idx], align='center')
pl.yticks(padding, features[sorted_idx])
pl.xlabel("Relative Importance")
pl.title("Variable Importance")
pl.show()
So you can see that the best variable is revolving_utilization_of_unsecured_lines, while the worst is number_real_estate_loans_or_lines. There's also a dramatic drop-off after number_of_open_credit_lines_and_loans. This is where you need to use your own discretion: how many variables should you include in your model?
Feature selection and engineering will likely have the biggest impact on the success or failure of your model. Even if you're using the latest and greatest algorithm, feeding it unimportant features will give you poor results. Remember, it's math, not magic.
Feature engineering is a skill that will take time to get the hang of. Sometimes the best way is to just talk to people. Ask questions, brainstorm with others, etc. Oftentimes 2 features might not be helpful when used individually, but when combined they can be extremely powerful.
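As a purely hypothetical illustration of combining columns (income_per_dependent is not part of the original dataset, just a sketch of the idea), an engineered feature might look like this:

# rough spending power per household member; the +1 avoids dividing by zero
# for customers with no dependents
df['income_per_dependent'] = df.monthly_income / (df.number_of_dependents + 1)
df.income_per_dependent.describe()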
df['income_bins'] = pd.cut(df.monthly_income, bins=15)
pd.value_counts(df['income_bins'])
# not very helpful
(-3008.75, 200583.333]        112392
(200583.333, 401166.667]          11
(601750, 802333.333]               5
(401166.667, 601750]               4
(1404083.333, 1604666.667]         1
(802333.333, 1002916.667]          1
(2808166.667, 3008750]             1
dtype: int64
def cap_values(x, cap):
if x > cap:
return cap
else:
return x
df.monthly_income = df.monthly_income.apply(lambda x: cap_values(x, 15000))
df.monthly_income.describe()
count    112415.000000
mean       5916.167344
std        3644.715884
min           0.000000
25%        3235.000000
50%        5200.000000
75%        8000.000000
max       15000.000000
dtype: float64
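As an aside, pandas has a built-in way to do the same capping without apply. Series.clip takes an upper bound directly (older pandas versions exposed this as clip_upper instead), so treat this as a sketch:

# equivalent to the cap_values/apply approach above: cap every value at 15,000
df.monthly_income = df.monthly_income.clip(upper=15000)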
df['income_bins'] = pd.cut(df.monthly_income, bins=15, labels=False)
pd.value_counts(df.income_bins)
3     14614
4     14412
2     13192
5     12034
6     10713
1      8042
7      7839
8      5743
9      5597
14     5439
0      4465
10     4307
11     2536
12     1959
13     1523
dtype: int64
df[["income_bins", "serious_dlqin2yrs"]].groupby("income_bins").mean()
income_bins | serious_dlqin2yrs |
---|---|
0 | 0.047256 |
1 | 0.098980 |
2 | 0.092480 |
3 | 0.079239 |
4 | 0.068485 |
5 | 0.067309 |
6 | 0.057220 |
7 | 0.056767 |
8 | 0.053108 |
9 | 0.048061 |
10 | 0.042721 |
11 | 0.041009 |
12 | 0.040327 |
13 | 0.042679 |
14 | 0.048538 |
cols = ["income_bins", "serious_dlqin2yrs"]
df[cols].groupby("income_bins").mean().plot()
<matplotlib.axes.AxesSubplot at 0x10bf4cb50>
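If you also want to see how many customers fall into each bin (a rate computed from only a handful of rows is much noisier), groupby can apply several aggregations at once. A quick sketch, using nothing beyond the columns already in the frame:

# delinquency rate and customer count for each income bin
df.groupby("income_bins")['serious_dlqin2yrs'].agg(['mean', 'count'])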
cols = ['age', 'serious_dlqin2yrs']
age_means = df[cols].groupby("age").mean()
age_means.plot()
<matplotlib.axes.AxesSubplot at 0x10bf70350>
Now bin age into 14 different groups. Then make a frequency table that shows the number of customers that were/were not delinquent for each bin. HINT: You might want to have larger bins near the min/max values to account for outliers.
# bin edges at 0, then every 5 years from 20 to 75, with a wide final bin up to 120
mybins = [0] + range(20, 80, 5) + [120]
df['age_bucket'] = pd.cut(df.age, bins=mybins)
pd.value_counts(df['age_bucket'])
(45, 50]     14112
(50, 55]     13390
(55, 60]     12629
(60, 65]     12317
(40, 45]     12053
(35, 40]     10241
(65, 70]      8315
(30, 35]      8123
(75, 120]     7581
(25, 30]      5803
(70, 75]      5600
(20, 25]      2250
dtype: int64
df[["age_bucket", "serious_dlqin2yrs"]].groupby("age_bucket").mean()
age_bucket | serious_dlqin2yrs |
---|---|
(20, 25] | 0.109778 |
(25, 30] | 0.116319 |
(30, 35] | 0.108211 |
(35, 40] | 0.088956 |
(40, 45] | 0.085124 |
(45, 50] | 0.080995 |
(50, 55] | 0.072890 |
(55, 60] | 0.050598 |
(60, 65] | 0.039864 |
(65, 70] | 0.026458 |
(70, 75] | 0.026607 |
(75, 120] | 0.020314 |
df[["age_bucket", "serious_dlqin2yrs"]].groupby("age_bucket").mean().plot()
<matplotlib.axes.AxesSubplot at 0x1078df650>
# replace the interval labels with integer codes so the column can be fed to scikit-learn
labels, levels = pd.factorize(df.age_bucket)
df.age_bucket = labels
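pd.factorize returns two pieces: the integer codes and the unique levels they came from. If you want to translate a code back to its original age interval later on, a small sketch (code_to_bucket is a hypothetical helper, not part of the original notebook):

# lookup table from integer code back to the original age interval
code_to_bucket = dict(enumerate(levels))
code_to_bucket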
Next, try binning debt_ratio using its quantiles. Hint: use the quantile method for Series.
# use the 20th/40th/60th/80th/100th percentiles of debt_ratio as bin edges
# (note: values below the 20th percentile fall outside these edges and end up as NaN)
bins = []
for q in [0.2, 0.4, 0.6, 0.8, 1.0]:
    bins.append(df.debt_ratio.quantile(q))
debt_ratio_binned = pd.cut(df.debt_ratio, bins=bins)
print pd.value_counts(debt_ratio_binned)
(0.467, 3.838]       22483
(3.838, 307001]      22483
(0.134, 0.287]       22483
(0.287, 0.467]       22483
dtype: int64
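For what it's worth, pandas can also do quantile binning in a single call with pd.qcut, so the loop above isn't strictly necessary. A sketch; note that qcut keeps the bottom group (values below the 20th percentile), so its counts won't match the output above exactly:

# split debt_ratio into 5 equal-sized (quintile) bins
debt_ratio_quintiles = pd.qcut(df.debt_ratio, 5)
pd.value_counts(debt_ratio_quintiles)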
Some algorithms will work better if your data is centered around 0. The StandardScaler module in scikit-learn makes it very easy to quickly scale columns in your data frame.
from sklearn.preprocessing import StandardScaler
df['monthly_income_scaled'] = StandardScaler().fit_transform(df.monthly_income)
print df.monthly_income_scaled.describe()
print
print "Mean at 0?", round(df.monthly_income_scaled.mean(), 10)==0
pl.hist(df.monthly_income_scaled)
count    1.124150e+05
mean     1.023365e-15
std      1.000004e+00
min     -1.623225e+00
25%     -7.356346e-01
50%     -1.964956e-01
75%      5.717433e-01
max      2.492341e+00
dtype: float64

Mean at 0? True
(array([ 8669., 15715., 22969., 17668., 15316., 10149., 7454., 5079., 2939., 6457.]), array([-1.62322492, -1.21166838, -0.80011184, -0.38855529, 0.02300125, 0.4345578 , 0.84611434, 1.25767088, 1.66922743, 2.08078397, 2.49234051]), <a list of 10 Patch objects>)
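If you end up scaling more than one column, the same scaler can handle them in one call; fit_transform on a two-column frame returns a numpy array with one scaled column per input. A sketch (debt_ratio is just an arbitrary second column here):

# scale two columns at once; each column of the result has mean ~0 and std ~1
scaled = StandardScaler().fit_transform(df[['monthly_income', 'debt_ratio']])
scaled[:5]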
features = np.array(['revolving_utilization_of_unsecured_lines',
'age', 'number_of_time30-59_days_past_due_not_worse',
'debt_ratio', 'monthly_income','number_of_open_credit_lines_and_loans',
'number_of_times90_days_late', 'number_real_estate_loans_or_lines',
'number_of_time60-89_days_past_due_not_worse', 'number_of_dependents',
'income_bins', 'age_bucket', 'monthly_income_scaled'])
clf = RandomForestClassifier()
clf.fit(df[features], df['serious_dlqin2yrs'])
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
pl.barh(padding, importances[sorted_idx], align='center')
pl.yticks(padding, features[sorted_idx])
pl.xlabel("Relative Importance")
pl.title("Variable Importance")
pl.show()
best_features = features[sorted_idx][::-1]
best_features
array(['revolving_utilization_of_unsecured_lines', 'debt_ratio', 'monthly_income', 'number_of_times90_days_late', 'age', 'monthly_income_scaled', 'number_of_open_credit_lines_and_loans', 'number_of_time30-59_days_past_due_not_worse', 'number_of_time60-89_days_past_due_not_worse', 'age_bucket', 'number_of_dependents', 'number_real_estate_loans_or_lines', 'income_bins'], dtype='|S43')
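One way to start answering the "how many variables?" question from earlier is to refit on just the top few and compare performance against the full model. A sketch using the five highest-ranked features (the cutoff of five is arbitrary):

# refit using only the five most important features
top_features = best_features[:5]
clf = RandomForestClassifier()
clf.fit(df[top_features], df['serious_dlqin2yrs'])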
Partner with the person sitting next to you and see if you can come up with some new features that outperform the basic variables.
Things you might try:
- pd.get_dummies
- other transformations from pandas and scikit-learn
- the scikit-learn preprocessing module