Week 4: Supervised Learning

The end of week four marked the halfway point of the structured curriculum portion of the program. The entire cohort is starting to get the drill down: learn about a topic in the morning, implement it by hand to see the nuts and bolts, and then finish up with the sklearn version to validate our results. The supervised learning we covered this week is all about classifying data into a category, most often a 0 or a 1. For example, if you have data about past high school students and you know whether each one got into college, you can train a model on that data to predict whether a current high school student will get into college.

Topics of the week:

kNN
Decision Trees
Entropy/Information Gain/Gini Impurity
Random Forest
Bagging/Boosting/Testing with out-of-bag observations
Maximum Margin/Support Vector Classifier/SVM/Tuning with Kernels
Gradient Boosting/AdaBoost
Profit Curves
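
In the spirit of doing things by hand first, here is a minimal sketch of the impurity measures from the list above. The function names and the toy labels are made up for illustration, and it assumes every list of labels is non-empty. It is a sketch of the textbook formulas, not what sklearn does internally.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of the (non-empty) label list that falls in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * log2(p) for p in class_proportions(labels))


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in class_proportions(labels))


def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy labels, invented for this example: True means the customer churned.
parent = [True, True, True, False, False, False, False, False]
left = [True, True, True, False]      # one side of a hypothetical split
right = [False, False, False, False]  # the other side, which is pure

print("parent entropy:", entropy(parent))
print("parent gini:", gini(parent))
print("information gain:", information_gain(parent, left, right))
print("gini gain:", information_gain(parent, left, right, impurity=gini))
```

A decision tree picks, at each node, the split with the largest gain, so a split that produces a pure child like `right` scores well under either measure.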

For our code sample this week we are going to use Random Forests on a cell phone data set. We are going to try to predict whether customers will churn based on their cell usage statistics. Let's import our packages and see what our data looks like.

In [11]:

import pandas as pd
import matplotlib.pyplot as plt
from roc import plot_roc
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
%matplotlib inline

In [7]:

df = pd.read_csv('data/churn.csv')

We're going to be working with cell phone data, so let's check out our features and clean them up a bit. Some columns hold "yes"/"no" strings, and we want to change those to booleans, which behave as 0s and 1s.

In [8]:

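# Convert the "yes"/"no" plan columns to booleans
df["Int'l Plan"] = df["Int'l Plan"].apply(lambda word: word == 'yes')
df["VMail Plan"] = df["VMail Plan"].apply(lambda word: word == 'yes')
# The target is stored as the strings 'True.' / 'False.'; anything but 'False.' counts as churn
df["Churn?"] = df["Churn?"].apply(lambda word: word != 'False.')
# Drop identifier-like columns that we don't want to feed to the model
df = df.drop(['State', 'Area Code', 'Phone'], axis=1)
df.head()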

Out[8]:

	Account Length	Int'l Plan	VMail Plan	VMail Message	Day Mins	Day Calls	Day Charge	Eve Mins	Eve Calls	Eve Charge	Night Mins	Night Calls	Night Charge	Intl Mins	Intl Calls	Intl Charge	CustServ Calls	Churn?
0	128	False	True	25	265.1	110	45.07	197.4	99	16.78	244.7	91	11.01	10.0	3	2.70	1	False
1	107	False	True	26	161.6	123	27.47	195.5	103	16.62	254.4	103	11.45	13.7	3	3.70	1	False
2	137	False	False	0	243.4	114	41.38	121.2	110	10.30	162.6	104	7.32	12.2	5	3.29	0	False
3	84	True	False	0	299.4	71	50.90	61.9	88	5.26	196.9	89	8.86	6.6	7	1.78	2	False
4	75	True	False	0	166.7	113	28.34	148.3	122	12.61	186.9	121	8.41	10.1	3	2.73	3	False

Next, let's separate our features (X) from our target (y) so we can make our train/test split.

In [9]:

y = df.pop('Churn?').values
X = df.values
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [13]:

rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
y_predicted = rf.predict(X_test)
cf = confusion_matrix(y_test, y_predicted, labels=[True, False])
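
To make sense of what `cf` holds, here is a small by-hand sketch of the same layout. Because we passed `labels=[True, False]`, the rows are the actual classes and the columns are the predicted classes, both ordered True then False, so the matrix reads `[[TP, FN], [FP, TN]]`. The helper names and the toy labels below are invented for illustration and are not the churn results.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes in the layout sklearn uses with labels=[True, False]:
    rows are actual classes, columns are predicted classes."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1  # churned, and we predicted churn
        elif actual:
            fn += 1  # churned, but we predicted no churn
        elif predicted:
            fp += 1  # did not churn, but we predicted churn
        else:
            tn += 1  # did not churn, and we predicted no churn
    return [[tp, fn], [fp, tn]]


def precision_recall(matrix):
    """Precision and recall for the True class (0.0 when undefined)."""
    (tp, fn), (fp, tn) = matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, made up for this example.
y_true_demo = [True, True, True, False, False, False, False, False]
y_pred_demo = [True, True, False, True, False, False, False, False]

matrix = confusion_counts(y_true_demo, y_pred_demo)
precision, recall = precision_recall(matrix)
print(matrix)
print("precision:", precision, "recall:", recall)
```

For churn, recall on the True class is usually the number to watch, since a missed churner (a false negative) is a customer we never got the chance to retain. Also, since the forest was built with `oob_score=True`, `rf.oob_score_` holds an out-of-bag accuracy estimate once `fit` has run.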
