The end of week four marked the halfway point of the structured curriculum portion of the program. The entire cohort is starting to get the drill down: learn about a topic in the morning, implement it by hand to see the nuts and bolts, and then finish up with the sklearn version to validate our results. The supervised learning we covered this week is all about classifying data into a category, most often a 0 or a 1. For example, if you have data about past high school students and you know whether they got into college, you can train a model to predict whether a current high school student will get in.
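To make the college example concrete, here is a minimal hand-rolled sketch. It assumes made-up student records (a GPA and a test score) and a 1-nearest-neighbor rule; neither the data nor the choice of model comes from the course material, and the names `past_students` and `predict_admission` are hypothetical.

```python
import math

# Hypothetical training data: (GPA, test score out of 1600) -> admitted (1) or not (0).
past_students = [
    ((3.9, 1450), 1),
    ((3.6, 1300), 1),
    ((2.4, 950), 0),
    ((2.9, 1020), 0),
]


def scale(features):
    """Put GPA and test score on roughly the same 0-1 scale."""
    gpa, score = features
    return (gpa / 4.0, score / 1600.0)


def predict_admission(features, training_data=past_students):
    """Return the label of the closest past student (1-nearest-neighbor)."""
    target = scale(features)
    _, nearest_label = min(
        training_data,
        key=lambda record: math.dist(scale(record[0]), target),
    )
    return nearest_label


# Predict a 0 or a 1 for a current student whose outcome we do not know yet.
current_student = (3.7, 1380)
prediction = predict_admission(current_student)
```

The labeled past students play the role of the training set, and the current student is the unseen example the model has to classify.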
Topics of the week:
For our code sample this week we are going to use Random Forests on a cell phone data set. We are going to try to predict whether customers will churn based on their cell usage statistics. Let's import our packages and see what our data looks like.
import pandas as pd
import matplotlib.pyplot as plt
from roc import plot_roc  # helper module, not part of scikit-learn
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
%matplotlib inline
df = pd.read_csv('data/churn.csv')
# Convert the yes/no plan columns to booleans
df["Int'l Plan"] = df["Int'l Plan"] == 'yes'
df["VMail Plan"] = df["VMail Plan"] == 'yes'
# Anything other than the string 'False.' counts as churn
df["Churn?"] = df["Churn?"] != 'False.'
# Drop identifier-like columns that we do not use as features
df = df.drop(['State', 'Area Code', 'Phone'], axis=1)
df.head()
| | Account Length | Int'l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 128 | False | True | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
1 | 107 | False | True | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
2 | 137 | False | False | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
3 | 84 | True | False | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
4 | 75 | True | False | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
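# Separate the target column from the features
y = df.pop('Churn?').values
X = df.values
# Hold out a test set (25% of the rows by default)
X_train, X_test, y_train, y_test = train_test_split(X, y)
# oob_score=True also scores the forest on each tree's out-of-bag samples (stored in rf.oob_score_)
rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
y_predicted = rf.predict(X_test)
# Rows are actual labels, columns are predicted labels, both ordered [True, False]
cf = confusion_matrix(y_test, y_predicted, labels=[True, False])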
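With `labels=[True, False]`, the first row and column of the confusion matrix correspond to churners, so the matrix is laid out as `[[TP, FN], [FP, TN]]`. The sketch below shows one way to turn such a matrix into accuracy, precision, and recall. The counts are made up for illustration and are not results from this data set, and the function name `summarize_confusion` is hypothetical.

```python
def summarize_confusion(cf):
    """Compute basic metrics from a 2x2 matrix laid out as [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, tn) = cf
    total = tp + fn + fp + tn
    return {
        # Share of all customers classified correctly
        "accuracy": (tp + tn) / total,
        # Of the customers predicted to churn, the share that actually churned
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the customers who actually churned, the share the model caught
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for illustration only; not results from the churn data.
example_cf = [[80, 40], [10, 700]]
metrics = summarize_confusion(example_cf)
```

The same function should accept the `cf` array computed above, since a 2x2 NumPy array unpacks row by row just like a nested list. When churners are a small share of customers, recall and precision on the churn class tend to say more than accuracy alone.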