In this notebook we will have a look at the census income data from 1996. The dataset can be found is part of the UC Irvines ML repository. Specifically the data can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/adult/.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
matplotlib.style.use('ggplot')
matplotlib.rc_params_from_file("../styles/matplotlibrc" ).update()
The data is a set of csv files that have been split into a training set and test set already. To support easy read in to a pandas DF we preprocess the files by adding a header that we get from the datasets metadata and adding an index column
import csv
def indexAndAnnotateDataSet(filename):
outname = filename.replace('.csv', '_clean.csv')
with open(outname, 'wb') as csvfile:
f = open(filename,'r')
filewriter = csv.writer(csvfile, delimiter=',')
columns = ['', 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']
filewriter.writerow(columns)
idx = 0
for line in f:
lst = []
lst.append(str(idx))
for item in line.strip('.').split(','):
lst.append(item.strip().strip('.'))
filewriter = csv.writer(csvfile, delimiter=',')
filewriter.writerow(lst)
idx +=1
f.close()
return outname
Since the datasets are fairly small we download them here
import os
import urllib
url_train_dataset = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
urllib.urlretrieve(url_train_dataset, 'adult.csv')
url_test_dataset = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
urllib.urlretrieve(url_test_dataset, 'adult_test.csv')
clean_DS_1 = indexAndAnnotateDataSet(os.getcwd() +'/adult.csv')
clean_DS_2 = indexAndAnnotateDataSet(os.getcwd() +'/adult_test.csv')
We will use scikit-learn's built-in cross-validation methodes. Hence, we join the training and test data into a larger data frame to increase the size of the dataset. Moreover, we will drop rows that contain na
feature and the column called 'fnlwgt' as this is a control value introduced by the original authors of the dataset.
Inhaling the csv files and appending the data frames we get:
raw_data_df = pd.DataFrame.from_csv(clean_DS_1).dropna().drop('fnlwgt',1)
raw_test_df = pd.DataFrame.from_csv(clean_DS_2).dropna().drop('fnlwgt',1)
raw_data_df
age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
5 | 37 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
6 | 49 | Private | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
7 | 52 | Self-emp-not-inc | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
8 | 31 | Private | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
9 | 42 | Private | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
10 | 37 | Private | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |
11 | 30 | State-gov | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | India | >50K |
12 | 23 | Private | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
13 | 32 | Private | Assoc-acdm | 12 | Never-married | Sales | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K |
14 | 40 | Private | Assoc-voc | 11 | Married-civ-spouse | Craft-repair | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | ? | >50K |
15 | 34 | Private | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 45 | Mexico | <=50K |
16 | 25 | Self-emp-not-inc | HS-grad | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 35 | United-States | <=50K |
17 | 32 | Private | HS-grad | 9 | Never-married | Machine-op-inspct | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K |
18 | 38 | Private | 11th | 7 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
19 | 43 | Self-emp-not-inc | Masters | 14 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 45 | United-States | >50K |
20 | 40 | Private | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
21 | 54 | Private | HS-grad | 9 | Separated | Other-service | Unmarried | Black | Female | 0 | 0 | 20 | United-States | <=50K |
22 | 35 | Federal-gov | 9th | 5 | Married-civ-spouse | Farming-fishing | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
23 | 43 | Private | 11th | 7 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 2042 | 40 | United-States | <=50K |
24 | 59 | Private | HS-grad | 9 | Divorced | Tech-support | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
25 | 56 | Local-gov | Bachelors | 13 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
26 | 19 | Private | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |
27 | 54 | ? | Some-college | 10 | Married-civ-spouse | ? | Husband | Asian-Pac-Islander | Male | 0 | 0 | 60 | South | >50K |
28 | 39 | Private | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 80 | United-States | <=50K |
29 | 49 | Private | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32531 | 30 | ? | Bachelors | 13 | Never-married | ? | Not-in-family | Asian-Pac-Islander | Female | 0 | 0 | 99 | United-States | <=50K |
32532 | 34 | Private | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |
32533 | 54 | Private | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | Male | 0 | 0 | 50 | Japan | >50K |
32534 | 37 | Private | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 39 | United-States | <=50K |
32535 | 22 | Private | 12th | 8 | Never-married | Protective-serv | Own-child | Black | Male | 0 | 0 | 35 | United-States | <=50K |
32536 | 34 | Private | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 55 | United-States | >50K |
32537 | 30 | Private | HS-grad | 9 | Never-married | Craft-repair | Not-in-family | Black | Male | 0 | 0 | 46 | United-States | <=50K |
32538 | 38 | Private | Bachelors | 13 | Divorced | Prof-specialty | Unmarried | Black | Female | 15020 | 0 | 45 | United-States | >50K |
32539 | 71 | ? | Doctorate | 16 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 10 | United-States | >50K |
32540 | 45 | State-gov | HS-grad | 9 | Separated | Adm-clerical | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |
32541 | 41 | ? | HS-grad | 9 | Separated | ? | Not-in-family | Black | Female | 0 | 0 | 32 | United-States | <=50K |
32542 | 72 | ? | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 25 | United-States | <=50K |
32543 | 45 | Local-gov | Assoc-acdm | 12 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 48 | United-States | <=50K |
32544 | 31 | Private | Masters | 14 | Divorced | Other-service | Not-in-family | Other | Female | 0 | 0 | 30 | United-States | <=50K |
32545 | 39 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 20 | United-States | >50K |
32546 | 37 | Private | Assoc-acdm | 12 | Divorced | Tech-support | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
32547 | 43 | Private | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K |
32548 | 65 | Self-emp-not-inc | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 1086 | 0 | 60 | United-States | <=50K |
32549 | 43 | State-gov | Some-college | 10 | Divorced | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |
32550 | 43 | Self-emp-not-inc | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
32551 | 32 | Private | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States | <=50K |
32552 | 43 | Private | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |
32553 | 32 | Private | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan | <=50K |
32554 | 53 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32555 | 22 | Private | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
32556 | 27 | Private | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
32557 | 40 | Private | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
32558 | 58 | Private | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
32559 | 22 | Private | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
32560 | 52 | Self-emp-inc | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |
32561 rows × 14 columns
Looking at the 'education' and 'education-num' column we might suspect duplicate information. Let's check for that!
raw_data_df[['education', 'education-num']].drop_duplicates().sort('education-num')
education | education-num | |
---|---|---|
224 | Preschool | 1 |
160 | 1st-4th | 2 |
56 | 5th-6th | 3 |
15 | 7th-8th | 4 |
6 | 9th | 5 |
77 | 10th | 6 |
3 | 11th | 7 |
415 | 12th | 8 |
2 | HS-grad | 9 |
10 | Some-college | 10 |
14 | Assoc-voc | 11 |
13 | Assoc-acdm | 12 |
0 | Bachelors | 13 |
5 | Masters | 14 |
52 | Prof-school | 15 |
20 | Doctorate | 16 |
As suspected the two columns are mere duplicates of each other to cast the strings into integer categoricals. In the following we will drop the 'education' column.
All scikit-learn classification algorithms expect numerical features as input. We will hence cast the categoricals to integers and keep the lists areound to resolve classes later on if necessary.
list_of_catList= []
list_of_catList.append(raw_data_df.workclass.unique())
list_of_catList.append(raw_data_df['marital-status'].unique())
list_of_catList.append(raw_data_df.occupation.unique())
list_of_catList.append(raw_data_df.relationship.unique())
list_of_catList.append(raw_data_df.race.unique())
list_of_catList.append(raw_data_df.sex.unique())
list_of_catList.append(raw_data_df['native-country'].unique())
def cleanFeatureDF(feature_df):
df = feature_df
for lst in list_of_catList:
for cat in lst:
df = df.replace(cat, lst.tolist().index(cat))
return df
def cleanTargetDF(target_df):
df = target_df
list_of_target = target_df.unique()
for target in list_of_target:
df = df.replace(target, list_of_target.tolist().index(target))
return df
numerical_data_df = cleanFeatureDF(raw_data_df.drop('target',1).drop('education',1))
numerical_data_df['target'] = cleanTargetDF(raw_data_df.target)
numerical_test_df = cleanFeatureDF(raw_test_df.drop('target',1).drop('education',1))
numerical_test_df['target'] = cleanTargetDF(raw_test_df.target)
As we are dealing with a very dense dataset containing mostly categoricals we might assume that tree classifiers should do well on this problem, so we will look into training a simple tree and a random forest.
But before that we need to split our data frame into training features and the target variables:
train_feature_df = numerical_data_df.drop('target',1)
train_target_df = numerical_data_df.target
Let's learn a simple tree using scikit-learn's tree classifier
import numpy as np
from sklearn import tree
from sklearn.learning_curve import learning_curve
from sklearn.learning_curve import validation_curve
tree_clf = tree.DecisionTreeClassifier()
First we are interested how the naive tree is doing in learning our training set. To this end we record the learning curve of our dataset
def plotLearningCurve(classifier, feature_df, target_df, train_set_sizes):
train_sizes, train_scores, test_scores = learning_curve(classifier,
feature_df,
target_df,
train_sizes=train_set_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
plt.legend(loc="best")
train_sizes=np.linspace(.1, 1.0, 15)
plotLearningCurve(tree_clf, train_feature_df, train_target_df, train_sizes)
Our learning curve is pretty bad. The scores almost don't change at all with varying the data set size. Moreover the curve shows a large gap which hints to a large bias or overfitting of the data set. To improve on the learning we need to investigate the hyper-parameter space for the tree, which in this case is the tree depth. scikit-learn's tree does not impose any maximal tree depth, so let's look at the validation curves using 'accuracy' as a score in the dpeth hyper-parameter space
def plotValidationAccuracyCurve(classifier, feature_df, target_df, param_name, param_range):
train_scores, test_scores = validation_curve(classifier,
feature_df,
target_df,
param_name=param_name,
param_range=param_range,
cv=10,
scoring="accuracy",
n_jobs=1)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.plot(param_range, train_scores_mean, 'o-', label="Training score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.2, color="r")
plt.plot(param_range, test_scores_mean, 'o-', label="Cross-validation score",
color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.2, color="g")
plt.legend(loc="best")
param_range = range(2,17)
plotValidationAccuracyCurve(tree_clf, train_feature_df, train_target_df, 'max_depth', param_range)
Indeed we see a sweet spot at around depth 12, after which the cross-validation score starts to drop. We will fix the classifier at depth 12 and learn the final classifier
tree_clf.max_depth = 12
tree_clf.fit(train_feature_df, train_target_df)
DecisionTreeClassifier(compute_importances=None, criterion='gini', max_depth=12, max_features=None, max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, random_state=None, splitter='best')
RFs have the advantage that their inherent randomness counter-acts the learning of train-set intrisic patterns which don't generalize. They tend not to overfit and are stable against outliers.
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
Let's start with having a look at their learning behavior
train_sizes=np.linspace(.1, 1.0, 15)
plotLearningCurve(rf_clf, train_feature_df, train_target_df, train_sizes)
We face the same problems as with the single tree, hence we need to dig into the hyper-parameter space. We explore the maximal depth and the number of trees in the forest.
#Validation curve to find optimal max_depth of trees
param_range = range(2,17)
plotValidationAccuracyCurve(rf_clf, train_feature_df, train_target_df, 'max_depth', param_range)
We find the optimal value at around depth 14. Let's look for the optimal forest size at that maximal depth (Note: we might only find a local optimum or non at all. To find an actual optimal point we would need a grid search in hyper-parameter space)
#Validation curve to find optimal number of trees
param_range = range(2,17)
rf_clf.max_depth = 14
plotValidationAccuracyCurve(rf_clf, train_feature_df, train_target_df, 'n_estimators', param_range)
We see that around a number of 9 trees the scores start to saturate, so we can choose this value as a sweet spot. Let's see how our learning curve looks at this point in hyper-parameter space
train_sizes=np.linspace(.1, 1.0, 15)
rf_clf.max_depth = 14
rf_clf.n_estimators = 9
plotLearningCurve(rf_clf, train_feature_df, train_target_df, train_sizes)
Much better! We see that the curves are starting to converge and haven't saturated yet. We hence reduced our bias and it seems that increasing the data-set size might even improve our classifier.
Finally let's learn the final RF classifier
rf_clf.max_depth = 14
rf_clf.n_estimators = 9
rf_clf.fit(train_feature_df, train_target_df)
RandomForestClassifier(bootstrap=True, compute_importances=None, criterion='gini', max_depth=14, max_features='auto', max_leaf_nodes=None, min_density=None, min_samples_leaf=1, min_samples_split=2, n_estimators=9, n_jobs=1, oob_score=False, random_state=None, verbose=0)
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
tree_prediction = tree_clf.predict(numerical_test_df.drop('target',1))
tree_confusion_matrix = confusion_matrix(numerical_test_df.target, tree_prediction)
tree_accuracy = accuracy_score(numerical_test_df.target, tree_prediction)
rf_prediction = rf_clf.predict(numerical_test_df.drop('target',1))
rf_confusion_matrix = confusion_matrix(numerical_test_df.target, rf_prediction)
rf_accuracy = accuracy_score(numerical_test_df.target, rf_prediction)
print "Tree accuracy: %0.3f" % tree_accuracy
print "RF accuracy: %0.3f" % rf_accuracy
print "\n"
print "Tree confusion matrix: \n", tree_confusion_matrix
print "RF confusion matrix: \n", rf_confusion_matrix
Tree accuracy: 0.899 RF accuracy: 0.886 Tree confusion matrix: [[23621 1099] [ 2198 5643]] RF confusion matrix: [[23803 917] [ 2810 5031]]
We see that the single tree is slighlty more accurate, but that both classifiers perform close to 90% accurate. Looking at the confusion matrix it becomes obvious that the RF is more careful in classifying positives as the total number of false positives is smaller. Let's look at some more metrics to draw a final conclusion
print "Tree precision: %0.3f" % precision_score(numerical_test_df.target, tree_prediction)
print "RF precision: %0.3f" % precision_score(numerical_test_df.target, rf_prediction)
print "\n"
print "Tree recall: %0.3f" % recall_score(numerical_test_df.target, tree_prediction)
print "RF recall: %0.3f" % recall_score(numerical_test_df.target, rf_prediction)
Tree precision: 0.837 RF precision: 0.846 Tree recall: 0.720 RF recall: 0.642
As already suspected. The RF is more precise than the single tree in terms of assiging the correct class. However the single tree is outperforming the RF in terms of finding true positives in the data.
Finally it is important to understand what features the classifiers learned. In the case of our linear model this will still be easier to interpret. Let's have a look at the feature_importances_
of the classifiers.
feature_imp_df = pd.DataFrame()
feature_imp_df['feature'] = train_feature_df.columns
feature_imp_df['tree_importance'] = tree_clf.feature_importances_/tree_clf.feature_importances_.sum()
feature_imp_df['rf_importance'] = rf_clf.feature_importances_/rf_clf.feature_importances_.sum()
feature_imp_df.sort('rf_importance', ascending = True).plot(x='feature', kind='barh', figsize=(10,7))
<matplotlib.axes._subplots.AxesSubplot at 0x1109317d0>
We immediately see that the tree as well as the RF share the three most important features 'capital-gain', 'education-num' and 'marital-status' to be indicative of whether or not the income is larger than 50K. However the single tree places slightly more importance on those features than the RF which might be connected to some overfitting due to missing randomness that the RF has. Hover the 4th feature is interesting! RF indicates that 'relationship' is a strong indicator of large income, whereas the single tree completely dismisses this feature as unimportant.
Let's look at this feature in more detail!
marital_status_list = list_of_catList[3]
df_to_plot = numerical_data_df[['relationship','target']]
df_to_plot[df_to_plot.target ==1].hist()
df_to_plot[df_to_plot.target ==0].hist()
print marital_status_list
['Not-in-family' 'Husband' 'Wife' 'Own-child' 'Unmarried' 'Other-relative']
We see that this is indeed a feature! Not being in the relationship status 'Husband' significantly decreases your chance of making more than 50K a year! Husband's themselves are relatively safe since their status appears mostly uncorrelated with the money they earn. Intuitively it is clear that RF could pick this up since every tree in the forest might split on a different value of the 'relationship' category and the final voting will outweight the insignificant 'husband' value.
Though the analysis done here is fairly trivial it turns out that using two simple classifiers we are already able to gain insight into a interesting feature. Namley that not being in a relationship status 'Husband' significantly reduces the chances of making more than 50K of income a year. For husbands themselves however, the status is non-decisive. THe feature was detected using a Random Forest classifier but could not be picked up using a single Decision tree.
I hope you find this notebook useful and it motivates you to play some more with it. Contributions are always welcome!
are mostly given throughout the text. But importantly
from IPython.core.display import HTML
def css_styling():
styles = open("../styles/custom.css", "r").read()
return HTML(styles)
css_styling()