Pandas and Scikit-learn¶

Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python.

While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data so that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.

Kaggle¶

Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model for problems posted on the competition website.

https://www.kaggle.com/competitions

We will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

Section 1-0 - First Cut¶

We will start by splitting the data into a training set and a test set. Next, we process the training data and use it to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we compare our predictions against the 'ground truth' to see how well our model performed.

It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps, simplistic) approach of ignoring the whole row if any part of it contains a NaN value. We will build on this approach in later sections.

Pandas - Extracting data¶

First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:

https://www.kaggle.com/c/titanic/data

In [1]:

import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

We review the size of the data.

In [2]:

df.shape

Out[2]:

(891, 12)

We now split the data into an 80% training set and 20% test set.

In [3]:

df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
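
The positional split above keeps the first 712 rows for training, which works as long as the rows are not sorted in a way that relates to the outcome. As a minimal sketch of an alternative, assuming Scikit-learn's `train_test_split` helper and a small made-up frame (the `toy` data below is not part of the Titanic set), a shuffled and stratified split could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1] * 10,
    "Age": list(range(20, 40)),
})

# Shuffle the rows before splitting, and keep the proportion of
# survivors roughly equal in both parts (stratify).
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Survived"]
)

print(toy_train.shape, toy_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each part on every run.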

Pandas - Cleaning data¶

We review a selection of the data.

In [4]:

df_train.head(10)

Out[4]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14	1	0	237736	30.0708	NaN	C

We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).

Exercise:

Write the code to review the tail-end section of the data.

We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. We proceed to remove them from our data set.

In [5]:

df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Next, we review the type of data in the columns, and their respective counts.

In [6]:

df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            565 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       711 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB

We notice that the columns Age and Embarked have NaNs or missing values. As previously discussed, we take the approach of simply removing the rows with missing values.

In [7]:

df_train = df_train.dropna()
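
To see what `dropna()` does, the sketch below counts the missing values per column on a small made-up frame and then drops every row that contains at least one of them. The `toy` data is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up rows: one is missing Age, another is missing Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "Q", None, "S"],
    "Fare": [7.25, 8.46, 7.93, 53.10],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

Note that a row is discarded even if only one of its columns is missing, so the remaining columns of that row are lost as well.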

Question

If you were to fill in the missing values, what values would you fill them with? Why?

Scikit-learn only takes numerical arrays as inputs. As such, we would need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and map the string values to numbers.

In [8]:

df_train['Sex'].unique()

Out[8]:

array(['male', 'female'], dtype=object)

In [9]:

df_train['Sex'] = df_train['Sex'].map({'female':0, 'male':1})

Similarly for Embarked, we review the range of values and map the string values to a numerical value that represents where the passenger embarked from.

In [10]:

df_train['Embarked'].unique()

Out[10]:

array(['S', 'C', 'Q'], dtype=object)

In [11]:

df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
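
One detail worth knowing about `map()` with a dictionary: any value that is not a key in the dictionary becomes NaN. The sketch below illustrates this on a made-up Series, where the label "X" is a hypothetical port that is not in the mapping.

```python
import pandas as pd

# "X" is a made-up label that the mapping does not cover.
ports = pd.Series(["S", "C", "Q", "X"])
codes = ports.map({"C": 1, "S": 2, "Q": 3})

print(codes)
print("unmapped values:", codes.isna().sum())
```

This is why reviewing the output of `unique()` before mapping is a useful habit: it confirms that the dictionary covers every value in the column.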

Question

What problems might we encounter by mapping C, S, and Q in the column Embarked to the values 1, 2, and 3? In other words, what does the ordering imply? Does the same problem exist for the column Sex?
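
As a hint for the question above, one common way to avoid implying an order between categories is one-hot encoding, where each category gets its own 0/1 column. The following is only a sketch on made-up data using `pd.get_dummies`; it is not the approach used in this section.

```python
import pandas as pd

# Made-up column of embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port; exactly one is set in each row.
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(one_hot.columns.tolist())
print(one_hot)
```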

In our final review of our training data, we check that (1) there are no NaN values, and (2) all the values are in numerical form.

In [12]:

df_train.head(10)

Out[12]:

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	1	22	1	0	7.2500	2
1	2	1	1	0	38	1	0	71.2833	1
2	3	1	3	0	26	0	0	7.9250	2
3	4	1	1	0	35	1	0	53.1000	2
4	5	0	3	1	35	0	0	8.0500	2
6	7	0	1	1	54	0	0	51.8625	2
7	8	0	3	1	2	3	1	21.0750	2
8	9	1	3	0	27	0	2	11.1333	2
9	10	1	2	0	14	1	0	30.0708	1
10	11	1	3	0	4	1	1	16.7000	2

In [13]:

df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 564 entries, 0 to 710
Data columns (total 9 columns):
PassengerId    564 non-null int64
Survived       564 non-null int64
Pclass         564 non-null int64
Sex            564 non-null int64
Age            564 non-null float64
SibSp          564 non-null int64
Parch          564 non-null int64
Fare           564 non-null float64
Embarked       564 non-null int64
dtypes: float64(2), int64(7)
memory usage: 44.1 KB

Finally, we convert the feature columns of the processed training data (everything after PassengerId and Survived) from a Pandas dataframe into a numerical (Numpy) array, and extract the column Survived as the outcomes of the training data.

In [14]:

X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']
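
The slice `iloc[:, 2:]` relies on PassengerId and Survived being the first two columns. As a sketch of an equivalent selection by name, assuming a made-up frame with the same leading columns (`feature_columns` is a hypothetical variable introduced here), the features can be picked explicitly:

```python
import pandas as pd

# Made-up frame with the same leading columns as the processed data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [1, 0, 0],
})

# Select the features by name instead of by position.
feature_columns = ["Pclass", "Sex"]
X = toy[feature_columns].values
y = toy["Survived"].values

# Both selections give the same array for this column order.
same_as_positional = (X == toy.iloc[:, 2:].values).all()
print(X.shape, y.shape, same_as_positional)
```

Selecting by name is more verbose, but it keeps working if the column order changes.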

Scikit-learn - Training the model¶

In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections.

In particular, we'll be using the Random Forest model. The intuition is as follows: a decision tree repeatedly splits the data on the feature (and threshold) that best separates the outcomes, and each split forms a 'branch'. A collection of branches is a 'tree'. The Random Forest model, broadly speaking, builds a 'forest' of such trees, each trained on a random sample of the data and features, and aggregates their predictions.

http://en.wikipedia.org/wiki/Random_forest
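
To make the 'aggregation' step concrete, the sketch below fits a small forest on synthetic data (generated with `make_classification`, not the Titanic data) and reproduces the forest's prediction by averaging the class probabilities of its individual trees. The data set size and the number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only to illustrate the mechanics.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_probabilities = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# Pick the class with the highest averaged probability.
manual_prediction = forest.classes_[np.argmax(tree_probabilities, axis=1)]

print(np.allclose(tree_probabilities, forest.predict_proba(X)))
print(np.array_equal(manual_prediction, forest.predict(X)))
```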

In [15]:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)

We use the processed training data to 'train' (or 'fit') our model.

In [16]:

model = model.fit(X_train, y_train)
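
Although we treat the model as a black box here, a fitted forest does expose a rough measure of how much each feature contributed, through its `feature_importances_` attribute. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f3`), so the printed numbers say nothing about the Titanic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names for the synthetic columns.
feature_names = ["f0", "f1", "f2", "f3"]

X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature; the values sum to 1.
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(name, round(importance, 3))
```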

Scikit-learn - Making predictions¶

We now review a selection of the test data.

In [17]:

df_test.head(10)

Out[17]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
712	713	1	1	Taylor, Mr. Elmer Zebley	male	48	1	0	19996	52.0000	C126	S
713	714	0	3	Larsson, Mr. August Viktor	male	29	0	0	7545	9.4833	NaN	S
714	715	0	2	Greenberg, Mr. Samuel	male	52	0	0	250647	13.0000	NaN	S
715	716	0	3	Soholt, Mr. Peter Andreas Lauritz Andersen	male	19	0	0	348124	7.6500	F G73	S
716	717	1	1	Endres, Miss. Caroline Louise	female	38	0	0	PC 17757	227.5250	C45	C
717	718	1	2	Troutt, Miss. Edwina Celia "Winnie"	female	27	0	0	34218	10.5000	E101	S
718	719	0	3	McEvoy, Mr. Michael	male	NaN	0	0	36568	15.5000	NaN	Q
719	720	0	3	Johnson, Mr. Malkolm Joackim	male	33	0	0	347062	7.7750	NaN	S
720	721	1	2	Harper, Miss. Annie Jessie "Nina"	female	6	0	1	248727	33.0000	NaN	S
721	722	0	3	Jensen, Mr. Svend Lauritz	male	17	1	0	350048	7.0542	NaN	S

We process the test data in the same way as the training data.

In [18]:

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test = df_test.dropna()

df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male':1})
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})

X_test = df_test.iloc[:, 2:].values  # Numpy array, matching the format of X_train
y_test = df_test['Survived']
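
Since the training and test data go through identical steps, the steps can be wrapped in a single function so that the two cannot drift apart. The following is a minimal sketch; `preprocess` is a hypothetical helper name, and the three-row `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Hypothetical helper mirroring the cleaning steps used in this section."""
    out = df.drop(["Name", "Ticket", "Cabin"], axis=1).dropna().copy()
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    out["Embarked"] = out["Embarked"].map({"C": 1, "S": 2, "Q": 3})
    # Skip PassengerId and Survived; the rest are the features.
    X = out.iloc[:, 2:].values
    y = out["Survived"].values
    return X, y

# Made-up rows with the same columns as the Titanic data.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, np.nan],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.93],
    "Cabin": ["C1", "C85", "C2"],
    "Embarked": ["S", "C", "S"],
})

X, y = preprocess(toy)
print(X.shape, y.tolist())
```

With such a helper, the same call would be applied to both `df_train` and `df_test`.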

We now apply the trained model to the test data (omitting the column PassengerId) to produce an output of predictions.

In [19]:

y_prediction = model.predict(X_test)

Evaluation¶

Comparing our predictions against the actual values gives us a Boolean array (True for a correct prediction, False otherwise). Since True counts as 1 and False as 0, summing the elements gives us the number of correct predictions.

In [20]:

np.sum(y_prediction == y_test)

Out[20]:

123

To get a sense of how good our prediction is, we calculate the model's accuracy by dividing the number of correct predictions by the length of the array of actual values.

In [21]:

np.sum(y_prediction == y_test) / float(len(y_test))

Out[21]:

0.83108108108108103

Hence our predictions are 83% accurate. We now compare this against a naive baseline, by looking at the proportion of 0s and 1s.

In [22]:

np.sum(y_test) / float(len(y_test))

Out[22]:

0.39189189189189189

Hence 39% of the passengers survived (with value 1) and 61% did not survive. If we were to guess that none of the passengers survived, we would have 61% accuracy. Hence our model gives an improvement of 22 percentage points!
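
Both numbers above can be cross-checked with Scikit-learn's own helpers. The sketch below assumes `accuracy_score` for the accuracy and `DummyClassifier` with the "most_frequent" strategy for the baseline; the label arrays are made up and are not the Titanic predictions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])

# The manual calculation used in this section, and the library equivalent.
manual_accuracy = np.sum(y_pred == y_true) / float(len(y_true))
library_accuracy = accuracy_score(y_true, y_pred)

# A baseline that always predicts the most frequent class.
# It ignores the features, so a placeholder array is enough.
X_placeholder = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
baseline_accuracy = baseline.score(X_placeholder, y_true)

print(manual_accuracy, library_accuracy, baseline_accuracy)
```

A model is only useful to the extent that it beats this kind of baseline, which is why the two figures are reported side by side.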

In this section, we took the simplest approach of ignoring missing values. We look to build on this approach in Section 1-1.