Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python.
While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data so that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.
Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.
https://www.kaggle.com/competitions
We will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.
We will start by splitting the data into a training set and a test set. Next we process the training data and use it to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we compare our predictions against the 'ground truth' to see how well our model performed.
It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps, simplistic) approach of ignoring the whole row if any part of it contains a NaN value. We will build on this approach in later sections.
First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
We review the size of the data.
df.shape
(891, 12)
We now split the data into an 80% training set and 20% test set.
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
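The cut-off row 712 follows from taking 80% of the 891 rows and rounding down. A minimal sketch of that arithmetic, using a stand-in frame (`df_demo` and its contents are illustrative, not the competition data):

```python
import pandas as pd

# Stand-in frame with the same number of rows as the Titanic data (illustrative only)
df_demo = pd.DataFrame({"PassengerId": range(1, 892)})

# 80% of the rows, rounded down, gives the position of the split
n_train = int(len(df_demo) * 0.8)

# Positional split: first n_train rows for training, the rest for testing
df_demo_train = df_demo.iloc[:n_train, :]
df_demo_test = df_demo.iloc[n_train:, :]

print(n_train, len(df_demo_train), len(df_demo_test))
```

Note that a positional split like this assumes the rows are not ordered in any way that relates to the outcome.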
We review a selection of the data.
df_train.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | NaN | C |
We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).
We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. We proceed to remove them from our data set.
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
Next, we review the type of data in the columns, and their respective counts.
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            565 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       711 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
We notice that the columns Age and Embarked have NaNs or missing values. As previously discussed, we take the approach of simply removing the rows with missing values.
df_train = df_train.dropna()
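To see what dropping rows costs, it can help to count the missing values per column first. The following is a small sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Age and one missing Embarked value
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()

print(missing_per_column)
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Because a row is dropped if any one of its columns is missing, the number of rows removed can be as large as the sum of the per-column counts.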
Scikit-learn only takes numerical arrays as inputs. As such, we would need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and map the string values to numbers.
df_train['Sex'].unique()
array(['male', 'female'], dtype=object)
df_train['Sex'] = df_train['Sex'].map({'female':0, 'male':1})
Similarly for Embarked, we review the range of values and map the string values to a numerical value that represents where the passenger embarked from.
df_train['Embarked'].unique()
array(['S', 'C', 'Q'], dtype=object)
df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
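One thing worth checking after a mapping like this is that every value was covered, because `Series.map` turns any value missing from the dictionary into NaN. A short sketch with a made-up series (the stray `'Male'` entry is hypothetical):

```python
import pandas as pd

# Made-up column with one value ('Male') that the dictionary does not cover
sex = pd.Series(["male", "female", "Male"])

encoded = sex.map({"female": 0, "male": 1})

# Values absent from the dictionary become NaN, so count them
unmapped = encoded.isna().sum()

print(encoded)
print("unmapped values:", unmapped)
```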
In our final review of our training data, we check that (1) there are no NaN values, and (2) all the values are in numerical form.
df_train.head(10)
| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | 1 | 22 | 1 | 0 | 7.2500 | 2 |
1 | 2 | 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 1 |
2 | 3 | 1 | 3 | 0 | 26 | 0 | 0 | 7.9250 | 2 |
3 | 4 | 1 | 1 | 0 | 35 | 1 | 0 | 53.1000 | 2 |
4 | 5 | 0 | 3 | 1 | 35 | 0 | 0 | 8.0500 | 2 |
6 | 7 | 0 | 1 | 1 | 54 | 0 | 0 | 51.8625 | 2 |
7 | 8 | 0 | 3 | 1 | 2 | 3 | 1 | 21.0750 | 2 |
8 | 9 | 1 | 3 | 0 | 27 | 0 | 2 | 11.1333 | 2 |
9 | 10 | 1 | 2 | 0 | 14 | 1 | 0 | 30.0708 | 1 |
10 | 11 | 1 | 3 | 0 | 4 | 1 | 1 | 16.7000 | 2 |
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 564 entries, 0 to 710
Data columns (total 9 columns):
PassengerId    564 non-null int64
Survived       564 non-null int64
Pclass         564 non-null int64
Sex            564 non-null int64
Age            564 non-null float64
SibSp          564 non-null int64
Parch          564 non-null int64
Fare           564 non-null float64
Embarked       564 non-null int64
dtypes: float64(2), int64(7)
memory usage: 44.1 KB
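The two conditions can also be checked programmatically instead of by eye. The helper below is a sketch; `is_model_ready` is a hypothetical name, not part of Pandas or Scikit-learn:

```python
import pandas as pd

def is_model_ready(frame):
    """Return True if the frame has no missing values and only numeric columns."""
    no_missing = not frame.isna().any().any()
    all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in frame.dtypes)
    return no_missing and all_numeric

# Made-up frames to illustrate the three cases
print(is_model_ready(pd.DataFrame({"Age": [22.0, 38.0], "Sex": [1, 0]})))
print(is_model_ready(pd.DataFrame({"Sex": ["male", "female"]})))
print(is_model_ready(pd.DataFrame({"Age": [22.0, float("nan")]})))
```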
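Finally, we convert the feature columns of the processed training data from a Pandas dataframe into a numerical (Numpy) array, and take the column Survived as the outcomes of the training data. The slice 2: skips the first two columns, PassengerId and Survived, so that neither is used as a feature.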
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']
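The effect of the positional slice is easiest to see on a tiny frame. This sketch assumes the same column order as above; `toy`, `X_toy` and `y_toy` are illustrative names:

```python
import pandas as pd

# Tiny made-up frame with the same leading columns as the processed training data
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": [1, 0],
})

# Columns from position 2 onwards are the features
X_toy = toy.iloc[:, 2:].values

# The outcome column is selected by name
y_toy = toy["Survived"].values

print(X_toy.shape, y_toy.shape)
```

Selecting by position depends on the column order, so it is worth re-checking the slice whenever columns are added or removed.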
In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections.
In particular, we'll be using the Random Forest model. The intuition is as follows: a decision tree repeatedly splits the passengers on whichever feature best separates those who survived from those who did not, and each split forms a 'branch' of the 'tree'. The Random Forest model, broadly speaking, builds a 'forest' of such trees, each on a random sample of the data and features, and aggregates their results.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)
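To make the aggregation step concrete, here is a simplified sketch of a majority vote over the predictions of three trees. The votes are made up, and this is only an illustration of the idea: Scikit-learn's RandomForestClassifier averages the predicted class probabilities of its trees rather than counting hard votes.

```python
import numpy as np

# Hypothetical predictions from three trees (rows) for four passengers (columns)
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
])

# A passenger is predicted to survive if more than half of the trees say so
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

print(majority)
```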
We use the processed training data to 'train' (or 'fit') our model.
model = model.fit(X_train, y_train)
We now review a selection of the test data.
df_test.head(10)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
712 | 713 | 1 | 1 | Taylor, Mr. Elmer Zebley | male | 48 | 1 | 0 | 19996 | 52.0000 | C126 | S |
713 | 714 | 0 | 3 | Larsson, Mr. August Viktor | male | 29 | 0 | 0 | 7545 | 9.4833 | NaN | S |
714 | 715 | 0 | 2 | Greenberg, Mr. Samuel | male | 52 | 0 | 0 | 250647 | 13.0000 | NaN | S |
715 | 716 | 0 | 3 | Soholt, Mr. Peter Andreas Lauritz Andersen | male | 19 | 0 | 0 | 348124 | 7.6500 | F G73 | S |
716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
717 | 718 | 1 | 2 | Troutt, Miss. Edwina Celia "Winnie" | female | 27 | 0 | 0 | 34218 | 10.5000 | E101 | S |
718 | 719 | 0 | 3 | McEvoy, Mr. Michael | male | NaN | 0 | 0 | 36568 | 15.5000 | NaN | Q |
719 | 720 | 0 | 3 | Johnson, Mr. Malkolm Joackim | male | 33 | 0 | 0 | 347062 | 7.7750 | NaN | S |
720 | 721 | 1 | 2 | Harper, Miss. Annie Jessie "Nina" | female | 6 | 0 | 1 | 248727 | 33.0000 | NaN | S |
721 | 722 | 0 | 3 | Jensen, Mr. Svend Lauritz | male | 17 | 1 | 0 | 350048 | 7.0542 | NaN | S |
We process the test data in the same way as the training data.
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df_test = df_test.dropna()
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male':1})
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})
X_test = df_test.iloc[:, 2:].values
y_test = df_test['Survived']
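Since the training and test data go through identical steps, one option is to collect those steps in a single function so the two cannot drift apart. The sketch below assumes the column names shown above; `preprocess` and the frame `raw` are hypothetical:

```python
import pandas as pd

def preprocess(frame):
    """Drop unused columns and incomplete rows, then encode Sex and Embarked."""
    out = frame.drop(["Name", "Ticket", "Cabin"], axis=1).dropna().copy()
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    out["Embarked"] = out["Embarked"].map({"C": 1, "S": 2, "Q": 3})
    return out

# Made-up rows; the second one has a missing Age and is dropped
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "Ticket": ["t1", "t2", "t3"],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "Q"],
})

processed = preprocess(raw)
print(processed)
```

Cabin is removed before `dropna` is called, so its many missing values do not cause rows to be discarded.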
We now apply the trained model to the test data (omitting the column PassengerId) to produce an output of predictions.
y_prediction = model.predict(X_test)
Comparing our predictions against the actual values gives us a sequence of True and False values (which count as 1 and 0 respectively), and adding them up gives us the number of correct predictions.
np.sum(y_prediction == y_test)
123
To get a sense of how good our prediction is, we calculate the model's accuracy by dividing the number of correct predictions by the length of the array of actual values.
np.sum(y_prediction == y_test) / float(len(y_test))
0.83108108108108103
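The same calculation on a small made-up example shows each step. The arrays `y_true` and `y_pred` are hypothetical, and taking the mean of the comparison is an equivalent shortcut for dividing the count by the length:

```python
import numpy as np

# Made-up actual outcomes and predictions for five passengers
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison: True where the prediction is correct
correct = y_pred == y_true

# True counts as 1, so the sum is the number of correct predictions
n_correct = int(correct.sum())

# The mean of the booleans is the fraction of correct predictions
accuracy = correct.mean()

print(n_correct, accuracy)
```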
Hence our predictions are 83% accurate. We now compare this against a naive baseline, by looking at the proportion of 0s and 1s.
np.sum(y_test) / float(len(y_test))
0.39189189189189189
Hence 39% of the passengers survived (with value 1) and 61% did not survive. If we were to guess that none of the passengers survived, we would have a 61% accuracy. Hence our model improves on this baseline by about 22 percentage points!
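The baseline comparison can be written out as plain arithmetic. The counts below are inferred from the outputs above rather than stated directly: 123 correct predictions at an accuracy of 0.8311 imply 148 test rows, and a survival rate of 0.3919 implies 58 survivors.

```python
# Counts inferred from the outputs above (not printed directly in the text)
n_test = 148
n_survived = 58
n_correct = 123

survival_rate = n_survived / n_test

# Always guessing the more common outcome is right this often
baseline = max(survival_rate, 1 - survival_rate)

model_accuracy = n_correct / n_test

# Difference between model and baseline, in percentage points
improvement_points = 100 * (model_accuracy - baseline)

print(round(baseline, 3), round(model_accuracy, 3), round(improvement_points, 1))
```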
In this section, we took the simplest approach of ignoring missing values. We look to build on this approach in Section 1-1.