In the previous section, we ended up with a smaller set of predictions because we chose to discard rows with missing values. In this section we build on that approach by filling in the missing data with an educated guess. Only the new concepts introduced here are described in detail.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
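The positional split with iloc above can be sketched on a toy frame (hypothetical data, for illustration only):

```python
import pandas as pd

# Toy frame of 10 rows; split the first 7 into "train", the rest into "test"
df_toy = pd.DataFrame({'a': range(10)})
train = df_toy.iloc[:7, :]   # rows 0..6
test = df_toy.iloc[7:, :]    # rows 7..9

print(len(train), len(test))  # 7 3
```

Note that iloc slices by integer position, not by index label, so the split is independent of the DataFrame's index values.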
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
Similar to the previous section, we review the data type and value counts.
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            565 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       711 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
There are a number of ways to fill in the NaN values of the column Age. For simplicity, we will use the mean of the column's available values.
age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)
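The two lines above can be sketched on a toy Series (hypothetical values, for illustration only). Note that Series.mean ignores NaN entries by default, so the mean is computed over the available values only:

```python
import pandas as pd

# Toy ages with one missing value
ages = pd.Series([22.0, None, 38.0, 26.0])

mean_age = ages.mean()          # NaN is ignored: (22 + 38 + 26) / 3
filled = ages.fillna(mean_age)  # every NaN is replaced by the mean

print(round(mean_age, 2))       # 28.67
print(filled.isna().sum())      # 0
```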
Exercise
Taking the average does not make sense for the column Embarked, as it is a categorical value. Instead, we shall replace the NaN values with the mode, or most frequently occurring value.
from collections import Counter
Counter(df_train['Embarked'])
Counter({nan: 1, 'C': 138, 'Q': 64, 'S': 509})
df_train['Embarked'] = df_train['Embarked'].fillna('S')
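Rather than reading the mode off the Counter output and hard-coding 'S', it could also be computed directly with Series.mode; a small sketch on toy data:

```python
import pandas as pd

# Toy categorical Series with one missing value (illustration only)
s = pd.Series(['S', 'C', None, 'S', 'Q', 'S'])

mode_value = s.mode()[0]       # most frequent non-null value
filled = s.fillna(mode_value)

print(mode_value)              # S
print(filled.isna().sum())     # 0
```

Computing the mode programmatically keeps the imputation correct even if the data changes and a different port becomes the most common.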
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
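One caveat worth knowing about Series.map: any value absent from the mapping dictionary becomes NaN, which is one reason to fill missing values before mapping. A small sketch on toy data:

```python
import pandas as pd

# 'X' is deliberately missing from the mapping (toy data, illustration only)
s = pd.Series(['C', 'S', 'Q', 'X'])
mapped = s.map({'C': 1, 'S': 2, 'Q': 3})

print(mapped.isna().sum())  # 1 -- the unmapped 'X' became NaN
```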
We now review details of our training data.
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null int64
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       712 non-null int64
dtypes: float64(2), int64(7)
memory usage: 55.6 KB
We have thus preserved all the rows of our data set, and proceed to create a numerical array for Scikit-learn.
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100, random_state=0)
model = model.fit(X_train, y_train)
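As a minimal sketch of how the classifier is fit and queried, here is the same pattern on tiny toy data (not the Titanic set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: the label depends only on the first feature
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = np.array([0, 0, 1, 1] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf = clf.fit(X, y)

# On this perfectly separable data the forest recovers the rule
print(clf.predict([[0, 0], [1, 1]]))  # [0 1]
```

Setting random_state makes the forest's bootstrap sampling reproducible, so repeated runs give identical predictions.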
We now review what needs to be cleaned in the test data.
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 712 to 890
Data columns (total 12 columns):
PassengerId    179 non-null int64
Survived       179 non-null int64
Pclass         179 non-null int64
Name           179 non-null object
Sex            179 non-null object
Age            149 non-null float64
SibSp          179 non-null int64
Parch          179 non-null int64
Ticket         179 non-null object
Fare           179 non-null float64
Cabin          42 non-null object
Embarked       178 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 18.2+ KB
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
As per our previous approach, we fill in the NaN values in the columns Age and Embarked with the mean and mode respectively. Note that we reuse age_mean, which was computed from the training data, so the test set is treated exactly as unseen data would be at prediction time.
df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 712 to 890
Data columns (total 9 columns):
PassengerId    179 non-null int64
Survived       179 non-null int64
Pclass         179 non-null int64
Sex            179 non-null object
Age            179 non-null float64
SibSp          179 non-null int64
Parch          179 non-null int64
Fare           179 non-null float64
Embarked       179 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 14.0+ KB
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})
X_test = df_test.iloc[:, 2:].values
y_test = df_test['Survived']
y_prediction = model.predict(X_test)
As before, we calculate the model's accuracy:
np.sum(y_prediction == y_test) / float(len(y_test))
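The manual computation above counts matching predictions and divides by the total. The same result can be obtained with scikit-learn's accuracy_score helper; a small sketch on toy labels (illustration only):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy ground truth and predictions: 4 of 5 match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

manual = np.sum(y_pred == y_true) / float(len(y_true))
print(manual)                          # 0.8
print(accuracy_score(y_true, y_pred))  # 0.8
```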
0.81564245810055869
While this accuracy is slightly lower than that of our previous approach, the current approach preserves the full number of predictions to be made.
len(y_test)
179
More importantly, all of the training data was used to train our model. By ignoring rows with missing values, we would essentially be throwing away information that could be used.