In the previous section, we ended up with a smaller set of predictions because we chose to discard rows with missing values. In this section we build on that approach by filling in the missing data with an educated guess. Only the new concepts introduced here are described in detail.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
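The positional split with iloc above can be sketched on a toy frame (hypothetical data, for illustration only):

```python
import pandas as pd

# Toy frame of 10 rows; split the first 7 into "train", the rest into "test"
df_toy = pd.DataFrame({'a': range(10)})
train = df_toy.iloc[:7, :]   # rows 0..6
test = df_toy.iloc[7:, :]    # rows 7..9

print(len(train), len(test))  # 7 3
```

Note that iloc slices by integer position, not by index label, so the split is independent of the DataFrame's index values.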
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
Similar to the previous section, we review the data type and value counts.
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            565 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       711 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
There are a number of ways to fill in the NaN values of the column Age. For simplicity, we will use the mean of the column's available values.
age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)
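The two lines above can be sketched on a toy Series (hypothetical values, for illustration only). Note that Series.mean ignores NaN entries by default, so the mean is computed over the available values only:

```python
import pandas as pd

# Toy ages with one missing value
ages = pd.Series([22.0, None, 38.0, 26.0])

mean_age = ages.mean()          # NaN is ignored: (22 + 38 + 26) / 3
filled = ages.fillna(mean_age)  # every NaN is replaced by the mean

print(round(mean_age, 2))       # 28.67
print(filled.isna().sum())      # 0
```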
Exercise
Taking the average does not make sense for the column Embarked, as it is a categorical value. Instead, we shall replace the NaN values with the mode, or most frequently occurring value.
from collections import Counter
Counter(df_train['Embarked'])
Counter({nan: 1, 'C': 138, 'Q': 64, 'S': 509})
df_train['Embarked'] = df_train['Embarked'].fillna('S')
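Rather than reading the mode off the Counter output and hard-coding 'S', it could also be computed directly with Series.mode; a small sketch on toy data:

```python
import pandas as pd

# Toy categorical Series with one missing value (illustration only)
s = pd.Series(['S', 'C', None, 'S', 'Q', 'S'])

mode_value = s.mode()[0]       # most frequent non-null value
filled = s.fillna(mode_value)

print(mode_value)              # S
print(filled.isna().sum())     # 0
```

Computing the mode programmatically keeps the imputation correct even if the data changes and a different port becomes the most common.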
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
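One caveat worth knowing about Series.map: any value absent from the mapping dictionary becomes NaN, which is one reason to fill missing values before mapping. A small sketch on toy data:

```python
import pandas as pd

# 'X' is deliberately missing from the mapping (toy data, illustration only)
s = pd.Series(['C', 'S', 'Q', 'X'])
mapped = s.map({'C': 1, 'S': 2, 'Q': 3})

print(mapped.isna().sum())  # 1 -- the unmapped 'X' became NaN
```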
We now review details of our training data.
df_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null int64
Age            712 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       712 non-null int64
dtypes: float64(2), int64(7)
memory usage: 55.6 KB
We have thus preserved all the rows of our data set, and proceed to create a numerical array for Scikit-learn.
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100, random_state=0)
model = model.fit(X_train, y_train)
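As a minimal sketch of how the classifier is fit and queried, here is the same pattern on tiny toy data (not the Titanic set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy training data: the label depends only on the first feature
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = np.array([0, 0, 1, 1] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf = clf.fit(X, y)

# On this perfectly separable data the forest recovers the rule
print(clf.predict([[0, 0], [1, 1]]))  # [0 1]
```

Setting random_state makes the forest's bootstrap sampling reproducible, so repeated runs give identical predictions.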
We now review what needs to be cleaned in the test data.
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 712 to 890
Data columns (total 12 columns):
PassengerId    179 non-null int64
Survived       179 non-null int64
Pclass         179 non-null int64
Name           179 non-null object
Sex            179 non-null object
Age            149 non-null float64
SibSp          179 non-null int64
Parch          179 non-null int64
Ticket         179 non-null object
Fare           179 non-null float64
Cabin          42 non-null object
Embarked       178 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 18.2+ KB
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
As per our previous approach, we fill in the NaN values in the columns Age and Embarked with the mean and mode respectively. Note that we reuse age_mean, which was computed from the training data, so the test set is treated exactly as unseen data would be at prediction time.
df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')
df_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 179 entries, 712 to 890
Data columns (total 9 columns):
PassengerId    179 non-null int64
Survived       179 non-null int64
Pclass         179 non-null int64
Sex            179 non-null object
Age            179 non-null float64
SibSp          179 non-null int64
Parch          179 non-null int64
Fare           179 non-null float64
Embarked       179 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 14.0+ KB
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})
X_test = df_test.iloc[:, 2:].values
y_test = df_test['Survived']
y_prediction = model.predict(X_test)
As before, we calculate the model's accuracy:
np.sum(y_prediction == y_test) / float(len(y_test))
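The manual computation above counts matching predictions and divides by the total. The same result can be obtained with scikit-learn's accuracy_score helper; a small sketch on toy labels (illustration only):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy ground truth and predictions: 4 of 5 match
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

manual = np.sum(y_pred == y_true) / float(len(y_true))
print(manual)                          # 0.8
print(accuracy_score(y_true, y_pred))  # 0.8
```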
0.81564245810055869
While this accuracy is slightly lower than that of our previous approach, the current approach preserves the full number of predictions to be made.
len(y_test)
179
More importantly, all of the training data was used to train our model. By ignoring rows with missing values, we would essentially be throwing away information that could be used.