In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numerical values, however, carry a notion of ordering that the categorical values lack (they are merely listed in alphabetical order). To get around this problem, we shall introduce the concept of dummy variables.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)
df_train['Embarked'] = df_train['Embarked'].fillna('S')
As there are only two unique values for the column Sex, no ordering problem arises when we map them to 0 and 1.
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
For the column Embarked, however, replacing {C, S, Q} by {1, 2, 3} would seem to imply the ordering C < S < Q when in fact they are simply arranged alphabetically.
To avoid this problem, we create dummy variables. Essentially this involves creating a new column for each port of embarkation, containing 1 if the passenger embarked there and 0 otherwise. Pandas has a built-in function, get_dummies, to create these columns automatically.
pd.get_dummies(df_train['Embarked'], prefix='Embarked').head(10)
|   | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 |
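As an aside, get_dummies also accepts a drop_first parameter that omits the first category. Since the remaining columns fully determine the dropped one, this can be useful for linear models where the full set of dummies is collinear. A minimal sketch with toy values (illustrative, not the actual dataset):

```python
import pandas as pd

# Toy Embarked-like column (illustrative values only)
s = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

full = pd.get_dummies(s, prefix='Embarked')
reduced = pd.get_dummies(s, prefix='Embarked', drop_first=True)

print(full.columns.tolist())     # one column per category, alphabetical
print(reduced.columns.tolist())  # first category (C) dropped
```

For tree-based models such as the random forest we use below, keeping all three columns is harmless.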
We now concatenate the columns containing the dummy variables to our main dataframe.
df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)
We then drop the original Embarked column, since its information is now captured by the dummy variables.
df_train = df_train.drop(['Embarked'], axis=1)
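The concat-then-drop pattern above can also be collapsed into a single call: get_dummies, applied to a whole DataFrame with the columns argument, encodes the named columns and drops the originals. A minimal sketch with toy data (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Fare': [7.25, 71.28, 8.05],
                   'Embarked': ['S', 'C', 'S']})

# Encodes 'Embarked' and removes the original column in one step
encoded = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
print(encoded.columns.tolist())
```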
We review our processed training data.
df_train.head(10)
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 1 | 22.000000 | 1 | 0 | 7.2500 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 0 | 38.000000 | 1 | 0 | 71.2833 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 0 | 26.000000 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 0 | 35.000000 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 1 | 35.000000 | 0 | 0 | 8.0500 | 0 | 0 | 1 |
| 5 | 6 | 0 | 3 | 1 | 30.030531 | 0 | 0 | 8.4583 | 0 | 1 | 0 |
| 6 | 7 | 0 | 1 | 1 | 54.000000 | 0 | 0 | 51.8625 | 0 | 0 | 1 |
| 7 | 8 | 0 | 3 | 1 | 2.000000 | 3 | 1 | 21.0750 | 0 | 0 | 1 |
| 8 | 9 | 1 | 3 | 0 | 27.000000 | 0 | 2 | 11.1333 | 0 | 0 | 1 |
| 9 | 10 | 1 | 2 | 0 | 14.000000 | 1 | 0 | 30.0708 | 1 | 0 | 0 |
X_train = df_train.iloc[:, 2:].values  # feature columns, skipping PassengerId and Survived
y_train = df_train['Survived']
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, y_train)
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})
Similarly, we create dummy variables for the test data.
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')], axis=1)
df_test = df_test.drop(['Embarked'], axis=1)
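One caveat worth noting: get_dummies only creates columns for the categories it actually sees. If the test split happened to lack a category (say, no passengers who embarked at Q), the test frame would end up with fewer columns than the training frame and the model's input shapes would no longer match. A common fix is to reindex the test dummies against the training columns, padding missing ones with zeros. A toy illustration (not needed for this particular split, where all three ports appear in both halves):

```python
import pandas as pd

# Train split sees all three ports; test split is missing 'Q'
train_dummies = pd.get_dummies(pd.Series(['S', 'C', 'Q']), prefix='Embarked')
test_dummies = pd.get_dummies(pd.Series(['S', 'C']), prefix='Embarked')

# Pad the missing dummy column with zeros so shapes match
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```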
X_test = df_test.iloc[:, 2:].values  # same feature columns as training
y_test = df_test['Survived']
y_prediction = model.predict(X_test)
np.sum(y_prediction == y_test) / float(len(y_test))
0.83798882681564246
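The accuracy computed above is simply the fraction of correct predictions; scikit-learn provides the same arithmetic as accuracy_score. A quick check with toy arrays (illustrative values, not our predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Manual fraction, as in the text
manual = np.sum(y_pred == y_true) / float(len(y_true))
# Library equivalent
lib = accuracy_score(y_true, y_pred)
print(manual, lib)
```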