In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numerical values, however, carry a notion of ordering that the categorical values lack (they are merely listed in alphabetical order). To get around this problem, we shall introduce the concept of dummy variables.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)
age_mean = df_train['Age'].mean()
df_train['Age'] = df_train['Age'].fillna(age_mean)
df_train['Embarked'] = df_train['Embarked'].fillna('S')
As there are only two unique values for the column Sex, no ordering problem arises when we map them to 0 and 1.
df_train['Sex'] = df_train['Sex'].map({'female': 0, 'male': 1})
For the column Embarked, however, replacing {C, S, Q} by {1, 2, 3} would seem to imply the ordering C < S < Q when in fact they are simply arranged alphabetically.
To avoid this problem, we create dummy variables. Essentially this involves creating a new column for each port of embarkation, containing 1 if the passenger embarked there and 0 otherwise. Pandas has a built-in function, get_dummies, to create these columns automatically.
pd.get_dummies(df_train['Embarked'], prefix='Embarked').head(10)
|   | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |
| 8 | 0 | 0 | 1 |
| 9 | 1 | 0 | 0 |
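As an aside, get_dummies also accepts a drop_first parameter that omits the first category. Since the remaining columns fully determine the dropped one, this can be useful for linear models where the full set of dummies is collinear. A minimal sketch with toy values (illustrative, not the actual dataset):

```python
import pandas as pd

# Toy Embarked-like column (illustrative values only)
s = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

full = pd.get_dummies(s, prefix='Embarked')
reduced = pd.get_dummies(s, prefix='Embarked', drop_first=True)

print(full.columns.tolist())     # one column per category, alphabetical
print(reduced.columns.tolist())  # first category (C) dropped
```

For tree-based models such as the random forest we use below, keeping all three columns is harmless.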
We now concatenate the columns containing the dummy variables to our main dataframe.
df_train = pd.concat([df_train, pd.get_dummies(df_train['Embarked'], prefix='Embarked')], axis=1)
We then drop the original Embarked column, since its information is now captured by the dummy variables.
df_train = df_train.drop(['Embarked'], axis=1)
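The concat-then-drop pattern above can also be collapsed into a single call: get_dummies, applied to a whole DataFrame with the columns argument, encodes the named columns and drops the originals. A minimal sketch with toy data (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Fare': [7.25, 71.28, 8.05],
                   'Embarked': ['S', 'C', 'S']})

# Encodes 'Embarked' and removes the original column in one step
encoded = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
print(encoded.columns.tolist())
```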
We review our processed training data.
df_train.head(10)
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 1 | 22.000000 | 1 | 0 | 7.2500 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 0 | 38.000000 | 1 | 0 | 71.2833 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 0 | 26.000000 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 0 | 35.000000 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 1 | 35.000000 | 0 | 0 | 8.0500 | 0 | 0 | 1 |
| 5 | 6 | 0 | 3 | 1 | 30.030531 | 0 | 0 | 8.4583 | 0 | 1 | 0 |
| 6 | 7 | 0 | 1 | 1 | 54.000000 | 0 | 0 | 51.8625 | 0 | 0 | 1 |
| 7 | 8 | 0 | 3 | 1 | 2.000000 | 3 | 1 | 21.0750 | 0 | 0 | 1 |
| 8 | 9 | 1 | 3 | 0 | 27.000000 | 0 | 2 | 11.1333 | 0 | 0 | 1 |
| 9 | 10 | 1 | 2 | 0 | 14.000000 | 1 | 0 | 30.0708 | 1 | 0 | 0 |
X_train = df_train.iloc[:, 2:].values  # feature columns, skipping PassengerId and Survived
y_train = df_train['Survived']
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=0)
model = model.fit(X_train, y_train)
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df_test['Age'] = df_test['Age'].fillna(age_mean)
df_test['Embarked'] = df_test['Embarked'].fillna('S')
df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})
Similarly, we create dummy variables for the test data.
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')], axis=1)
df_test = df_test.drop(['Embarked'], axis=1)
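One caveat worth noting: get_dummies only creates columns for the categories it actually sees. If the test split happened to lack a category (say, no passengers who embarked at Q), the test frame would end up with fewer columns than the training frame and the model's input shapes would no longer match. A common fix is to reindex the test dummies against the training columns, padding missing ones with zeros. A toy illustration (not needed for this particular split, where all three ports appear in both halves):

```python
import pandas as pd

# Train split sees all three ports; test split is missing 'Q'
train_dummies = pd.get_dummies(pd.Series(['S', 'C', 'Q']), prefix='Embarked')
test_dummies = pd.get_dummies(pd.Series(['S', 'C']), prefix='Embarked')

# Pad the missing dummy column with zeros so shapes match
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```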
X_test = df_test.iloc[:, 2:].values  # same feature columns as training
y_test = df_test['Survived']
y_prediction = model.predict(X_test)
np.sum(y_prediction == y_test) / float(len(y_test))
0.83798882681564246
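The accuracy computed above is simply the fraction of correct predictions; scikit-learn provides the same arithmetic as accuracy_score. A quick check with toy arrays (illustrative values, not our predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Manual fraction, as in the text
manual = np.sum(y_pred == y_true) / float(len(y_true))
# Library equivalent
lib = accuracy_score(y_true, y_pred)
print(manual, lib)
```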