When i started participating in Kaggle competitions. I started with this problem.I feel this is a very easy problem to start learning Data Science. So i thought let me share how i approached this problem. Please feel free to share your comments. I have used IPython to solve this problem. then converted ipython notebook into html file. So if you feel the UI of this page is screwed up. you can see this notebook here
The Problem Statement is very simple. By seeing some example data about people who survived and who died in Titanic,we need to predict that given a new person's data , wether that person will be saved or not. You can read more about this problem statement here
You can also get the data from same location. There are two important files there.
train.csv is the file which contain examples. we will analyze this data and create a mode which will know pattrens about people who were saved and who died. This execise needs python, ipython, numpy, pandas, matplotlib, sklearn to be installed on your machine. I will shortly create a post which has details to install all these.
Before Starting, Let us first try to know that what kind of columns we have in our data.
survival Survival (0 = No; 1 = Yes) pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) name Name sex Sex age Age sibsp Number of Siblings/Spouses Aboard parch Number of Parents/Children Aboard ticket Ticket Number fare Passenger Fare cabin Cabin embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Let us Start the fun now. First we will import some libraries
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
we have imported pandas, numpy, sklearn and matplotlib pandas is a great library to load and do EDA on data. sklearn will help us in different transformations and creating a model. matplotlib help us ploting data for visulaizations.
"%matplotlib inline" is a magic function in IPython. it helps us intialize matplotlib and display created plots in the notebook itself instead of a separate window.
Let us Now read the csv files
df = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
To get some information about loaded data, type following
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 73.1+ KB
To get basic parameters about different variables call describe function.
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
df.Survived.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xacb529ac>
df.Sex.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xacb78f4c>
df.Embarked.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xaca5a18c>
df.Pclass.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xacacae8c>
df.Age.hist()
<matplotlib.axes._subplots.AxesSubplot at 0xaca08c6c>
pclass_crosstab = pd.crosstab(df.Pclass,df.Survived)
pclass_crosstab
Survived | 0 | 1 |
---|---|---|
Pclass | ||
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
pclass_pct = pclass_crosstab.div(pclass_crosstab.sum(1).astype(float) , axis=0)
pclass_pct
Survived | 0 | 1 |
---|---|---|
Pclass | ||
1 | 0.370370 | 0.629630 |
2 | 0.527174 | 0.472826 |
3 | 0.757637 | 0.242363 |
pclass_pct.plot(kind='bar')
pclass_pct.plot(kind='bar' , stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0xac9570ac>
First get unique genders and the map the gender column to gender_map.male will be converted to 1 and female to 0.
gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
df['gend_map']= df.Sex.map(gend_mapping).astype(int)
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | gend_map | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
gend_crosstab = pd.crosstab(df.gend_map , df.Survived)
gend_pct = gend_crosstab.div(gend_crosstab.sum(1).astype(float) , axis=0)
gend_pct.plot(kind='bar' , stacked = True)
<matplotlib.axes._subplots.AxesSubplot at 0xac9c54ec>
#count number of males and females in each class
uniq_pclass = df.Pclass.unique()
print(uniq_pclass)
print("Males in 1st class : ",len(df[(df.Sex == 'male') & (df.Pclass == 1)]))
print("Females in 1st class : ",len(df[(df.Sex == 'female') & (df.Pclass == 1)]))
print("Males in 2nd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 2)]))
print("Females in 2nd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 2)]))
print("Males in 3rd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 3)]))
print("Females in 3rd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 3)]))
#for p_class in uniq_pclass :
[3 1 2] Males in 1st class : 122 Females in 1st class : 94 Males in 2nd class : 108 Females in 2nd class : 76 Males in 3rd class : 347 Females in 3rd class : 144
#female Survival plot by class
female_df = df[df.Sex == 'female']
female_df_crtb = pd.crosstab(female_df.Pclass,female_df.Survived)
female_df_crtb = female_df_crtb.div(female_df_crtb.sum(axis=1).astype(float),axis=0)
female_df_crtb.plot(kind='bar', stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0xac9fb3ec>
So Majority of First class and Second class passenger females were saved, while comparatively fewer number of females from 3rd class were saved.
male_df = df[df.Sex == 'male']
male_df_crtb=pd.crosstab(male_df.Pclass,male_df.Survived)
male_df_crtb=male_df_crtb.div(male_df_crtb.sum(axis=1).astype(float), axis=0)
male_df_crtb.plot(kind='bar' , stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0xac8ca16c>
df[df.Embarked.isnull()]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | gend_map | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NaN | 0 |
829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NaN | 0 |
#Encode Embarked column
embark_uniq = np.sort(df.Embarked.astype(str).unique())
embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
embark_enc
{'C': 0, 'Q': 1, 'S': 2, 'nan': 3}
df['Embarked_map'] = df.Embarked.map(embark_enc)
df.Embarked_map.hist(bins=len(embark_uniq),range=(0,3))
<matplotlib.axes._subplots.AxesSubplot at 0xac8a124c>
df.Embarked_map[df.Embarked_map.isnull()]
61 NaN 829 NaN Name: Embarked_map, dtype: float64
df.Embarked_map.fillna(2,inplace=True)
df.Embarked_map[df.Embarked_map.isnull()]
Series([], Name: Embarked_map, dtype: float64)
df.Embarked[df.Embarked.isnull()]
61 NaN 829 NaN Name: Embarked, dtype: object
df.Embarked.fillna('S',inplace=True)
df.Embarked[df.Embarked.isnull()]
Series([], Name: Embarked, dtype: object)
df.Embarked.unique()
array(['S', 'C', 'Q'], dtype=object)
Age is an Ordinal variable. It also has some missing values. there can be many strategies to fill missing values.
Here we will replace missing vaules in Age with mean value. One more interesting thing we can try is, replacing missing value with values in similar record. Let us assume we will replace age with mean of age in same Passenger Class and gender.
Let us see rows with missing Age value, with additional value of Pclass and gender
age_null = df[df.Age.isnull()][ ['Pclass','Sex' , 'Age']]
age_null.shape
(177, 3)
There are 177 records with null values for Age.
Let us create a new column 'Age_enc' , which will not have any null values replaced with mean(But mean can have decimal value, so lets use Median) in respective Pclass and Gender.
df['Age_enc'] = df['Age']
df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))
len(df[df.Age_enc.isnull()])
0
Let us do Cross tab analysis of Age variable with Survival column
age_crstb = pd.crosstab(df.Age_enc,df.Survived)
age_crstb.plot(kind='bar' , stacked=True)
<matplotlib.axes._subplots.AxesSubplot at 0xac878c8c>
Looks Like we have to bin the data
#bins= (df.Age_enc.max()/len(df.Age_enc))
print(df.Age_enc.max())
age_crstb.hist(bins= (df.Age_enc.max()/10),range=(1, df.Age_enc.max()) , stacked = True)
80.0
array([[<matplotlib.axes._subplots.AxesSubplot object at 0xac732bac>, <matplotlib.axes._subplots.AxesSubplot object at 0xac66f12c>]], dtype=object)
The plot above does not give any clear picture. Let us create a density distribution by class
p_classes = df.Pclass.unique()
for cl in p_classes :
df.Age_enc[df.Pclass == cl].plot(kind='kde')
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')
<matplotlib.legend.Legend at 0xac63be6c>
It looks like First class people were generally older than second class and so was the case with second class with third class.
Data science generally also involves creating new features by combing multiple already existing features. Let us find out how can we do this in this context
Let us combine 'parch' and 'Sibsp' coulmn and create a new column Family size 'FamilySz'
df['FamilySz'] = df.Parch + df.SibSp
df.head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | gend_map | Embarked_map | Age_enc | FamilySz | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | 2 | 22 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 | 38 | 1 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 2 | 26 | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 2 | 35 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 | 2 | 35 | 0 |
df.FamilySz.value_counts().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xa9ee3a8c>
fmly_crstb = pd.crosstab(df.FamilySz , df.Survived)
fmly_crstb.plot(kind = 'bar' , stacked = True)
<matplotlib.axes._subplots.AxesSubplot at 0xac5d18ac>
fmly_crstb.div(fmly_crstb.sum(axis=1),axis=0)
Survived | 0 | 1 |
---|---|---|
FamilySz | ||
0 | 0.696462 | 0.303538 |
1 | 0.447205 | 0.552795 |
2 | 0.421569 | 0.578431 |
3 | 0.275862 | 0.724138 |
4 | 0.800000 | 0.200000 |
5 | 0.863636 | 0.136364 |
6 | 0.666667 | 0.333333 |
7 | 1.000000 | 0.000000 |
10 | 1.000000 | 0.000000 |
These does not give any clear picture. Let us try executing machine learning algo on this data.
We are not going to use 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age', 'SibSp', 'Parch', 'PassengerId'. So let us drop these columns.
let us put all of our cleaning work in a function. So that we can use it with test data also.
def clean_data(df):
#Encoding Gender Column
gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
df['gend_map']= df.Sex.map(gend_mapping).astype(int)
#Encode Embarked column
df.Embarked.fillna('S',inplace=True)
embark_uniq = np.sort(df.Embarked.astype(str).unique())
embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
df['Embarked_map'] = df.Embarked.map(embark_enc)
# Fill in missing values of Fare with the average Fare
if len(df[df.Fare.isnull()] > 0):
avg_fare = df.Fare.mean()
df.replace({ None: avg_fare }, inplace=True)
#Encode Age
df['Age_enc'] = df.Age
df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))
#Create Family size column
df['FamilySz'] = df.Parch + df.SibSp
# Drop the columns we won't use:
df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked','Age', 'SibSp', 'Parch', 'PassengerId' ], axis=1)
return df
train = clean_data(df)
train.head(5)
Survived | Pclass | Fare | gend_map | Embarked_map | Age_enc | FamilySz | |
---|---|---|---|---|---|---|---|
0 | 0 | 3 | 7.2500 | 1 | 2 | 22 | 1 |
1 | 1 | 1 | 71.2833 | 0 | 0 | 38 | 1 |
2 | 1 | 3 | 7.9250 | 0 | 2 | 26 | 0 |
3 | 1 | 1 | 53.1000 | 0 | 2 | 35 | 1 |
4 | 0 | 3 | 8.0500 | 1 | 2 | 35 | 0 |
Create a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
Training the classifier
y = train.Survived
train = train.drop(['Survived'] , axis=1)
clf.fit(train , y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
clf.score(train,y)
0.98092031425364756
test = clean_data(test)
test.head()
Pclass | Fare | gend_map | Embarked_map | Age_enc | FamilySz | |
---|---|---|---|---|---|---|
0 | 3 | 7.8292 | 1 | 1 | 34.5 | 0 |
1 | 3 | 7.0000 | 0 | 2 | 47.0 | 1 |
2 | 2 | 9.6875 | 1 | 1 | 62.0 | 0 |
3 | 3 | 8.6625 | 1 | 2 | 27.0 | 0 |
4 | 3 | 12.2875 | 0 | 2 | 22.0 | 2 |
Trying Cross Validation with SKLearn
from sklearn import metrics
from sklearn.cross_validation import train_test_split
# Split 80-20 train vs test data
train_x, test_x, train_y, test_y = train_test_split(train,
y,
test_size=0.20,
random_state=0)
print (train.shape, y.shape)
print (train_x.shape, train_y.shape)
print (test_x.shape, test_y.shape)
(891, 6) (891,) (712, 6) (712,) (179, 6) (179,)
from sklearn.metrics import classification_report
print(classification_report(test_y,clf.predict(test_x)))
precision recall f1-score support 0 0.99 0.98 0.99 110 1 0.97 0.99 0.98 69 avg / total 0.98 0.98 0.98 179