Introduction¶

When i started participating in Kaggle competitions. I started with this problem.I feel this is a very easy problem to start learning Data Science. So i thought let me share how i approached this problem. Please feel free to share your comments. I have used IPython to solve this problem. then converted ipython notebook into html file. So if you feel the UI of this page is screwed up. you can see this notebook here

The Problem Statement is very simple. By seeing some example data about people who survived and who died in Titanic,we need to predict that given a new person's data , wether that person will be saved or not. You can read more about this problem statement here

Data¶

You can also get the data from same location. There are two important files there.

train.csv
test.csv

train.csv is the file which contain examples. we will analyze this data and create a mode which will know pattrens about people who were saved and who died. This execise needs python, ipython, numpy, pandas, matplotlib, sklearn to be installed on your machine. I will shortly create a post which has details to install all these.

Before Starting, Let us first try to know that what kind of columns we have in our data.

survival Survival (0 = No; 1 = Yes) pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) name Name sex Sex age Age sibsp Number of Siblings/Spouses Aboard parch Number of Parents/Children Aboard ticket Ticket Number fare Passenger Fare cabin Cabin embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Let us Start the fun now. First we will import some libraries

Some Imports¶

In [1]:

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt

%matplotlib inline

we have imported pandas, numpy, sklearn and matplotlib pandas is a great library to load and do EDA on data. sklearn will help us in different transformations and creating a model. matplotlib help us ploting data for visulaizations.

"%matplotlib inline" is a magic function in IPython. it helps us intialize matplotlib and display created plots in the notebook itself instead of a separate window.

Read Data¶

Let us Now read the csv files

In [2]:

df = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')

Start Exploring¶

To get some information about loaded data, type following

In [3]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 73.1+ KB

To get basic parameters about different variables call describe function.

In [4]:

df.describe()

Out[4]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Data Ploting as part of Exploration¶

Survived vs Died¶

In [5]:

df.Survived.value_counts().plot(kind='bar')

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0xacb529ac>

No of Males vs Females¶

In [6]:

df.Sex.value_counts().plot(kind='bar')

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0xacb78f4c>

No of people who boarded from different points¶

In [7]:

df.Embarked.value_counts().plot(kind='bar')

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0xaca5a18c>

No of people in different classes¶

In [8]:

df.Pclass.value_counts().plot(kind='bar')

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0xacacae8c>

Age distribution¶

In [9]:

df.Age.hist()

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0xaca08c6c>

Exploring Relationship between two variables¶

How many people survived in different classes¶

In [10]:

pclass_crosstab = pd.crosstab(df.Pclass,df.Survived)
pclass_crosstab

Out[10]:

Survived	0	1
Pclass
1	80	136
2	97	87
3	372	119

Lets explore by percentage¶

In [11]:

pclass_pct = pclass_crosstab.div(pclass_crosstab.sum(1).astype(float) , axis=0)
pclass_pct

Out[11]:

Survived	0	1
Pclass
1	0.370370	0.629630
2	0.527174	0.472826
3	0.757637	0.242363

Lets Visualize the cross tab table¶

In [12]:

pclass_pct.plot(kind='bar')
pclass_pct.plot(kind='bar' , stacked=True)

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0xac9570ac>

Exploring how much gender played any part in saving peoples¶

First get unique genders and the map the gender column to gender_map.male will be converted to 1 and female to 0.

In [13]:

gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))

In [14]:

df['gend_map']= df.Sex.map(gend_mapping).astype(int)
df.head()

Out[14]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	gend_map
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NaN	S	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	PC 17599	71.2833	C85	C	0
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NaN	S	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S	0
4	5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500	NaN	S	1

Gender vs Survival Analysis¶

In [15]:

gend_crosstab = pd.crosstab(df.gend_map , df.Survived)
gend_pct = gend_crosstab.div(gend_crosstab.sum(1).astype(float) , axis=0)

In [16]:

gend_pct.plot(kind='bar' , stacked = True)

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0xac9c54ec>

Analyze Combined effect of gender and class on Survival output¶

In [17]:

#count number of males and females in each class
uniq_pclass = df.Pclass.unique()
print(uniq_pclass)
print("Males in 1st class : ",len(df[(df.Sex == 'male') & (df.Pclass == 1)]))
print("Females in 1st class : ",len(df[(df.Sex == 'female') & (df.Pclass == 1)]))
print("Males in 2nd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 2)]))
print("Females in 2nd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 2)]))
print("Males in 3rd class : ",len(df[(df.Sex == 'male') & (df.Pclass == 3)]))
print("Females in 3rd class : ",len(df[(df.Sex == 'female') & (df.Pclass == 3)]))
#for p_class in uniq_pclass :
    

[3 1 2]
Males in 1st class :  122
Females in 1st class :  94
Males in 2nd class :  108
Females in 2nd class :  76
Males in 3rd class :  347
Females in 3rd class :  144

Get only Female data¶

In [18]:

#female Survival plot by class
female_df = df[df.Sex == 'female']

female_df_crtb = pd.crosstab(female_df.Pclass,female_df.Survived)
female_df_crtb = female_df_crtb.div(female_df_crtb.sum(axis=1).astype(float),axis=0)

Analyze how female survival rate was affected by Passenger class¶

In [19]:

female_df_crtb.plot(kind='bar', stacked=True)

Out[19]:

<matplotlib.axes._subplots.AxesSubplot at 0xac9fb3ec>

So Majority of First class and Second class passenger females were saved, while comparatively fewer number of females from 3rd class were saved.

Analyze how male survival rate was affected by Passenger class¶

Select only Male data and plot¶

In [20]:

male_df = df[df.Sex == 'male']
male_df_crtb=pd.crosstab(male_df.Pclass,male_df.Survived)
male_df_crtb=male_df_crtb.div(male_df_crtb.sum(axis=1).astype(float), axis=0)

In [21]:

male_df_crtb.plot(kind='bar' , stacked=True)

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0xac8ca16c>

Handling Null values¶

Let us see how many null values Embarked column has¶

In [22]:

df[df.Embarked.isnull()]

Out[22]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	gend_map
61	62	1	1	Icard, Miss. Amelie	female	38	0	0	113572	80	B28	NaN	0
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62	0	0	113572	80	B28	NaN	0

Encode Categorical variable embarked to numerical values instead of text values¶

In [23]:

#Encode Embarked column
embark_uniq = np.sort(df.Embarked.astype(str).unique())
embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
embark_enc

Out[23]:

{'C': 0, 'Q': 1, 'S': 2, 'nan': 3}

Create new Encoded column Embarked_map¶

In [24]:

df['Embarked_map'] = df.Embarked.map(embark_enc)

From where most of people started their Journey?¶

In [25]:

df.Embarked_map.hist(bins=len(embark_uniq),range=(0,3))

Out[25]:

<matplotlib.axes._subplots.AxesSubplot at 0xac8a124c>

Replace null values with most frequent ouccurence¶

In [26]:

df.Embarked_map[df.Embarked_map.isnull()]

Out[26]:

61    NaN
829   NaN
Name: Embarked_map, dtype: float64

In [27]:

df.Embarked_map.fillna(2,inplace=True)

In [28]:

df.Embarked_map[df.Embarked_map.isnull()]

Out[28]:

Series([], Name: Embarked_map, dtype: float64)

In [29]:

df.Embarked[df.Embarked.isnull()]

Out[29]:

61     NaN
829    NaN
Name: Embarked, dtype: object

In [30]:

df.Embarked.fillna('S',inplace=True)

In [31]:

df.Embarked[df.Embarked.isnull()]

Out[31]:

Series([], Name: Embarked, dtype: object)

In [32]:

df.Embarked.unique()

Out[32]:

array(['S', 'C', 'Q'], dtype=object)

Analysing Age¶

Age is an Ordinal variable. It also has some missing values. there can be many strategies to fill missing values.

Replace missing value with max occurence (we did this in Embarked case.)
Replace missing value with mean value.

Here we will replace missing vaules in Age with mean value. One more interesting thing we can try is, replacing missing value with values in similar record. Let us assume we will replace age with mean of age in same Passenger Class and gender.

Let us see rows with missing Age value, with additional value of Pclass and gender

In [33]:

age_null = df[df.Age.isnull()][ ['Pclass','Sex' , 'Age']]
age_null.shape

Out[33]:

(177, 3)

There are 177 records with null values for Age.

Let us create a new column 'Age_enc' , which will not have any null values replaced with mean(But mean can have decimal value, so lets use Median) in respective Pclass and Gender.

In [34]:

df['Age_enc'] = df['Age']
df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))

In [35]:

len(df[df.Age_enc.isnull()])

Out[35]:

Age vs Survival Column¶

Let us do Cross tab analysis of Age variable with Survival column

In [36]:

age_crstb = pd.crosstab(df.Age_enc,df.Survived)
age_crstb.plot(kind='bar' , stacked=True)

Out[36]:

<matplotlib.axes._subplots.AxesSubplot at 0xac878c8c>

Looks Like we have to bin the data

In [37]:

#bins= (df.Age_enc.max()/len(df.Age_enc))
print(df.Age_enc.max())
age_crstb.hist(bins= (df.Age_enc.max()/10),range=(1, df.Age_enc.max()) , stacked = True)

80.0

Out[37]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0xac732bac>,
        <matplotlib.axes._subplots.AxesSubplot object at 0xac66f12c>]], dtype=object)

The plot above does not give any clear picture. Let us create a density distribution by class

In [38]:

p_classes = df.Pclass.unique()

for cl in p_classes :
    df.Age_enc[df.Pclass == cl].plot(kind='kde')
    
plt.legend(('1st Class', '2nd Class', '3rd Class'), loc='best')

Out[38]:

<matplotlib.legend.Legend at 0xac63be6c>

It looks like First class people were generally older than second class and so was the case with second class with third class.

Creating new Features¶

Data science generally also involves creating new features by combing multiple already existing features. Let us find out how can we do this in this context

Let us combine 'parch' and 'Sibsp' coulmn and create a new column Family size 'FamilySz'

In [39]:

df['FamilySz'] = df.Parch + df.SibSp
df.head(5)

Out[39]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	gend_map	Embarked_map	Age_enc	FamilySz
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NaN	S	1	2	22	1
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	PC 17599	71.2833	C85	C	0	0	38	1
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NaN	S	0	2	26	0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.1000	C123	S	0	2	35	1
4	5	0	3	Allen, Mr. William Henry	male	35	0	373450	8.0500	NaN	S	1	2	35	0

In [40]:

df.FamilySz.value_counts().plot(kind='bar')

Out[40]:

<matplotlib.axes._subplots.AxesSubplot at 0xa9ee3a8c>

Plot Family Size vs Survival¶

In [41]:

fmly_crstb = pd.crosstab(df.FamilySz , df.Survived)
fmly_crstb.plot(kind = 'bar' , stacked = True)

Out[41]:

<matplotlib.axes._subplots.AxesSubplot at 0xac5d18ac>

In [42]:

fmly_crstb.div(fmly_crstb.sum(axis=1),axis=0)

Out[42]:

Survived	0	1
FamilySz
0	0.696462	0.303538
1	0.447205	0.552795
2	0.421569	0.578431
3	0.275862	0.724138
4	0.800000	0.200000
5	0.863636	0.136364
6	0.666667	0.333333
7	1.000000	0.000000
10	1.000000	0.000000

These does not give any clear picture. Let us try executing machine learning algo on this data.

Apply Machine Learning on data¶

Dropping unnecessery columns¶

We are not going to use 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked', 'Age', 'SibSp', 'Parch', 'PassengerId'. So let us drop these columns.

let us put all of our cleaning work in a function. So that we can use it with test data also.

In [43]:

def clean_data(df):
    #Encoding Gender Column
    gend_mapping = dict(zip(np.sort(df.Sex.unique()), range(len(df.Sex.unique()))))
    df['gend_map']= df.Sex.map(gend_mapping).astype(int)
    
    #Encode Embarked column
    df.Embarked.fillna('S',inplace=True)
    embark_uniq = np.sort(df.Embarked.astype(str).unique())
    embark_enc = dict(zip(embark_uniq,range(len(embark_uniq))))
    df['Embarked_map'] = df.Embarked.map(embark_enc)
    
    # Fill in missing values of Fare with the average Fare
    if len(df[df.Fare.isnull()] > 0):
        avg_fare = df.Fare.mean()
        df.replace({ None: avg_fare }, inplace=True)

    #Encode Age
    df['Age_enc'] = df.Age
    df.Age_enc = df.Age_enc.groupby([df.gend_map , df.Pclass]).apply(lambda x: x.fillna(x.median()))
    
    #Create Family size column
    df['FamilySz'] = df.Parch + df.SibSp
                   
    # Drop the columns we won't use:
    df = df.drop(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked','Age', 'SibSp', 'Parch', 'PassengerId' ], axis=1)
    
    return df
    

In [44]:

train = clean_data(df)
train.head(5)

Out[44]:

	Survived	Pclass	Fare	gend_map	Embarked_map	Age_enc	FamilySz
0	0	3	7.2500	1	2	22	1
1	1	1	71.2833	0	0	38	1
2	1	3	7.9250	0	2	26	0
3	1	1	53.1000	0	2	35	1
4	0	3	8.0500	1	2	35	0

Create a RandomForestClassifier

In [45]:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

Training the classifier

In [46]:

y = train.Survived
train = train.drop(['Survived'] , axis=1)
clf.fit(train , y)

Out[46]:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [47]:

clf.score(train,y)

Out[47]:

0.98092031425364756

In [48]:

test = clean_data(test)
test.head()

Out[48]:

	Pclass	Fare	gend_map	Embarked_map	Age_enc	FamilySz
0	3	7.8292	1	1	34.5	0
1	3	7.0000	0	2	47.0	1
2	2	9.6875	1	1	62.0	0
3	3	8.6625	1	2	27.0	0
4	3	12.2875	0	2	22.0	2

Trying Cross Validation with SKLearn

In [49]:

from sklearn import metrics
from sklearn.cross_validation import train_test_split

# Split 80-20 train vs test data
train_x, test_x, train_y, test_y = train_test_split(train, 
                                                    y, 
                                                    test_size=0.20, 
                                                    random_state=0)
print (train.shape, y.shape)
print (train_x.shape, train_y.shape)
print (test_x.shape, test_y.shape)

(891, 6) (891,)
(712, 6) (712,)
(179, 6) (179,)

Creating Classification report¶

In [50]:

from sklearn.metrics import classification_report
print(classification_report(test_y,clf.predict(test_x)))

             precision    recall  f1-score   support

          0       0.99      0.98      0.99       110
          1       0.97      0.99      0.98        69

avg / total       0.98      0.98      0.98       179