Explainable Outlier Detection in Titanic dataset¶

This short notebook illustrates basic usage of the OutlierTree library for explainable outlier detection using the Titanic dataset. For more details, you can check the package's documentation here.

The dataset is very popular and can be downloaded from different sources, such as Kaggle or many university webpages. This notebook took it from the following link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv

Loading the raw data¶

In [1]:

import numpy as np, pandas as pd
from outliertree import OutlierTree

## Read the raw data, downloaded from here:
## https://github.com/jbryer/CompStats/raw/master/Data/titanic3.csv
titanic = pd.read_csv("titanic3.csv")
titanic.head()

Out[1]:

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.00	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.92	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.00	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.00	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.00	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

Pre-processing the data¶

In [2]:

## Capitalize column names and some values for easier reading
titanic.columns = titanic.columns.str.capitalize()
titanic = titanic.rename(columns = {"Sibsp" : "SibSp"})
titanic["Sex"] = titanic["Sex"].str.capitalize()

## Convert 'survived' to yes/no for easier reading
titanic["Survived"] = titanic["Survived"].astype("category").replace({1:"Yes", 0:"No"})

## Some columns are not useful, such as name (an ID), ticket number (another ID),
## or destination (too many values, many non-repeated)
cols_drop = ["Name", "Ticket", "Home.dest"]
titanic = titanic.drop(cols_drop, axis=1)

## Ordinal columns need to be passed as ordered categoricals
cols_ord = ["Pclass", "Parch", "SibSp"]
for col in cols_ord:
    titanic[col] = pd.Categorical(titanic[col], ordered=True)

titanic.head()

Out[2]:

	Pclass	Survived	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked	Boat	Body
0	1	Yes	Female	29.00	0	0	211.3375	B5	S	2	NaN
1	1	Yes	Male	0.92	1	2	151.5500	C22 C26	S	11	NaN
2	1	No	Female	2.00	1	2	151.5500	C22 C26	S	NaN	NaN
3	1	No	Male	30.00	1	2	151.5500	C22 C26	S	NaN	135.0
4	1	No	Female	25.00	1	2	151.5500	C22 C26	S	NaN	NaN

Fitting a model¶

In [3]:

## Fit model with default hyperparameters
otree = OutlierTree()
otree.fit(titanic)

Reporting top 9 outliers [out of 9 found]


row [170] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.571% >= 25.74 - [mean: 55.22] - [sd: 27.56] - [norm. obs: 69]
	given:
		[Pclass] = [1]
		[Boat] in [9, B, 5, 7, C, 5 9, 1, 15, 5 7, 8 10, 12, 16, 13 15 B, C D, 15 16, 13 15] (value: C)


row [18] - suspicious column: [Age] - suspicious value: [32.00]
	distribution: 96.000% >= 43.00 - [mean: 48.35] - [sd: 3.16] - [norm. obs: 24]
	given:
		[Cabin] in [E12, D15, B10, E31, E58, C86, A16, A20, E63, C92, B82 B84, D33, B52 B54 B56, C124, D17, C110, C116, C126, D46] (value: D15)


row [896] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
	given:
		[Pclass] = [3]
		[SibSp] = [0]


row [898] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
	given:
		[Pclass] = [3]
		[SibSp] = [0]


row [963] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
	given:
		[Pclass] = [3]
		[SibSp] = [0]


row [1254] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
	given:
		[Pclass] = [3]
		[SibSp] = [0]


row [1044] - suspicious column: [Fare] - suspicious value: [15.50]
	distribution: 96.774% <= 8.52 - [mean: 7.73] - [sd: 0.28] - [norm. obs: 30]
	given:
		[Pclass] = [3]
		[SibSp] = [0]
		[Boat] in [3, 10, 4, 9, 6, B, 8, A, 5, 7, 5 9, 1, 5 7, 8 10, 16, 13 15 B, 15 16, 13 15] (value: 16)


row [1146] - suspicious column: [Fare] - suspicious value: [29.12]
	distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
	given:
		[Pclass] = [3]
		[SibSp] = [0]
		[Embarked] = [Q]


row [1163] - suspicious column: [Fare] - suspicious value: [24.15]
	distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
	given:
		[Pclass] = [3]
		[SibSp] = [0]
		[Embarked] = [Q]

Out[3]:

OutlierTree model
	Numeric variables: 3
	Categorical variables: 5
	Ordinal variables: 3

Consists of 221 clusters, spread across 18 tree branches

Examining the results more closely¶

In [4]:

## Double-check the data (last 2 outliers)
titanic.loc[[1146, 1163]]

Out[4]:

	Pclass	Survived	Sex	Age	SibSp	Parch	Fare	Cabin	Embarked	Boat	Body
1146	3	No	Female	39.0	0	5	29.125	NaN	Q	NaN	327.0
1163	3	No	Male	NaN	0	0	24.150	NaN	Q	NaN	NaN

In [5]:

## Distribution of the group from which those two outliers were flagged
%matplotlib inline
import matplotlib.pyplot as plt
titanic.loc[
    (titanic.Pclass == 3)     &
    (titanic.SibSp == 0)      &
    (titanic.Embarked == "Q")
] .Fare.hist(bins=50, color="navy", edgecolor='black', linewidth=1.2)
plt.xlabel("Fare", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.title("Distribution of Fare within cluster", fontsize=20)
plt.show()

In [6]:

## Get the outliers in a manipulable format
otree.predict(titanic).loc[[1146, 1163]]

Out[6]:

	suspicious_value	group_statistics	conditions	tree_depth	uses_NA_branch	outlier_score
1146	{'column': 'Fare', 'value': 29.125, 'decimals'...	{'upper_thr': 15.5, 'pct_below': 0.97849462365...	[{'column': 'Embarked', 'comparison': '=', 'va...	4.0	False	0.003805
1163	{'column': 'Fare', 'value': 24.15, 'decimals': 0}	{'upper_thr': 15.5, 'pct_below': 0.97849462365...	[{'column': 'Embarked', 'comparison': '=', 'va...	4.0	False	0.005227

In [7]:

## To programatically get all the outliers that were flagged
pred = otree.predict(titanic)
pred.loc[~pred.outlier_score.isnull()]

Out[7]:

	suspicious_value	group_statistics	conditions	tree_depth	uses_NA_branch	outlier_score
18	{'column': 'Age', 'value': 32.0, 'decimals': 0}	{'lower_thr': 43.0, 'pct_above': 0.96, 'mean':...	[{'column': 'Cabin', 'comparison': 'in', 'valu...	3.0	False	0.007545
170	{'column': 'Fare', 'value': 0.0, 'decimals': 0}	{'lower_thr': 25.7417, 'pct_above': 0.98571428...	[{'column': 'Boat', 'comparison': 'in', 'value...	2.0	False	0.015339
896	{'column': 'Fare', 'value': 0.0, 'decimals': 0}	{'lower_thr': 3.1708, 'pct_above': 0.992156862...	[{'column': 'Pclass', 'comparison': '=', 'valu...	3.0	False	0.011148
898	{'column': 'Fare', 'value': 0.0, 'decimals': 0}	{'lower_thr': 3.1708, 'pct_above': 0.992156862...	[{'column': 'Pclass', 'comparison': '=', 'valu...	3.0	False	0.011148
963	{'column': 'Fare', 'value': 0.0, 'decimals': 0}	{'lower_thr': 3.1708, 'pct_above': 0.992156862...	[{'column': 'Pclass', 'comparison': '=', 'valu...	3.0	False	0.011148
1044	{'column': 'Fare', 'value': 15.5, 'decimals': 0}	{'upper_thr': 8.5167, 'pct_below': 0.967741935...	[{'column': 'Boat', 'comparison': 'in', 'value...	4.0	False	0.002018
1146	{'column': 'Fare', 'value': 29.125, 'decimals'...	{'upper_thr': 15.5, 'pct_below': 0.97849462365...	[{'column': 'Embarked', 'comparison': '=', 'va...	4.0	False	0.003805
1163	{'column': 'Fare', 'value': 24.15, 'decimals': 0}	{'upper_thr': 15.5, 'pct_below': 0.97849462365...	[{'column': 'Embarked', 'comparison': '=', 'va...	4.0	False	0.005227
1254	{'column': 'Fare', 'value': 0.0, 'decimals': 0}	{'lower_thr': 3.1708, 'pct_above': 0.992156862...	[{'column': 'Pclass', 'comparison': '=', 'valu...	3.0	False	0.011148

In [8]:

## To print selected rows only
otree.print_outliers(pred.loc[[1146]])

Reporting top 1 outliers [out of 1 found]


row [1146] - suspicious column: [Fare] - suspicious value: [29.12]
	distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
	given:
		[Pclass] = [3]
		[SibSp] = [0]
		[Embarked] = [Q]

Trying different hyperparameters¶

In [9]:

## In order to flag more outliers, one can also experiment
## with lowering the threshold hyperparameters
OutlierTree(z_outlier=6.).fit(titanic, outliers_print=5)

Reporting top 5 outliers [out of 20 found]


row [363] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]


row [384] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]


row [410] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]


row [473] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]


row [528] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]

Out[9]:

OutlierTree model
	Numeric variables: 3
	Categorical variables: 5
	Ordinal variables: 3

Consists of 217 clusters, spread across 18 tree branches

In [10]:

## One can also lower the gain threshold, but this tends
## to result in more spurious outliers which come from
## not-so-good splits (not recommended)
OutlierTree(z_outlier=6, min_gain=1e-6).fit(titanic, outliers_print=5)

Reporting top 5 outliers [out of 27 found]


row [545] - suspicious column: [SibSp] - suspicious value: [3]
	distribution: 99.701% in [0, 1, 2, 5, 8]
	( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
	given:
		[Parch] = [0]


row [656] - suspicious column: [SibSp] - suspicious value: [3]
	distribution: 99.701% in [0, 1, 2, 5, 8]
	( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
	given:
		[Parch] = [0]


row [1274] - suspicious column: [SibSp] - suspicious value: [3]
	distribution: 99.701% in [0, 1, 2, 5, 8]
	( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
	given:
		[Parch] = [0]


row [363] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]


row [384] - suspicious column: [Fare] - suspicious value: [0.00]
	distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
	given:
		[Pclass] in [2, 3] (value: 2)
		[SibSp] = [0]

Out[10]:

OutlierTree model
	Numeric variables: 3
	Categorical variables: 5
	Ordinal variables: 3

Consists of 283 clusters, spread across 23 tree branches