This short notebook illustrates basic usage of the OutlierTree library for explainable outlier detection, using the Titanic dataset. For more details, see the package's documentation.
The dataset is very popular and can be downloaded from different sources, such as Kaggle or many university webpages. This notebook took it from the following link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv
import numpy as np, pandas as pd
from outliertree import OutlierTree
## Read the raw data, downloaded from here:
## https://github.com/jbryer/CompStats/raw/master/Data/titanic3.csv
titanic = pd.read_csv("titanic3.csv")
titanic.head()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
## Capitalize column names and some values for easier reading
titanic.columns = titanic.columns.str.capitalize()
titanic = titanic.rename(columns = {"Sibsp" : "SibSp"})
titanic["Sex"] = titanic["Sex"].str.capitalize()
## Convert 'survived' to yes/no for easier reading
titanic["Survived"] = titanic["Survived"].map({1: "Yes", 0: "No"}).astype("category")
## Some columns are not useful, such as name (an ID), ticket number (another ID),
## or destination (too many values, many non-repeated)
cols_drop = ["Name", "Ticket", "Home.dest"]
titanic = titanic.drop(cols_drop, axis=1)
## Ordinal columns need to be passed as ordered categoricals
cols_ord = ["Pclass", "Parch", "SibSp"]
for col in cols_ord:
    titanic[col] = pd.Categorical(titanic[col], ordered=True)
titanic.head()
| | Pclass | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Boat | Body |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Yes | Female | 29.00 | 0 | 0 | 211.3375 | B5 | S | 2 | NaN |
1 | 1 | Yes | Male | 0.92 | 1 | 2 | 151.5500 | C22 C26 | S | 11 | NaN |
2 | 1 | No | Female | 2.00 | 1 | 2 | 151.5500 | C22 C26 | S | NaN | NaN |
3 | 1 | No | Male | 30.00 | 1 | 2 | 151.5500 | C22 C26 | S | NaN | 135.0 |
4 | 1 | No | Female | 25.00 | 1 | 2 | 151.5500 | C22 C26 | S | NaN | NaN |
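As a minimal sketch of what the loop above does to each ordinal column, the snippet below builds an ordered categorical from a handful of made-up class values (the values are illustrative, not taken from the dataset). It only shows standard pandas behaviour: with `ordered=True`, the categories carry a rank, so order comparisons are defined.

```python
import pandas as pd

# Toy passenger classes; illustrative values, not rows from the Titanic data
pclass = pd.Series([3, 1, 2, 3, 1])

# ordered=True tells pandas that the categories have a rank: 1 < 2 < 3
pclass_ord = pd.Series(pd.Categorical(pclass, ordered=True))

print(pclass_ord.cat.ordered)           # whether the dtype carries an order
print(list(pclass_ord.cat.categories))  # categories, inferred in sorted order
print((pclass_ord > 1).tolist())        # order comparisons need an ordered categorical
```

Without `ordered=True`, the same column would be an unordered categorical, for which `>` and `<` comparisons raise an error in pandas.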
## Fit model with default hyperparameters
otree = OutlierTree()
otree.fit(titanic)
Reporting top 9 outliers [out of 9 found]

row [170] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.571% >= 25.74 - [mean: 55.22] - [sd: 27.56] - [norm. obs: 69]
    given:
        [Pclass] = [1]
        [Boat] in [9, B, 5, 7, C, 5 9, 1, 15, 5 7, 8 10, 12, 16, 13 15 B, C D, 15 16, 13 15] (value: C)

row [18] - suspicious column: [Age] - suspicious value: [32.00]
    distribution: 96.000% >= 43.00 - [mean: 48.35] - [sd: 3.16] - [norm. obs: 24]
    given:
        [Cabin] in [E12, D15, B10, E31, E58, C86, A16, A20, E63, C92, B82 B84, D33, B52 B54 B56, C124, D17, C110, C116, C126, D46] (value: D15)

row [896] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
    given:
        [Pclass] = [3]
        [SibSp] = [0]

row [898] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
    given:
        [Pclass] = [3]
        [SibSp] = [0]

row [963] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
    given:
        [Pclass] = [3]
        [SibSp] = [0]

row [1254] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 99.216% >= 3.17 - [mean: 9.68] - [sd: 6.98] - [norm. obs: 506]
    given:
        [Pclass] = [3]
        [SibSp] = [0]

row [1044] - suspicious column: [Fare] - suspicious value: [15.50]
    distribution: 96.774% <= 8.52 - [mean: 7.73] - [sd: 0.28] - [norm. obs: 30]
    given:
        [Pclass] = [3]
        [SibSp] = [0]
        [Boat] in [3, 10, 4, 9, 6, B, 8, A, 5, 7, 5 9, 1, 5 7, 8 10, 16, 13 15 B, 15 16, 13 15] (value: 16)

row [1146] - suspicious column: [Fare] - suspicious value: [29.12]
    distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
    given:
        [Pclass] = [3]
        [SibSp] = [0]
        [Embarked] = [Q]

row [1163] - suspicious column: [Fare] - suspicious value: [24.15]
    distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
    given:
        [Pclass] = [3]
        [SibSp] = [0]
        [Embarked] = [Q]

OutlierTree model
    Numeric variables: 3
    Categorical variables: 5
    Ordinal variables: 3

Consists of 221 clusters, spread across 18 tree branches
## Double-check the data (last 2 outliers)
titanic.loc[[1146, 1163]]
| | Pclass | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Boat | Body |
---|---|---|---|---|---|---|---|---|---|---|---|
1146 | 3 | No | Female | 39.0 | 0 | 5 | 29.125 | NaN | Q | NaN | 327.0 |
1163 | 3 | No | Male | NaN | 0 | 0 | 24.150 | NaN | Q | NaN | NaN |
## Distribution of the group from which those two outliers were flagged
%matplotlib inline
import matplotlib.pyplot as plt
titanic.loc[
    (titanic.Pclass == 3) &
    (titanic.SibSp == 0) &
    (titanic.Embarked == "Q")
].Fare.hist(bins=50, color="navy", edgecolor='black', linewidth=1.2)
plt.xlabel("Fare", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.title("Distribution of Fare within cluster", fontsize=20)
plt.show()
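To put numbers on how far those two fares sit from the rest of their cluster, the following sketch computes a plain z-score from the rounded mean and standard deviation printed in the report for this cluster. This is only an illustration of the arithmetic; it is an assumption that it approximates, rather than reproduces, whatever statistic the library computes internally.

```python
# Cluster statistics as printed in the report (rounded to two decimals)
cluster_mean = 7.89
cluster_sd = 1.17

# Fares of the two flagged rows, as shown in the table for rows 1146 and 1163
fares = {1146: 29.125, 1163: 24.15}

# Distance from the cluster mean, measured in cluster standard deviations
z_scores = {row: (fare - cluster_mean) / cluster_sd for row, fare in fares.items()}

for row, z in z_scores.items():
    print(f"row {row}: {z:.1f} standard deviations above the cluster mean")
```

Both values land more than a dozen standard deviations above the mean of a cluster whose fares are otherwise tightly concentrated, which matches what the histogram shows.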
## Get the outliers in a manipulable format
otree.predict(titanic).loc[[1146, 1163]]
| | suspicious_value | group_statistics | conditions | tree_depth | uses_NA_branch | outlier_score |
---|---|---|---|---|---|---|
1146 | {'column': 'Fare', 'value': 29.125, 'decimals'... | {'upper_thr': 15.5, 'pct_below': 0.97849462365... | [{'column': 'Embarked', 'comparison': '=', 'va... | 4.0 | False | 0.003805 |
1163 | {'column': 'Fare', 'value': 24.15, 'decimals': 0} | {'upper_thr': 15.5, 'pct_below': 0.97849462365... | [{'column': 'Embarked', 'comparison': '=', 'va... | 4.0 | False | 0.005227 |
## To programmatically get all the outliers that were flagged
pred = otree.predict(titanic)
pred.loc[~pred.outlier_score.isnull()]
| | suspicious_value | group_statistics | conditions | tree_depth | uses_NA_branch | outlier_score |
---|---|---|---|---|---|---|
18 | {'column': 'Age', 'value': 32.0, 'decimals': 0} | {'lower_thr': 43.0, 'pct_above': 0.96, 'mean':... | [{'column': 'Cabin', 'comparison': 'in', 'valu... | 3.0 | False | 0.007545 |
170 | {'column': 'Fare', 'value': 0.0, 'decimals': 0} | {'lower_thr': 25.7417, 'pct_above': 0.98571428... | [{'column': 'Boat', 'comparison': 'in', 'value... | 2.0 | False | 0.015339 |
896 | {'column': 'Fare', 'value': 0.0, 'decimals': 0} | {'lower_thr': 3.1708, 'pct_above': 0.992156862... | [{'column': 'Pclass', 'comparison': '=', 'valu... | 3.0 | False | 0.011148 |
898 | {'column': 'Fare', 'value': 0.0, 'decimals': 0} | {'lower_thr': 3.1708, 'pct_above': 0.992156862... | [{'column': 'Pclass', 'comparison': '=', 'valu... | 3.0 | False | 0.011148 |
963 | {'column': 'Fare', 'value': 0.0, 'decimals': 0} | {'lower_thr': 3.1708, 'pct_above': 0.992156862... | [{'column': 'Pclass', 'comparison': '=', 'valu... | 3.0 | False | 0.011148 |
1044 | {'column': 'Fare', 'value': 15.5, 'decimals': 0} | {'upper_thr': 8.5167, 'pct_below': 0.967741935... | [{'column': 'Boat', 'comparison': 'in', 'value... | 4.0 | False | 0.002018 |
1146 | {'column': 'Fare', 'value': 29.125, 'decimals'... | {'upper_thr': 15.5, 'pct_below': 0.97849462365... | [{'column': 'Embarked', 'comparison': '=', 'va... | 4.0 | False | 0.003805 |
1163 | {'column': 'Fare', 'value': 24.15, 'decimals': 0} | {'upper_thr': 15.5, 'pct_below': 0.97849462365... | [{'column': 'Embarked', 'comparison': '=', 'va... | 4.0 | False | 0.005227 |
1254 | {'column': 'Fare', 'value': 0.0, 'decimals': 0} | {'lower_thr': 3.1708, 'pct_above': 0.992156862... | [{'column': 'Pclass', 'comparison': '=', 'valu... | 3.0 | False | 0.011148 |
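Since `suspicious_value` holds one dictionary per flagged row, it can be convenient to unpack it into ordinary columns. The sketch below does this on a small mock frame rather than on the real `predict` output: `pred_mock` is a hypothetical stand-in whose two flagged rows copy the values shown above for rows 18 and 1163, and the assumption that non-flagged rows hold missing values is mine.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `predict` output, reduced to two columns.
# Row 0 represents a non-flagged row (assumed to contain missing values).
pred_mock = pd.DataFrame(
    {
        "suspicious_value": [
            np.nan,
            {"column": "Age", "value": 32.0, "decimals": 0},
            {"column": "Fare", "value": 24.15, "decimals": 0},
        ],
        "outlier_score": [np.nan, 0.007545, 0.005227],
    },
    index=[0, 18, 1163],
)

# Keep only the rows that were flagged as outliers
flagged = pred_mock.loc[pred_mock["outlier_score"].notnull()]

# Pull individual dictionary entries out into flat columns
flat = pd.DataFrame(
    {
        "column": flagged["suspicious_value"].map(lambda d: d["column"]),
        "value": flagged["suspicious_value"].map(lambda d: d["value"]),
        "outlier_score": flagged["outlier_score"],
    }
)
print(flat)
```

The same pattern should carry over to the other dictionary-valued columns, provided the keys being accessed are present for every flagged row.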
## To print selected rows only
otree.print_outliers(pred.loc[[1146]])
Reporting top 1 outliers [out of 1 found]

row [1146] - suspicious column: [Fare] - suspicious value: [29.12]
    distribution: 97.849% <= 15.50 - [mean: 7.89] - [sd: 1.17] - [norm. obs: 91]
    given:
        [Pclass] = [3]
        [SibSp] = [0]
        [Embarked] = [Q]
## In order to flag more outliers, one can also experiment
## with lowering the threshold hyperparameters
OutlierTree(z_outlier=6.).fit(titanic, outliers_print=5)
Reporting top 5 outliers [out of 20 found]

row [363] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

row [384] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

row [410] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

row [473] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

row [528] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

OutlierTree model
    Numeric variables: 3
    Categorical variables: 5
    Ordinal variables: 3

Consists of 217 clusters, spread across 18 tree branches
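The general effect of lowering a z-type threshold can be seen on a toy example. The sketch below is not the library's algorithm: it uses a simple robust z-score (median and scaled median absolute deviation) on made-up fares, and the `flag` helper is hypothetical. It only illustrates why a lower cutoff flags more points.

```python
import statistics

# Toy fares, made up for illustration (not taken from the Titanic data)
fares = [7.0, 7.25, 7.5, 7.6, 7.75, 7.8, 7.9, 8.0, 8.05, 8.1, 10.3, 30.0]

# Robust centre and spread: median and scaled median absolute deviation (MAD)
center = statistics.median(fares)
mad = statistics.median([abs(x - center) for x in fares])
scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation


def flag(values, z_threshold):
    """Return the values whose robust z-score exceeds z_threshold."""
    return [x for x in values if abs(x - center) / scale > z_threshold]


flagged_at_7 = flag(fares, 7.0)  # stricter cutoff
flagged_at_6 = flag(fares, 6.0)  # looser cutoff flags one more value

print(flagged_at_7)
print(flagged_at_6)
```

Here the fare of 10.3 sits between the two cutoffs, so it is only flagged once the threshold drops to 6.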
## One can also lower the gain threshold, but this tends
## to result in more spurious outliers which come from
## not-so-good splits (not recommended)
OutlierTree(z_outlier=6, min_gain=1e-6).fit(titanic, outliers_print=5)
Reporting top 5 outliers [out of 27 found]

row [545] - suspicious column: [SibSp] - suspicious value: [3]
    distribution: 99.701% in [0, 1, 2, 5, 8]
    ( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
    given:
        [Parch] = [0]

row [656] - suspicious column: [SibSp] - suspicious value: [3]
    distribution: 99.701% in [0, 1, 2, 5, 8]
    ( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
    given:
        [Parch] = [0]

row [1274] - suspicious column: [SibSp] - suspicious value: [3]
    distribution: 99.701% in [0, 1, 2, 5, 8]
    ( [norm. obs: 999] - [prior_prob: 1.528%] - [next smallest: 2.595%] )
    given:
        [Parch] = [0]

row [363] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

row [384] - suspicious column: [Fare] - suspicious value: [0.00]
    distribution: 98.555% >= 3.17 - [mean: 11.66] - [sd: 9.02] - [norm. obs: 682]
    given:
        [Pclass] in [2, 3] (value: 2)
        [SibSp] = [0]

OutlierTree model
    Numeric variables: 3
    Categorical variables: 5
    Ordinal variables: 3

Consists of 283 clusters, spread across 23 tree branches