Mahalanobis Distance and Outliers blog post at http://kldavenport.com
import numpy as np
# as column vectors
x = np.random.poisson(5,10)
y = np.random.poisson(5,10)
# examine the shape of the numpy arrays
print x.shape
print y.shape
(10,) (10,)
An examination of what numpy's cov expects: A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables. In the event we are dealing with column vectors instead of row vectors, the rowvar parameter can be set to 1.
Here we calculate the covariance matrix:
covariance_xy = np.cov(x,y, rowvar=0)
inv_covariance_xy = np.linalg.inv(covariance_xy)
# Examine inverse covariance matrix and shape
print inv_covariance_xy
print inv_covariance_xy.shape
[[ 0.18716976 -0.05487061] [-0.05487061 0.11380572]] (2, 2)
Center each value by the mean by subtracting the mean from i in array x and y.
xy_mean = np.mean(x),np.mean(y)
x_diff = np.array([x_i - xy_mean[0] for x_i in x])
y_diff = np.array([y_i - xy_mean[1] for y_i in y])
print x_diff.shape
(10,)
Join the x_diff and y_diff arrays into (10 x 2) array to be used in our formula later. Consider these coordinate pairs.
diff_xy = np.transpose([x_diff, y_diff])
diff_xy.shape, diff_xy
((10, 2), array([[ 4. , -3.3], [-2. , -1.3], [-2. , -1.3], [-3. , -1.3], [ 2. , 4.7], [-1. , -2.3], [ 3. , 6.7], [ 2. , -1.3], [-1. , 0.7], [-2. , -1.3]]))
As stated above: $ MD = \sqrt{ (x - \mu)^T {\rm \bf \Sigma}^{-1} (x - \mu)}$,
So to put together this formula we'll need the square root of the transpose of our diff_xy array, the inverse of our covariance_xy matrix which we've assigned to inv_covariance_xy, and our original diff_xy array. Examining their shapes below for one final check before performing dot multiplication (remember u•v = v•u and u•(v + w) = u•w + v•w):
print np.transpose(diff_xy[1]).shape
print inv_covariance_xy.shape
print diff_xy.shape
(2,) (2, 2) (10, 2)
md = []
for i in range(len(diff_xy)):
md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),inv_covariance_xy),diff_xy[i])))
md
[2.3838298236449629, 0.80974286765570169, 0.80974286765570169, 1.2036896265134212, 1.4936799521601556, 0.73266444021370714, 2.1418374874831754, 1.107401415293582, 0.5654674357959717, 0.80974286765570169]
def MahalanobisDist(x, y):
covariance_xy = np.cov(x,y, rowvar=0)
inv_covariance_xy = np.linalg.inv(covariance_xy)
xy_mean = np.mean(x),np.mean(y)
x_diff = np.array([x_i - xy_mean[0] for x_i in x])
y_diff = np.array([y_i - xy_mean[1] for y_i in y])
diff_xy = np.transpose([x_diff, y_diff])
md = []
for i in range(len(diff_xy)):
md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),inv_covariance_xy),diff_xy[i])))
return md
MahalanobisDist(x,y)
[2.3838298236449629, 0.80974286765570169, 0.80974286765570169, 1.2036896265134212, 1.4936799521601556, 0.73266444021370714, 2.1418374874831754, 1.107401415293582, 0.5654674357959717, 0.80974286765570169]
def MD_removeOutliers(x, y):
MD = MahalanobisDist(x, y)
threshold = np.mean(MD) * 1.5 # adjust 1.5 accordingly
nx, ny, outliers = [], [], []
for i in range(len(MD)):
if MD[i] <= threshold:
nx.append(x[i])
ny.append(y[i])
else:
outliers.append(i) # position of removed pair
return (np.array(nx), np.array(ny), np.array(outliers))
print 'x:', x
print 'y:', y
MD_removeOutliers(x,y)
x: [10 4 4 3 8 5 9 8 5 4] y: [ 2 4 4 4 10 3 12 4 6 4]
(array([4, 4, 3, 8, 5, 8, 5, 4]), array([ 4, 4, 4, 10, 3, 4, 6, 4]), array([0, 6]))
Pairs in position 6 and 8 were removed which were (6,9) where MD = 2.023 and (7,2) where MD = 2.059 Graphing them below:
%load_ext rmagic
#%reload_ext rmagic
%%R -i DF_diff_xy # list object to be transferred to python here
install.packages("ggplot2") # Had to add this for some reason, shouldn't be necessary
library(ggplot2)
df = data.frame(DF_diff_xy)
plot = ggplot(df, aes(x = X, y = Y)) +
geom_point(alpha = .8, color = 'dodgerblue',size = 5) +
geom_point(data=subset(df, Y >= 6.7 | X >= 4), color = 'red',size = 6) +
theme(axis.text.x = element_text(size= rel(1.5),angle=90, hjust=1)) +
ggtitle('Distance Pairs with outliers highlighted in red')
print(plot)
My attempt to plot a subset of the points in the yhat port of ggplot was not succesful. View the discussion here: https://github.com/yhat/ggplot/issues/116
import pandas as pd
DF_diff_xy = pd.DataFrame(diff_xy)
DF_diff_xy.rename(columns = lambda x: str(x), inplace=True)
DF_diff_xy.rename(columns={"0": "X"}, inplace=True) # rename a dfcolumn
DF_diff_xy.rename(columns={"1": "Y"}, inplace=True) # rename a dfcolumn
DF_diff_xy
X | Y | |
---|---|---|
0 | 4 | -3.3 |
1 | -2 | -1.3 |
2 | -2 | -1.3 |
3 | -3 | -1.3 |
4 | 2 | 4.7 |
5 | -1 | -2.3 |
6 | 3 | 6.7 |
7 | 2 | -1.3 |
8 | -1 | 0.7 |
9 | -2 | -1.3 |
from ggplot import *
ggplot(DF_diff_xy, aes(x = 'X', y ='Y')) + \
geom_point(alpha=1, size=100, color='dodgerblue') + \
geom_point(data = DF_diff_xy[:1],alpha=1, size = 100, color='red')
<ggplot: (275056345)>
plt.show()