Mahalanobis Distance and Outliers blog post at http://kldavenport.com

Calculating the Mahalanobis Distance and detecting outliers¶

1. Generate two random 1-D arrays:¶

In [6]:

import numpy as np
# as column vectors
x = np.random.poisson(5,10)
y = np.random.poisson(5,10)

# examine the shape of the numpy arrays
print x.shape
print y.shape

(10,)
(10,)

2. Estimate a covariance matrix for (x,y):¶

An examination of what numpy's cov expects: A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables. In the event we are dealing with column vectors instead of row vectors, the rowvar parameter can be set to 1.

Here we calculate the covariance matrix:

In [7]:

covariance_xy = np.cov(x,y, rowvar=0)
inv_covariance_xy = np.linalg.inv(covariance_xy)

# Examine inverse covariance matrix and shape
print inv_covariance_xy
print inv_covariance_xy.shape

[[ 0.18716976 -0.05487061]
 [-0.05487061  0.11380572]]
(2, 2)

3. Center each value by the mean:¶

Center each value by the mean by subtracting the mean from i in array x and y.

In [8]:

xy_mean = np.mean(x),np.mean(y)
x_diff = np.array([x_i - xy_mean[0] for x_i in x])
y_diff = np.array([y_i - xy_mean[1] for y_i in y])
print x_diff.shape

(10,)

Join the x_diff and y_diff arrays into (10 x 2) array to be used in our formula later. Consider these coordinate pairs.

In [9]:

diff_xy = np.transpose([x_diff, y_diff])
diff_xy.shape, diff_xy

Out[9]:

((10, 2),
 array([[ 4. , -3.3],
       [-2. , -1.3],
       [-2. , -1.3],
       [-3. , -1.3],
       [ 2. ,  4.7],
       [-1. , -2.3],
       [ 3. ,  6.7],
       [ 2. , -1.3],
       [-1. ,  0.7],
       [-2. , -1.3]]))

4. Formula building¶

As stated above: $ MD = \sqrt{ (x - \mu)^T {\rm \bf \Sigma}^{-1} (x - \mu)}$,
So to put together this formula we'll need the square root of the transpose of our diff_xy array, the inverse of our covariance_xy matrix which we've assigned to inv_covariance_xy, and our original diff_xy array. Examining their shapes below for one final check before performing dot multiplication (remember u•v = v•u and u•(v + w) = u•w + v•w):

In [10]:

print np.transpose(diff_xy[1]).shape
print inv_covariance_xy.shape
print diff_xy.shape

(2,)
(2, 2)
(10, 2)

In [11]:

md = []
for i in range(len(diff_xy)):
    md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),inv_covariance_xy),diff_xy[i])))
md

Out[11]:

[2.3838298236449629,
 0.80974286765570169,
 0.80974286765570169,
 1.2036896265134212,
 1.4936799521601556,
 0.73266444021370714,
 2.1418374874831754,
 1.107401415293582,
 0.5654674357959717,
 0.80974286765570169]

We can wrap all this sequential code up into one function as follows:¶

In [12]:

def MahalanobisDist(x, y):
    covariance_xy = np.cov(x,y, rowvar=0)
    inv_covariance_xy = np.linalg.inv(covariance_xy)
    xy_mean = np.mean(x),np.mean(y)
    x_diff = np.array([x_i - xy_mean[0] for x_i in x])
    y_diff = np.array([y_i - xy_mean[1] for y_i in y])
    diff_xy = np.transpose([x_diff, y_diff])
    
    md = []
    for i in range(len(diff_xy)):
        md.append(np.sqrt(np.dot(np.dot(np.transpose(diff_xy[i]),inv_covariance_xy),diff_xy[i])))
    return md

In [13]:

MahalanobisDist(x,y)

Out[13]:

[2.3838298236449629,
 0.80974286765570169,
 0.80974286765570169,
 1.2036896265134212,
 1.4936799521601556,
 0.73266444021370714,
 2.1418374874831754,
 1.107401415293582,
 0.5654674357959717,
 0.80974286765570169]

In [14]:

def MD_removeOutliers(x, y):
    MD = MahalanobisDist(x, y)
    threshold = np.mean(MD) * 1.5 # adjust 1.5 accordingly 
    nx, ny, outliers = [], [], []
    for i in range(len(MD)):
        if MD[i] <= threshold:
            nx.append(x[i])
            ny.append(y[i])
        else:
            outliers.append(i) # position of removed pair
    return (np.array(nx), np.array(ny), np.array(outliers))

In [15]:

print 'x:', x
print 'y:', y
MD_removeOutliers(x,y)

x: [10  4  4  3  8  5  9  8  5  4]
y: [ 2  4  4  4 10  3 12  4  6  4]

Out[15]:

(array([4, 4, 3, 8, 5, 8, 5, 4]),
 array([ 4,  4,  4, 10,  3,  4,  6,  4]),
 array([0, 6]))

Pairs in position 6 and 8 were removed which were (6,9) where MD = 2.023 and (7,2) where MD = 2.059 Graphing them below:

Plotting in R's ggplot2¶

In [16]:

%load_ext rmagic
#%reload_ext rmagic

In [28]:

%%R -i DF_diff_xy # list object to be transferred to python here
install.packages("ggplot2") # Had to add this for some reason, shouldn't be necessary
library(ggplot2)
df = data.frame(DF_diff_xy)
plot = ggplot(df, aes(x = X, y = Y)) + 
geom_point(alpha = .8, color = 'dodgerblue',size = 5) +
geom_point(data=subset(df, Y >= 6.7 | X >= 4), color = 'red',size = 6) +
theme(axis.text.x = element_text(size= rel(1.5),angle=90, hjust=1)) +
ggtitle('Distance Pairs with outliers highlighted in red')
print(plot)

Plotting in the python ggplot port from yhat¶

My attempt to plot a subset of the points in the yhat port of ggplot was not succesful. View the discussion here: https://github.com/yhat/ggplot/issues/116

In [95]:

import pandas as pd

DF_diff_xy = pd.DataFrame(diff_xy)
DF_diff_xy.rename(columns = lambda x: str(x), inplace=True)
DF_diff_xy.rename(columns={"0": "X"}, inplace=True) # rename a dfcolumn   
DF_diff_xy.rename(columns={"1": "Y"}, inplace=True) # rename a dfcolumn 
DF_diff_xy

Out[95]:

	X	Y
0	4	-3.3
1	-2	-1.3
2	-2	-1.3
3	-3	-1.3
4	2	4.7
5	-1	-2.3
6	3	6.7
7	2	-1.3
8	-1	0.7
9	-2	-1.3

In [96]:

from ggplot import *

ggplot(DF_diff_xy, aes(x = 'X', y ='Y')) + \
    geom_point(alpha=1, size=100, color='dodgerblue') + \
    geom_point(data = DF_diff_xy[:1],alpha=1, size = 100, color='red')    

Out[96]:

<ggplot: (275056345)>

In [94]:

plt.show()