In [1]:
%load_ext watermark
In [2]:
%watermark -v -p numpy -d -u
Last updated: 31/07/2014 

CPython 3.4.1
IPython 2.1.0

numpy 1.8.1

More information about the `watermark` magic command extension.

Quick guide for dealing with missing numbers in NumPy

This is a quick overview of how to deal with missing values (i.e., "NaN"s, for "Not-a-Number") in NumPy, and I am happy to expand it over time. There will also be a separate guide for pandas at some point!

I would be happy to hear your comments and suggestions. Please feel free to drop me a note via Twitter, email, or Google+.


Sample data from a CSV file

Let's assume that we have a CSV file with missing elements like the one shown below.

In [3]:
%%file example.csv
Writing example.csv

The np.genfromtxt function has a missing_values parameter, and by default missing entries (such as empty fields) are translated into np.nan objects. This allows us to construct a new NumPy ndarray object even if elements are missing.

In [4]:
import numpy as np
ary = np.genfromtxt('./example.csv', delimiter=',')

print('%s x %s array:\n' % (ary.shape[0], ary.shape[1]))
print(ary)
3 x 4 array:

[[  1.   2.   3.   4.]
 [  5.   6.  nan   8.]
 [ 10.  11.  12.  nan]]
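
The contents of example.csv are not reproduced above. As a minimal sketch, the snippet below assumes a file in which the missing elements are simply empty fields (the `csv_text` string is an assumption, chosen to be consistent with the printed array) and reads it from an in-memory buffer instead of from disk.

```python
import io
import numpy as np

# Assumed file contents (not shown in the original): empty fields mark the gaps.
csv_text = "1,2,3,4\n5,6,,8\n10,11,12,\n"

# genfromtxt accepts file-like objects, so StringIO stands in for the file here.
ary = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(ary.shape)
print(ary)
```

Both the empty field in the middle of the second row and the trailing empty field in the third row should end up as np.nan.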

Determining if a value is missing

A handy way to test whether a value is NaN is the np.isnan function.
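
A minimal sketch of np.isnan on single values; the variable name `value` is only for illustration. It also shows why a plain equality check is not an alternative: NaN never compares equal to anything, including itself.

```python
import numpy as np

value = np.nan

print(np.isnan(value))   # True
print(np.isnan(1.0))     # False
print(value == np.nan)   # False: NaN never compares equal, even to itself
```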

In [5]:

It is especially useful to create boolean masks for the so-called "fancy indexing" of NumPy arrays, which we will come back to later.

In [6]:
np.isnan(ary)
array([[False, False, False, False],
       [False, False,  True, False],
       [False, False, False,  True]], dtype=bool)

Counting the number of missing values

In order to find out how many elements are missing in our array, we can use the np.isnan function that we have seen in the previous section.
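
One way to do the counting, sketched under the assumption that `ary` holds the same values as the array printed earlier (it is rebuilt here so the snippet runs on its own). Since True counts as 1, counting the non-zero entries of the mask gives the number of NaNs.

```python
import numpy as np

# Stand-in for the array loaded from the CSV file.
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Total number of NaNs in the array.
n_missing = np.count_nonzero(np.isnan(ary))
print('number of missing values:', n_missing)

# The same idea per column: sum the Boolean mask along axis 0.
missing_per_column = np.isnan(ary).sum(axis=0)
print('missing per column:', missing_per_column)
```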

In [7]:

If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask via the handy "tilde" operator (~).
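
A short sketch of the inverted mask, again with `ary` rebuilt from the printed values so that the snippet is self-contained:

```python
import numpy as np

# Stand-in for the array loaded from the CSV file.
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# ~ flips True and False, so the mask now marks the elements that are present.
n_present = np.count_nonzero(~np.isnan(ary))
print('number of non-missing values:', n_present)
```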

In [8]:

Calculating the sum of an array that contains NaNs

NumPy's regular sum function is of little use on an array that contains NaNs: a single NaN propagates through the addition, so the result is nan as well.
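
A minimal sketch of that behavior, with `ary` rebuilt from the printed values. Note that np.sum does not raise an exception; it just returns nan.

```python
import numpy as np

# Stand-in for the array loaded from the CSV file.
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

total = np.sum(ary)
print('total sum:', total)

# Only the first row is free of NaNs, so only its sum is a regular number.
row_sums = np.sum(ary, axis=1)
print('row sums:', row_sums)
```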

In [9]:

Since np.sum only returns nan here, use np.nansum instead, which treats NaNs as zero:

In [10]:
print('total sum:', np.nansum(ary))
total sum: 62.0
In [11]:
print('column sums:', np.nansum(ary, axis=0))
column sums: [ 16.  19.  15.  12.]
In [12]:
print('row sums:', np.nansum(ary, axis=1))
row sums: [ 10.  19.  33.]

Removing all rows that contain missing values

Here, we will use the Boolean mask again to return only those rows that DON'T contain missing values. And if we want to get only the rows that contain NaNs, we could simply drop the ~.

In [14]:
ary[~np.isnan(ary).any(axis=1)]
array([[ 1.,  2.,  3.,  4.]])
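
The row selection can be sketched in two steps, which also covers the variant without the ~. The names `nan_rows`, `complete_rows`, and `rows_with_nan` are only for illustration, and `ary` is rebuilt from the printed values.

```python
import numpy as np

# Stand-in for the array loaded from the CSV file.
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# True for every row that contains at least one NaN.
nan_rows = np.isnan(ary).any(axis=1)

complete_rows = ary[~nan_rows]   # rows without any NaN
rows_with_nan = ary[nan_rows]    # dropping the ~ selects the other rows

print(complete_rows)
print(rows_with_nan)
```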

Converting missing values to 0

Certain operations, algorithms, and other analyses might not work with NaN objects in our data array. But that's not a problem: we can use the convenient np.nan_to_num function, which converts NaNs to the value 0.

In [15]:
ary0 = np.nan_to_num(ary)
ary0
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   0.,   8.],
       [ 10.,  11.,  12.,   0.]])

Converting certain numbers to NaN

Conversely, we can also convert any number to np.nan. Here, we use the array that we created in the previous section and convert the 0s back to np.nan objects.

In [16]:
ary0[ary0 == 0] = np.nan
ary0
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
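
One caveat worth keeping in mind: the round trip only restores the original array if it contained no genuine zeros. The small sketch below uses a made-up array (`data` is hypothetical, not taken from the CSV example) to show that a real 0 is converted to NaN as well.

```python
import numpy as np

# Hypothetical data with one genuine zero and one missing value.
data = np.array([0., 1., np.nan])

filled = np.nan_to_num(data)   # the NaN becomes 0; nan_to_num returns a copy
filled[filled == 0] = np.nan   # both zeros are now turned into NaN

print(filled)
```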

Removing all missing elements from an array

This one is a little bit trickier. We can remove missing values via a combination of the Boolean mask and fancy indexing; however, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array).

In [17]:
ary[~np.isnan(ary)]
array([  1.,   2.,   3.,   4.,   5.,   6.,   8.,  10.,  11.,  12.])

Thus, this method works better on individual rows:

In [21]:
x = np.array([1, 2, np.nan])
x[~np.isnan(x)]

array([ 1.,  2.])
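
Applied to the 2D example, the row-wise approach could look like the sketch below. Because the cleaned rows have different lengths, they are collected in a plain Python list rather than in a single ndarray; `ary` is rebuilt from the printed values and `rows` is just an illustrative name.

```python
import numpy as np

# Stand-in for the array loaded from the CSV file.
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Filter each row separately; the result is a list of 1D arrays.
rows = [row[~np.isnan(row)] for row in ary]

for row in rows:
    print(row)
```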