In [1]:

%load_ext watermark

In [2]:

%watermark -v -p numpy -d -u

Last updated: 31/07/2014 

CPython 3.4.1
IPython 2.1.0

numpy 1.8.1

More information about the watermark magic command extension.

Quick guide for dealing with missing numbers in NumPy¶

This is a quick overview of how to deal with missing values (i.e., NaNs, short for "Not-a-Number") in NumPy, and I am happy to expand it over time. There will also be a separate guide for pandas at some point.

I would be happy to hear your comments and suggestions. Please feel free to drop me a note via twitter, email, or google+.

Sections¶

Sample data from a CSV file
Determining if a value is missing
Counting the number of missing values
Calculating the sum of an array that contains NaNs
Removing all rows that contain missing values
Convert missing values to 0
Converting certain numbers to NaN
Remove all missing elements from an array

Sample data from a CSV file¶

[back to top]

Let's assume that we have a CSV file with missing elements like the one shown below.

In [3]:

%%file example.csv
1,2,3,4
5,6,,8
10,11,12,

Writing example.csv

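By default, the np.genfromtxt function translates empty fields into np.nan objects when it reads floating-point data; its missing_values and filling_values parameters control which strings count as missing and what replaces them. This allows us to construct a new NumPy ndarray object, even if elements are missing.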

In [4]:

import numpy as np
ary = np.genfromtxt('./example.csv', delimiter=',')

print('%s x %s array:\n' %(ary.shape[0], ary.shape[1]))
print(ary)

3 x 4 array:

[[  1.   2.   3.   4.]
 [  5.   6.  nan   8.]
 [ 10.  11.  12.  nan]]
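If NaN is not the placeholder we want, genfromtxt can substitute a different value while reading. The following is a minimal sketch, assuming the same three lines as example.csv are supplied through an in-memory io.StringIO buffer, and that -1 (an arbitrary choice for illustration) is an acceptable stand-in for empty fields:

```python
import io
import numpy as np

# Same content as example.csv, kept in memory so the snippet is self-contained
csv_text = "1,2,3,4\n5,6,,8\n10,11,12,\n"

# Default behavior: empty fields become np.nan
ary_nan = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

# filling_values replaces empty fields with a value of our choice
# (-1 is an arbitrary placeholder picked for this illustration)
ary_filled = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                           filling_values=-1)

print(ary_nan)
print(ary_filled)
```

Keeping NaN is usually the safer default, since a numeric placeholder such as -1 can silently distort sums and means later on.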

Determining if a value is missing¶

[back to top]

A handy way to test whether a value is NaN is the np.isnan function.

In [5]:

np.isnan(np.nan)

Out[5]:

True

Applied to an array, it works element-wise, which makes it especially useful for creating Boolean masks for the so-called "fancy indexing" of NumPy arrays; we will come back to this later.

In [6]:

np.isnan(ary)

Out[6]:

array([[False, False, False, False],
       [False, False,  True, False],
       [False, False, False,  True]], dtype=bool)
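One reason to reach for np.isnan rather than a plain comparison is that NaN never compares equal to anything, not even to itself. The sketch below rebuilds the sample array by hand (so that it runs on its own) and also uses np.argwhere to list where the missing values sit; the variable names are only illustrative.

```python
import numpy as np

# The sample array from above, rebuilt by hand
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# NaN never compares equal to anything, including itself,
# so an equality test finds no missing values at all
eq_mask = (ary == np.nan)

# np.isnan gives the correct mask; np.argwhere lists the
# (row, column) index of every True entry
nan_positions = np.argwhere(np.isnan(ary))

print(eq_mask.any())
print(nan_positions)
```

Here the equality mask contains no True entries, while nan_positions reports the two missing cells at row 1, column 2 and row 2, column 3.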

Counting the number of missing values¶

[back to top]

To find out how many elements are missing in our array, we can combine the np.isnan function from the previous section with np.count_nonzero, which counts the True entries of the mask.

In [7]:

np.count_nonzero(np.isnan(ary))

Out[7]:

2

If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask with the handy ~ ("tilde") operator.

In [8]:

np.count_nonzero(~np.isnan(ary))

Out[8]:

10
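The same mask can also be counted per column or per row, which is often more informative than a single total. This is a small sketch on a hand-built copy of the sample array; it relies on the fact that True counts as 1 when a Boolean array is summed.

```python
import numpy as np

# Hand-built copy of the sample array
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

nan_mask = np.isnan(ary)

# Summing a Boolean mask counts its True entries
nan_per_column = nan_mask.sum(axis=0)  # one count per column
nan_per_row = nan_mask.sum(axis=1)     # one count per row

print('missing per column:', nan_per_column)
print('missing per row:', nan_per_row)
```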

Calculating the sum of an array that contains `NaN`s¶

[back to top]

As we will find out via the following code snippet, we can't use NumPy's regular sum function to calculate the sum of an array that contains NaNs: a single NaN turns the whole result into NaN.

In [9]:

np.sum(ary)

Out[9]:

nan

Since np.sum propagates NaNs, use np.nansum instead, which treats NaNs as zero:

In [10]:

print('total sum:', np.nansum(ary))

total sum: 62.0

In [11]:

print('column sums:', np.nansum(ary, axis=0))

column sums: [ 16.  19.  15.  12.]

In [12]:

print('row sums:', np.nansum(ary, axis=1))

row sums: [ 10.  19.  33.]
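np.nansum has NaN-aware siblings for other reductions, such as np.nanmean, np.nanmin, and np.nanmax. They are not covered in the text above, so treat the following as a brief sketch on a hand-built copy of the sample array. Unlike np.nansum, np.nanmean ignores the missing entries entirely rather than counting them as zeros.

```python
import numpy as np

# Hand-built copy of the sample array
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# The mean is taken over the 10 non-missing values only
overall_mean = np.nanmean(ary)

# Per-column means; each column is averaged over its own non-NaN entries
column_means = np.nanmean(ary, axis=0)

# Smallest and largest non-missing values
smallest, largest = np.nanmin(ary), np.nanmax(ary)

print('mean:', overall_mean)
print('column means:', column_means)
print('min/max:', smallest, largest)
```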

Removing all rows that contain missing values¶

[back to top]

Here, we will use the Boolean mask again to return only those rows that don't contain missing values: the any call along axis 1 flags every row with at least one NaN, and ~ inverts that selection. If we want only the rows that contain NaNs, we can simply drop the ~.

In [14]:

ary[~np.isnan(ary).any(axis=1)]

Out[14]:

array([[ 1.,  2.,  3.,  4.]])
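To make the variations concrete, here is a short sketch on a hand-built copy of the sample array. It selects the rows that do contain NaNs (the mask without the ~) and, as an assumed extension not discussed above, drops incomplete columns by reducing along axis 0 instead.

```python
import numpy as np

# Hand-built copy of the sample array
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# True for every row that has at least one NaN
row_has_nan = np.isnan(ary).any(axis=1)

rows_with_nan = ary[row_has_nan]    # only the incomplete rows
complete_rows = ary[~row_has_nan]   # only the complete rows

# Same idea along the other axis: keep columns without any NaN
complete_columns = ary[:, ~np.isnan(ary).any(axis=0)]

print(rows_with_nan)
print(complete_columns)
```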

Convert missing values to 0¶

[back to top]

Certain operations, algorithms, and other analyses might not work with NaN objects in our data array. But that's not a problem: we can use the convenient np.nan_to_num function, which converts NaNs to the value 0.

In [15]:

ary0 = np.nan_to_num(ary)
ary0

Out[15]:

array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   0.,   8.],
       [ 10.,  11.,  12.,   0.]])
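Zero is not always a sensible replacement. One possible alternative, shown here only as a sketch and not as part of the original recipe, is to fill each gap with the mean of its column by combining np.where with np.nanmean:

```python
import numpy as np

# Hand-built copy of the sample array
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Mean of each column, ignoring NaNs
column_means = np.nanmean(ary, axis=0)

# Where the mask is True take the column mean, otherwise keep the value;
# column_means (shape (4,)) is broadcast across the rows of ary
ary_filled = np.where(np.isnan(ary), column_means, ary)

print(ary_filled)
```

With this data, the gap in the third column becomes 7.5 (the mean of 3 and 12) and the gap in the fourth column becomes 6.0 (the mean of 4 and 8). The original ary is left unchanged.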

Converting certain numbers to NaN¶

[back to top]

Conversely, we can also convert any number to np.nan. Here, we use the array that we created in the previous section and convert the 0s back to np.nan objects.

In [16]:

ary0[ary0==0] = np.nan
ary0

Out[16]:

array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
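One caveat is that NaN is a floating-point value, so an integer array has to be converted to float before NaNs can be stored in it. The sketch below assumes a hypothetical sentinel value of -999 marking missing entries; both the sentinel and the data are made up for illustration.

```python
import numpy as np

# Hypothetical integer data in which -999 marks a missing entry
raw = np.array([[1, -999, 3],
                [4, 5, -999]])

# Integer arrays cannot hold NaN, so work on a float copy
cleaned = raw.astype(float)

# Replace every sentinel with NaN via a Boolean mask
cleaned[cleaned == -999] = np.nan

print(cleaned)
```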

Remove all missing elements from an array¶

[back to top]

This one is a little bit trickier. We can remove missing values via a combination of the Boolean mask and fancy indexing. However, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array).

In [17]:

ary[~np.isnan(ary)]

Out[17]:

array([  1.,   2.,   3.,   4.,   5.,   6.,   8.,  10.,  11.,  12.])

Thus, this method works better on individual rows:

In [21]:

x = np.array([1,2,np.nan])

x[~np.isnan(x)]

Out[21]:

array([ 1.,  2.])
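Applied row by row to the 2D sample array, the same idea yields rows of different lengths, which no longer fit into a rectangular array. A minimal sketch, assuming a plain Python list of 1D arrays is an acceptable result:

```python
import numpy as np

# Hand-built copy of the sample array
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Filter each row separately; the rows end up with different lengths,
# so they are collected in a list rather than a 2D array
rows_without_nan = [row[~np.isnan(row)] for row in ary]

for row in rows_without_nan:
    print(row)
```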
