%load_ext watermark
%watermark -v -p numpy -d -u
Last updated: 31/07/2014 CPython 3.4.1 IPython 2.1.0 numpy 1.8.1
More information about the watermark magic command extension.
This is a quick overview of how to deal with missing values (i.e., NaNs, short for "Not-a-Number") in NumPy, and I am happy to expand it over time. There will also be a separate one for pandas at some point!
I would be happy to hear your comments and suggestions. Please feel free to drop me a note via twitter, email, or google+.
Let's assume that we have a CSV file with missing elements like the one shown below.
%%file example.csv
1,2,3,4
5,6,,8
10,11,12,
Writing example.csv
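The np.genfromtxt function treats empty fields as missing and, for the default float dtype, fills them with np.nan. Its missing_values and filling_values parameters let us change which strings count as missing and what replaces them. This allows us to construct a new NumPy ndarray object, even if elements are missing.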
import numpy as np
ary = np.genfromtxt('./example.csv', delimiter=',')
print('%s x %s array:\n' %(ary.shape[0], ary.shape[1]))
print(ary)
3 x 4 array:

[[  1.   2.   3.   4.]
 [  5.   6.  nan   8.]
 [ 10.  11.  12.  nan]]
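If NaN is not the placeholder we want, the fill value can be set at load time. The following is a minimal sketch: it assumes the same three rows as example.csv, read from an in-memory string so that it runs on its own, and the -1 sentinel is an arbitrary choice for illustration.

```python
import io
import numpy as np

# Same content as example.csv, held in memory so the example is self-contained
csv_text = "1,2,3,4\n5,6,,8\n10,11,12,\n"

# filling_values sets what empty fields become (-1 is an arbitrary sentinel)
ary_filled = np.genfromtxt(io.StringIO(csv_text), delimiter=',', filling_values=-1)
print(ary_filled)
```

A sentinel like this only makes sense if -1 can never occur as a real value in the data; otherwise NaN remains the safer marker.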
A handy way to test whether a value is NaN is the np.isnan function.
np.isnan(np.nan)
True
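The reason we need a dedicated function is that NaN never compares equal to anything, not even to itself. A short sketch of the pitfall (the variable name is just for illustration):

```python
import numpy as np

value = np.nan

# An equality test silently fails: NaN is unequal to everything, itself included
print(value == np.nan)

# np.isnan is the reliable check
print(np.isnan(value))
```

The first print shows False and the second shows True, which is why a comparison such as ary == np.nan cannot be used to locate missing values.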
It is especially useful to create boolean masks for the so-called "fancy indexing" of NumPy arrays, which we will come back to later.
np.isnan(ary)
array([[False, False, False, False],
       [False, False,  True, False],
       [False, False, False,  True]], dtype=bool)
In order to find out how many elements are missing in our array, we can combine np.count_nonzero with the np.isnan function that we have seen in the previous section.
np.count_nonzero(np.isnan(ary))
2
If we want to determine the number of non-missing elements, we can simply invert the returned Boolean mask via the handy "tilde" operator.
np.count_nonzero(~np.isnan(ary))
10
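The same Boolean mask can also tell us where the values are missing. As a small sketch (the array is re-created here so that the snippet runs on its own), summing the mask along an axis counts the NaNs per column or per row, since True counts as 1:

```python
import numpy as np

# Same values as the array loaded from example.csv
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

nan_mask = np.isnan(ary)

# Summing a Boolean mask counts its True entries along the given axis
missing_per_column = nan_mask.sum(axis=0)
missing_per_row = nan_mask.sum(axis=1)

print('missing per column:', missing_per_column)
print('missing per row:', missing_per_row)
```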
As we will find out via the following code snippet, we can't use NumPy's regular sum function to calculate the sum of an array that contains NaNs.
np.sum(ary)
nan
Since np.sum propagates NaNs and therefore returns nan, we can use np.nansum instead, which treats NaNs as zero:
print('total sum:', np.nansum(ary))
total sum: 62.0
print('column sums:', np.nansum(ary, axis=0))
column sums: [ 16. 19. 15. 12.]
print('row sums:', np.nansum(ary, axis=1))
row sums: [ 10. 19. 33.]
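NumPy offers similar NaN-aware counterparts for other reductions. The sketch below uses np.nanmean, np.nanmax, and np.nanmin on the same values (re-created here so that the snippet is self-contained); each one ignores the missing entries instead of returning nan.

```python
import numpy as np

# Same values as the array loaded from example.csv
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Column means computed over the non-missing entries only
col_means = np.nanmean(ary, axis=0)

# Largest and smallest non-missing values in the whole array
overall_max = np.nanmax(ary)
overall_min = np.nanmin(ary)

print('column means:', col_means)
print('max:', overall_max, 'min:', overall_min)
```

Note that the third and fourth column means are averages over two values rather than three, because one entry is missing in each.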
Here, we will use the Boolean mask again to return only those rows that DON'T contain missing values. If we want to get only the rows that do contain NaNs, we can simply drop the ~.
ary[~np.isnan(ary).any(axis=1)]
array([[ 1., 2., 3., 4.]])
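The same idea works along the other axis. As a sketch (again with the array re-created for a self-contained example), reducing the mask over axis=0 selects columns instead of rows, and leaving out the ~ keeps the rows that do contain NaNs:

```python
import numpy as np

# Same values as the array loaded from example.csv
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

nan_mask = np.isnan(ary)

# Rows that contain at least one NaN (no ~ in front of the mask)
rows_with_nans = ary[nan_mask.any(axis=1)]

# Columns that are completely free of NaNs
cols_without_nans = ary[:, ~nan_mask.any(axis=0)]

print(rows_with_nans)
print(cols_without_nans)
```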
Certain operations, algorithms, and other analyses might not work with NaN values in our data array. But that's not a problem: we can use the convenient np.nan_to_num function, which converts NaNs to the value 0 (it also replaces infinities with very large finite numbers).
ary0 = np.nan_to_num(ary)
ary0
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,   0.,   8.],
       [ 10.,  11.,  12.,   0.]])
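Zero is not always a sensible replacement. One possible alternative, sketched below under the assumption that a column average is a reasonable stand-in for our data, is to combine np.where with np.nanmean so that each NaN is replaced by the mean of its column. The array is re-created so that the snippet runs on its own.

```python
import numpy as np

# Same values as the array loaded from example.csv
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Mean of each column, ignoring NaNs; shape (4,) broadcasts across the rows
col_means = np.nanmean(ary, axis=0)

# Take the column mean where the value is missing, the original value otherwise
ary_imputed = np.where(np.isnan(ary), col_means, ary)
print(ary_imputed)
```

Unlike an in-place assignment, np.where returns a new array and leaves ary untouched.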
Vice versa, we can also convert any number to np.nan. Here, we use the array that we created in the previous section and convert the 0s back to np.nan. Note that this also replaces any genuine zeros in the data.
ary0[ary0==0] = np.nan
ary0
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
This one is a little trickier. We can remove missing values via a combination of the Boolean mask and fancy indexing. However, this has the disadvantage that it flattens our array (we can't just punch holes into a NumPy array).
ary[~np.isnan(ary)]
array([ 1., 2., 3., 4., 5., 6., 8., 10., 11., 12.])
Thus, this method works better on individual rows:
x = np.array([1,2,np.nan])
x[~np.isnan(x)]
array([ 1., 2.])
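To apply this row by row to a whole 2D array, one option is a plain list comprehension. This is only a sketch: the result is a Python list of 1D arrays with different lengths, not a rectangular ndarray, and the array is re-created so that the snippet is self-contained.

```python
import numpy as np

# Same values as the array loaded from example.csv
ary = np.array([[1., 2., 3., 4.],
                [5., 6., np.nan, 8.],
                [10., 11., 12., np.nan]])

# Filter each row separately; rows may end up with different lengths
rows_without_nans = [row[~np.isnan(row)] for row in ary]

for row in rows_without_nans:
    print(row)
```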