Data frames are a great choice when your data can be organized as a table (i.e., laid out in rows and columns), but what if your data are naturally multidimensional? What if you need to perform serious number crunching — matrix algebra, signal, or image processing — in order to get to a result? For these needs the NumPy package provides a general data type, the n-dimensional array or ndarray
. This data type is so powerful that it forms the basis for most serious numerical processing in Python, including Pandas, SciPy, and image processing packages like scikit-image.
In this lesson, we'll get you used to working with arrays: how to learn about them, select sets of data within them, and plot them. Again, we'll only cover the basics, but the NumPy documentation is thorough and easy to search on the web.
Lists are one of the fundamental data types in Python. A list is, well, a list of objects:
a = 1.5
b = 'spam'
mylist = [2, 7, a, 'eggs', b]
print(mylist)
[2, 7, 1.5, 'eggs', 'spam']
Lists are defined with using square brackets, and can contain variables of any type, separated by commas.
Again, elements of a list can be anything even other lists:
another_list = ['pooh', 'piglet']
mylist.append(another_list) # we can use this to append to a list (in-place)
print(mylist)
print(len(mylist))
mylist = mylist + [1, 1.7, -2] # concatenation is easy!
print(mylist)
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet']] 6 [2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2]
We can access the elements of a list using square brackets, just as we did with Pandas. Python is a very consistent language, so the pattern "get an item using square brackets" will show up over and over again.
print(mylist[0]) # first element
print(mylist[2]) # third element
print(mylist[-1]) # last element
print(mylist[-2]) # penultimate element
2 1.5 -2 1.7
We can also use a technique called slicing to get subsequences from the list:
print(mylist[:]) # all elements
print(mylist[1:3]) # elements >= 1 and < 3 (note that element 3 is *not* included)
print(mylist[:-1]) # all but last element
print(mylist[3:]) # elements 3 to end
print(mylist[::2]) # every other element
[2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2] [7, 1.5] [2, 7, 1.5, 'eggs', 'spam', ['pooh', 'piglet'], 1, 1.7] ['eggs', 'spam', ['pooh', 'piglet'], 1, 1.7, -2] [2, 1.5, 'spam', 1, -2]
Note that we can use a slice object inside the brackets. This object is of the form start:stop:step
. Note that the start
element is included, but the stop
element is not. Any of these arguments can be omitted, in which case
start
is assumed to be 0stop
is assumed to be len(mylist)
step
is assumed to be 1In cases where we have a matrix of data we want to represent, a natural way to do that is as a list of lists:
mydata = [[1, 2, 3], [4, 5, 6]]
print(mydata)
print(len(mydata))
[[1, 2, 3], [4, 5, 6]] 2
This is a nested list. It has two elements:
mydata[0]
[1, 2, 3]
and we can get individual data by accessing the list inside a list
print(mydata[0][1])
2
We can think of this in analogy with a 2 x 3 matrix, where the first index gives the row, and the second index gives the column.
In this case, mydata[0][1]
instructs us to get the 0th element of mydata
(the first row), and the 1st element from that (the second column).
However, there are some drawbacks for using lists to represent large arrays of data:
mydata[0][1]
does, but for arrays with many dimensions, we might like to write something simpler, like mydata[0, 1]
. Doing this with lists gives an error.The NumPy package resolves these issues by defining the ndarray
...
If we already have a list of lists, NumPy gives us a clear way to convert it to an array:
import numpy as np # this nickname (numpy = np) is so common as to be ubiquitous
myarray = np.array(mydata)
print(myarray)
print(type(myarray))
[[1 2 3] [4 5 6]] <class 'numpy.ndarray'>
Note that the myarray
variable is a numpy.ndarray
, and that it prints a little prettier than the list of lists, making clear that it's a 2 x 3 matrix. In fact, just like with Pandas, we can find out the dimensions of an array with the shape
attribute (note again the consistency of syntax across data types in Python):
print(myarray.shape)
(2, 3)
We can use some of the same syntax as before:
print(myarray[0])
print(myarray[0].shape)
[1 2 3] (3,)
Just as before, the 0th element is the first row. In this case, however, the 0th element is an array of shape (3,)
, a vector of length 3. We will see that selecting rows or columns from an array returns a view of that array, with dimensions correspondingly reduced.
Along the same lines:
print(myarray[1])
print(myarray[1][0])
print(myarray[1][1:])
[4 5 6] 4 [5 6]
But we can now use a more natural notation:
print(myarray[1, 0])
print(myarray[1, 1:])
print(myarray[:, 0])
4 [5 6] [1 4]
And this is just the tip of the iceberg. You can use slicing along any dimension of an array, arrays to index into arrays, and much more. See the SciPy docs for details.
But let's move on to a real example:
Get the data from here and place it in a folder called data
in your working directory.
These data are in .npy
format, which is NumPy's native array format. Numpy is capable of saving arrays in compressed format (.gz
) or in plain text (.txt
) if arrays are 2-dimensional. Bear in mind that NumPy and SciPy can also be used to read in data from other common formats including those used by Matlab.
Now let's load the data
arr = np.load('data/arraydata.npy')
Well, that was easy!
Let's see what we can learn about this variable we just loaded data into:
print(type(arr))
print(arr.shape)
<class 'numpy.ndarray'> (91, 109, 91)
We see that this type of object is a numpy.ndarray
, as expected, and that its shape is given by three numbers. That is, it's a three-dimensional array!
What does it look like?
print(arr)
[[[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]] [[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]] [[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]] ..., [[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]] [[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]] [[0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] ..., [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0] [0 0 0 ..., 0 0 0]]]
That probably doesn't seem very helpful. In general, with large arrays, we simply can't look at all the data. As usual, the way we will address this is by plotting. However, we can't plot all the dimensions of the array at once...
import matplotlib.pyplot as plt # this is also common; pyplot contains most of the common functions
%matplotlib inline
So we would like to plot our data (to see it better), but first, we need to select a subset of the data we can actually plot. The way we will do this is by taking slices through the data. For instance, if we do
slc = arr[64, :, :]
print(slc.shape)
(109, 91)
Just as before, we've asked for position 64 along axis 0, with all elements in axes 1 and 2. Just as with a coordinate grid, fixing one coordinate has left us two coordinates unspecified, so the result is a 2d array.
To view this array, we can think of plotting it as an image, where the value of the array entry becomes an image intensity value:
# the matplotlib functiont to show a matrix is matshow
plt.matshow(slc)
<matplotlib.image.AxesImage at 0x11283b4e0>
Hmmmm. What if we instead plotted this as an image? (matshow
is mostly for looking at matrices. The following command is better for visualizing data that are a gridded version of smooth information, like an image. You may often want to try this both ways.)
plt.imshow(slc, cmap='gray') # cmap is for colormap
<matplotlib.image.AxesImage at 0x112dc02b0>
Let's take another slice, this time letting axes 0 and 1 vary:
plt.imshow(arr[:, 48, :], cmap='gray')
<matplotlib.image.AxesImage at 0x112eeef98>
And one more...
plt.imshow(arr[:, :, 37], cmap='gray')
<matplotlib.image.AxesImage at 0x112fca5c0>
So the array in question is actually a 3d image! And by fixing one of the coordinate dimensions of the array, we have reduced the 3d image to a 2d image. Note also a few important conventions:
matshow
and imshow
follow the matrix convention, in which the first index is for rows and the second index is for columns. This is typically what we want: the rows of the matrix become the rows of the image.The orbitofrontal cortex is an area of the brain known to be involved in a large number of cognitive functions, particularly related to reward and decision making. Suppose you perform a study and find a large BOLD (blood oxygenation level-dependent) signal response at coordinates (50, 85, 25).
Plot three sections through the brain that contain this point, showing the point with a bright red dot in each one.
Hints:
plt.figure(figsize=(15, 15))
before you plot anything. This will make the figure size bigger and easier to see.
subplot
command (example) to arrange the three plots into one. You will need to specify how many rows your images should be arranged into, how many columns, and which image you want (note that, as opposed to typical Python, these indices start at 1).plt.scatter
(example) to place a single dot on each image.color
and s
(size) arguments.cc = (50, 85, 25)
plt.figure(figsize=(15, 15))
plt.subplot(1, 3, 1)
plt.imshow(arr[cc[0], :, :], cmap='gray')
plt.scatter(cc[2], cc[1], color='r', s=30)
plt.subplot(1, 3, 2)
plt.imshow(arr[:, cc[1], :], cmap='gray')
plt.scatter(cc[2], cc[0], color='r', s=30)
plt.subplot(1, 3, 3)
plt.imshow(arr[:, :, cc[2]], cmap='gray')
plt.scatter(cc[1], cc[0], color='r', s=30)
<matplotlib.collections.PathCollection at 0x1131d3d30>
The benefit of storing data in arrays is not merely that it makes organizing multidimensional data easier. DataFrames have many more advantages in this way and automate many powerful transformations. Instead, arrays are primarily useful to us because they make numerical operations easier.
Consider, for example, the following array
aa = np.array([[5, -7, 3.4], [6, -1, -1]])
print(aa)
[[ 5. -7. 3.4] [ 6. -1. -1. ]]
If we write something like
aa + 1.1
we typically mean that we want to add 1.1 to each entry, and this is exactly what NumPy does:
aa + 1.1
array([[ 6.1, -5.9, 4.5], [ 7.1, 0.1, 0.1]])
But NumPy is much smarter than that. If we creat two more arrays
bb = np.array([3, -3])
cc = np.array([7, 4, 0])
We might want to add them, but addition is typically only defined for matrices of the same shape:
print(aa.shape)
print(bb.shape)
(2, 3) (2,)
Now what we could want in this case is for NumPy to treat bb
as having shape (2, 1) (a column vector) and simply add this column to each column of aa
. This is mathematically equivalent to creating a new array of shape (2, 3) with one copy of bb
in each column, which is then the same size as aa
.
That is, we could do
# make a matrix by repeating the bb column twice
bbb = np.c_[bb, bb, bb] # concatenate along columns
print(bbb)
print(bbb.shape)
print(bbb + aa)
[[ 3 3 3] [-3 -3 -3]] (2, 3) [[ 8. -4. 6.4] [ 3. -4. -4. ]]
But clearly, this is difficult to type out if aa
is of shape (2, 5000) and bb
is of shape (2,). If we ask NumPy to multiply these things, it should make some reasonable guesses as to how to expand the dimensions of the arrays in order to do this and complain if it can't.
This feature of NumPy goes under the name of broadcasting. There are very detailed rules about how NumPy makes the shapes of different arrays match up, which are detailed here. In the simplest cases this, happens like so:
print(bb.shape)
print(bb[:, np.newaxis].shape)
print(aa + bb[:, np.newaxis])
(2,) (2, 1) [[ 8. -4. 6.4] [ 3. -4. -4. ]]
Basically, we use the np.newaxis
keyword to add an additional dimension (of size 1) to bb
in the position we choose, making bb
of shape (2, 1), and then NumPy sees that aa
and bb
share a first dimension and makes a good guess about how to broadcast them together.
Similarly, the following also works:
print(aa + cc[np.newaxis, :])
[[ 12. -3. 3.4] [ 13. 3. -1. ]]
Clearly, another thing we might want to do to our arrays is change their entries. In the case of changing single entries, this is pretty easy to do:
print(aa)
aa[0][1] = 5
print('-----------------')
print(aa)
[[ 5. -7. 3.4] [ 6. -1. -1. ]] ----------------- [[ 5. 5. 3.4] [ 6. -1. -1. ]]
But we can assign slices as well:
print(aa)
aa[1] = 0.1
print('-----------------')
print(aa)
[[ 5. 5. 3.4] [ 6. -1. -1. ]] ----------------- [[ 5. 5. 3.4] [ 0.1 0.1 0.1]]
print(aa)
aa[1] = [1, 2, 3]
print('-----------------')
print(aa)
[[ 5. 5. 3.4] [ 0.1 0.1 0.1]] ----------------- [[ 5. 5. 3.4] [ 1. 2. 3. ]]
Note that in the latter case, we've actually assigned a list, not an array, to the slice aa[1]
, but NumPy is smart enough to convert it to an array before assigning it.
Important: Notice how each time we perform an assignment above, we altered the original array, aa
. This is almost always what we want. Often, our arrays are very large, and it costs less memory to only have one copy of a giant array, with different variables referring to slices of that array (like addresses). In NumPy parlance, we say that indexing or slicing returns a view of the array, not a copy.
Be careful this distinction doesn't bite you. For instance
aaa = aa[0]
print(aa)
print('-----------------')
print(aaa)
print('-----------------')
aaa[2] = -100
print(aa)
print('-----------------')
print(aaa)
[[ 5. 5. 3.4] [ 1. 2. 3. ]] ----------------- [ 5. 5. 3.4] ----------------- [[ 5. 5. -100.] [ 1. 2. 3.]] ----------------- [ 5. 5. -100.]
Here, we simply assigned the name aaa
to a view of aa
; both variables still pointed to the same place in memory, so when we changed its value, it changed for both variables.
Often, this is what we want. When we don't it's best to make a copy:
aaa = aa[0].copy()
print(aa)
print('-----------------')
print(aaa)
print('-----------------')
aaa[2] = 501
print(aa)
print('-----------------')
print(aaa)
[[ 5. 5. -100.] [ 1. 2. 3.]] ----------------- [ 5. 5. -100.] ----------------- [[ 5. 5. -100.] [ 1. 2. 3.]] ----------------- [ 5. 5. 501.]
Finally, there are times when you would like to change which dimension corresponds to rows and which to columns (for instance, flipping an image around the diagonal). In the 2d case, this is equivalent to matrix transposition:
print(aa)
print('-----------------')
print(aa.T)
print('-----------------')
print(aa.shape, aa.T.shape)
[[ 5. 5. -100.] [ 1. 2. 3.]] ----------------- [[ 5. 1.] [ 5. 2.] [-100. 3.]] ----------------- (2, 3) (3, 2)
For a matrix with more than two dimensions, transposition reverses the list of dimensions, so that the last becomes first, first becomes last, etc. You can also specify a custom reordering (see here).
randarr = np.random.rand(5, 3, 7, 2, 6) # a 5 x 3 x 7 array of random numbers
print(randarr.shape, randarr.T.shape)
(5, 3, 7, 2, 6) (6, 2, 7, 3, 5)
On computers, the simplest type of color image encoding is an M x N x 3 array. This represents an M x N image where each of the final dimensions represents a unique color channel (red, green, and blue). Matplotlib understands this, and will handle this automatically.
Download any png file from the internet and save it to your working directory. (You may try using a .jpg
or other file formats, but they are not guaranteed to work.) Then use imread
and imshow
to display it in the notebook like so:
img = plt.imread('marmoset.jpg')
print(img.shape)
plt.imshow(img)
(153, 128, 3)
<matplotlib.image.AxesImage at 0x113fc4b70>
Now use what you've learned about arrays to adjust the red-green-blue color balance of the image.
img2 = img.copy()
img2[:, :, 2] = 0
plt.imshow(img2)
<matplotlib.image.AxesImage at 0x11402f4a8>
img3 = img
img3[:, :, 0] = img3[:, :, 0] * 1.3
plt.imshow(img3)
<matplotlib.image.AxesImage at 0x11420e908>
plt.imshow(img)
<matplotlib.image.AxesImage at 0x1142e3d30>
If you're interested in learning more about images in Matplotlib, you can read the tutorial here.
NumPy is a powerful library, with incredibly flexible syntax, but for beginners, the code can be a little cryptic. Some suggested reading:
If you analyze data for any amount of time, you will have to deal with missing data. Sometimes data are lost — subjects withdraw, files are lost, or a given condition was not run for a particular session — and in that case, we would like a way to indicate that a matrix of data has missing entries.
Other times, we have data, but some quantity we calculate turns out to be ill-defined. For instance, we might want to standardize our data, subtracting the mean and dividing by the standard deviation (this results in data with mean 0 and standard deviation 1 — a convenient thing to have).
But what if we only have one data point? In this case the standard deviation of that data set is 0, and division by 0 is undefined.
Clearly these are different types of situations. In the R language, the first case of missing data is coded NA
(not available), while the second case is coded NaN
(not a number).
Python handles the second case in exactly the same way (NaN
is an international standard), whereas the second case is handled with masking, in the form of NumPy's masked arrays.
print(np.array([1., np.inf, -7, 0]) / 0.0)
print(np.inf / np.inf)
[ inf inf -inf nan] nan
/Users/jmxp/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: divide by zero encountered in true_divide if __name__ == '__main__': /Users/jmxp/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: RuntimeWarning: invalid value encountered in true_divide if __name__ == '__main__':
Note that some constants like $\infty$ and NaN
are accessible within the NumPy package as np.inf
, np.nan
.
# we can test for nan:
myarr = np.array([1, -3, np.nan, 8, np.nan])
np.isnan(myarr)
array([False, False, True, False, True], dtype=bool)
We can use a boolean (True/False) array to index into a data array of the same size. When the index array is true, we pull out the corresponding elements of the data array:
nan_arr = np.isnan(myarr)
# get entries that are not NaN (~ is negation)
myarr[~nan_arr]
array([ 1., -3., 8.])
Something that often happens is that we want to ignore NaN
s in our data. Perhaps we want to take a mean that does not take into account any of these undefined entries:
print(myarr.mean())
print(np.nanmean(myarr))
nan 2.0
In addition, Matplotlib ignores NaN
entries when plotting:
xx = np.linspace(0, 10, 500) # make 500 uniformly spaced points between 0 and 10
yy = 1.5 * xx - 2
plt.plot(xx, yy)
[<matplotlib.lines.Line2D at 0x114509cf8>]
yy[240:300] = np.nan
plt.plot(xx, yy)
[<matplotlib.lines.Line2D at 0x1145d86d8>]
Masking is a way of adding additional information to an ndarray
to tell Python that a given entry is missing or otherwise invalid. We might want to do this for several reasons:
In all of these cases, we might want to mask the invalid data.
For example:
npts = 500
xx = np.linspace(0, 10, npts)
yy = np.sin(xx) + 0.8
plt.plot(xx, yy)
[<matplotlib.lines.Line2D at 0x114823470>]
Let's say all the readings below 0 are invalid. We can fix this. The numpy.ma
module contains functions relating to masked arrays, including functions to mask data inside a range, greater or less than a given value, or where a particular truth condition is met:
masked_yy = np.ma.masked_less(yy, 0) # mask where yy < 0
print(yy.mean())
print(masked_yy.mean())
0.982989174926 1.14678171355
Clearly, the second average is higher. This is because it excludes all the negative numbers.
Similarly, plots exclude masked values:
plt.plot(xx, masked_yy)
[<matplotlib.lines.Line2D at 0x1148dd978>]
This is potentially most useful with images. For instance, we can mask our brain images from earlier by constructing the mask ourselves:
# the mask should be logical and of the same size as arr
# here, we set all entries to True, which masks all data by default
mask = np.ones(arr.shape, dtype='bool')
# the mask should be true for the data we *don't* want to use
# unmask the window: x in [60, 75), y in [50, 90), z in [20, 60)
mask[60:75, 50:90, 20:60] = False
masked_arr = np.ma.masked_array(arr, mask=mask)
cc = (65, 85, 25)
plt.figure(figsize=(15, 15))
plt.subplot(1, 3, 1)
plt.imshow(masked_arr[cc[0], :, :], cmap='gray')
plt.subplot(1, 3, 2)
plt.imshow(masked_arr[:, cc[1], :], cmap='gray')
plt.subplot(1, 3, 3)
plt.imshow(masked_arr[:, :, cc[2]], cmap='gray')
<matplotlib.image.AxesImage at 0x114be6358>
In this way, we can restrict our analysis to a particular region of interest inside the image; all other data will be ignored. In real MRI analysis, the mask is often based on anatomy, so that we restrict our data to a particular brain structure or structures.
For more on this, there are lots of great resources: