We've put this together from our own experience and a number of sources; please check the references at the bottom of this document.
The goal of this tutorial is to provide you with a hands-on overview of two of the main libraries from the scientific and data analysis communities: NumPy and pandas.
What exactly are we going to do? Here's a high-level overview:
NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
import numpy as np
# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)
# init random gen
np.random.seed(2)
Think of ndarrays as the building blocks of the PyData ecosystem: a multidimensional array object that acts as a container for data to be passed between algorithms. Also, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any of it.
import numpy as np
# build an array using the array function
arr = np.array([0, 9, 5, 4, 3])
arr
array([0, 9, 5, 4, 3])
There are several functions that are used to create new arrays:
np.array
np.asarray
np.arange
np.ones
np.ones_like
np.zeros
np.zeros_like
np.empty
np.random.randn
and other functions from the random module
np.zeros(4)
array([ 0., 0., 0., 0.])
np.ones(4)
array([ 1., 1., 1., 1.])
np.empty(4)
array([ 0., 0., 0., 0.])
np.arange(4)
array([0, 1, 2, 3])
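The `_like` variants in the list above mirror the shape and dtype of an existing array instead of taking a size; a quick sketch:

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# same shape and dtype as arr, filled with ones
ones = np.ones_like(arr)
# same shape and dtype as arr, filled with zeros
zeros = np.zeros_like(arr)

print(ones.shape)   # (2, 3)
print(zeros.sum())  # 0
```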
NumPy's arrays are containers of homogeneous data, which means all elements must be of the same type. The 'dtype' property is an object that specifies the data type of each element, and the 'shape' property is a tuple that indicates the size of each dimension.
arr = np.random.randn(5)
arr
array([-0.4168, -0.0563, -2.1362, 1.6403, -1.7934])
arr.dtype
dtype('float64')
arr.shape
(5,)
# you can be explicit about the data type that you want
np.empty(4, dtype=np.int32)
array([ 0, 0, 0, 131072])
np.array(['numpy','pandas','pytables'], dtype=np.string_)
array(['numpy', 'pandas', 'pytables'], dtype='|S8')
float_arr = np.array([4.4, 5.52425, -0.1234, 98.1], dtype=np.float64)
# truncate the decimal part
float_arr.astype(np.int32)
array([ 4, 5, 0, 98])
arr = np.array([0, 9, 1, 4, 64])
arr[3]
4
arr[1:3]
array([9, 1])
arr[:2]
array([0, 9])
# set the last two elements to 555
arr[-2:] = 55
arr
array([ 0, 9, 1, 55, 55])
A good way to think about indexing in multidimensional arrays is that you are moving along the values of the shape property. So, a 4-D array arr_4d, with a shape of (w,x,y,z), will result in indexed views such that:
arr_4d[i].shape == (x,y,z)
arr_4d[i,j].shape == (y,z)
arr_4d[i,j,k].shape == (z,)
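We can verify these shape rules on a small example array:

```python
import numpy as np

# a 4-D array with shape (w, x, y, z) = (2, 3, 4, 5)
arr_4d = np.zeros((2, 3, 4, 5))

# each index consumes one leading dimension
print(arr_4d[0].shape)        # (3, 4, 5)
print(arr_4d[0, 1].shape)     # (4, 5)
print(arr_4d[0, 1, 2].shape)  # (5,)
```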
For the case of slices, what you are doing is selecting a range of elements along a particular axis:
arr_2d = np.array([[5,3,4],[0,1,2],[1,1,10],[0,0,0.1]])
arr_2d
array([[ 5. , 3. , 4. ], [ 0. , 1. , 2. ], [ 1. , 1. , 10. ], [ 0. , 0. , 0.1]])
# get the first row
arr_2d[0]
array([ 5., 3., 4.])
# get the first column
arr_2d[:,0]
array([ 5., 0., 1., 0.])
# get the first two rows
arr_2d[:2]
array([[ 5., 3., 4.], [ 0., 1., 2.]])
A slice does not return a copy; it returns a view, which means that any modifications will be reflected in the source array. This is a deliberate design decision in NumPy that avoids the memory and performance cost of copying data.
arr = np.array([0, 3, 1, 4, 64])
arr
array([ 0, 3, 1, 4, 64])
subarr = arr[2:4]
subarr[1] = 99
arr
array([ 0, 3, 1, 99, 64])
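When you do want an independent copy rather than a view, call .copy() on the slice:

```python
import numpy as np

arr = np.array([0, 3, 1, 4, 64])

# .copy() gives an independent array, so the source stays untouched
subarr = arr[2:4].copy()
subarr[1] = 99

print(arr)  # [ 0  3  1  4 64]
```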
Boolean indexing allows you to select data subsets of an array that satisfy a given condition.
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]
array([10])
arr_2d = np.random.randn(5)
arr_2d
array([-0.8417, 0.5029, -1.2453, -1.058 , -0.909 ])
arr_2d < 0
array([ True, False, True, True, True], dtype=bool)
arr_2d[arr_2d < 0]
array([-0.8417, -1.2453, -1.058 , -0.909 ])
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]
array([], dtype=float64)
arr_2d[arr_2d < 0] = 0
arr_2d
array([ 0. , 0.5029, 0. , 0. , 0. ])
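An alternative to masked assignment is np.where, which builds a new array instead of modifying the original in place:

```python
import numpy as np

arr = np.array([-0.8417, 0.5029, -1.2453, -1.058, -0.909])

# np.where(cond, a, b) picks from a where cond is True, else from b
result = np.where(arr < 0, 0, arr)
print(result)  # [0.     0.5029 0.     0.     0.    ]
```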
Fancy indexing is indexing with integer arrays.
arr = np.arange(18).reshape(6,3)
arr
array([[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11], [12, 13, 14], [15, 16, 17]])
# fancy selection of rows in a particular order
arr[[0,4,4]]
array([[ 0, 1, 2], [12, 13, 14], [12, 13, 14]])
# another row selection, this time in a different order
arr[[5,3,1]]
array([[15, 16, 17], [ 9, 10, 11], [ 3, 4, 5]])
# pair row and column indices to index into individual elements
arr[[5,3,1],[2,1,0]]
array([17, 10, 3])
# select a submatrix
arr[np.ix_([5,3,1],[2,1])]
array([[17, 16], [11, 10], [ 5, 4]])
Vectorization is at the heart of NumPy and it enables us to express operations without writing any for loops. Operations between arrays with equal shapes are performed element-wise.
arr = np.array([0, 9, 1.02, 4, 32])
arr - arr
array([ 0., 0., 0., 0., 0.])
arr * arr
array([ 0. , 81. , 1.0404, 16. , 1024. ])
Vectorized operations between arrays of different sizes and between arrays and scalars are subject to the rules of broadcasting. The idea is quite simple in many cases:
arr = np.array([0, 9, 1.02, 4, 64])
5 * arr
array([ 0. , 45. , 5.1, 20. , 320. ])
10 + arr
array([ 10. , 19. , 11.02, 14. , 74. ])
arr ** .5
array([ 0. , 3. , 1.01, 2. , 8. ])
The case of arrays of different shapes is slightly more complicated. The gist of it is that the shapes of the operands need to conform to certain rules. Don't worry if this does not make sense right away.
arr = np.random.randn(4,2)
arr
array([[ 0.5515, 2.2922], [ 0.0415, -1.1179], [ 0.5391, -0.5962], [-0.0191, 1.175 ]])
mean_row = np.mean(arr, axis=0)
mean_row
array([ 0.2782, 0.4383])
centered_rows = arr - mean_row
centered_rows
array([[ 0.2732, 1.8539], [-0.2367, -1.5562], [ 0.2608, -1.0344], [-0.2974, 0.7367]])
np.mean(centered_rows, axis=0)
array([-0., 0.])
mean_col = np.mean(arr, axis=1)
mean_col
array([ 1.4218, -0.5382, -0.0286, 0.5779])
centered_cols = arr - mean_col
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-bd5236897883> in <module>()
----> 1 centered_cols = arr - mean_col

ValueError: operands could not be broadcast together with shapes (4,2) (4)
# make the 1-D array a column vector
mean_col.reshape((4,1))
array([[ 1.4218], [-0.5382], [-0.0286], [ 0.5779]])
centered_cols = arr - mean_col.reshape((4,1))
centered_rows
array([[ 0.2732, 1.8539], [-0.2367, -1.5562], [ 0.2608, -1.0344], [-0.2974, 0.7367]])
centered_cols.mean(axis=1)
array([-0., 0., 0., -0.])
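An equivalent way to turn the 1-D array of row means into a column vector is indexing with np.newaxis, which inserts a new axis of length 1:

```python
import numpy as np

arr = np.random.randn(4, 2)
mean_col = arr.mean(axis=1)

# mean_col[:, np.newaxis] has shape (4, 1), so it broadcasts across columns
centered_cols = arr - mean_col[:, np.newaxis]

# each row of centered_cols now has mean ~0
print(np.allclose(centered_cols.mean(axis=1), 0))  # True
```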
Per the floating point standard IEEE 754, NaN is a floating point value that, by definition, is not equal to any other floating point value.
np.nan != np.nan
True
np.array([10,5,4,np.nan,1,np.nan]) == np.nan
array([False, False, False, False, False, False], dtype=bool)
np.isnan(np.array([10,5,4,np.nan,1,np.nan]))
array([False, False, False, True, False, True], dtype=bool)
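Since np.isnan returns a boolean mask, combining it with boolean indexing gives a compact way to drop the NaN entries:

```python
import numpy as np

arr = np.array([10, 5, 4, np.nan, 1, np.nan])

# ~ inverts the mask, keeping only the non-NaN entries
clean = arr[~np.isnan(arr)]
print(clean)  # [10.  5.  4.  1.]
```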
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
The heart of pandas is the DataFrame object for data manipulation. It features:
import pandas as pd
pd.set_printoptions(precision=3, notebook_repr_html=True)
The pandas Series is the simplest data structure to start with. It is a subclass of ndarray that supports more meaningful indices.
import pandas as pd
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print ser
0     2.0000
1     1.0000
2     5.0000
3     0.9700
4     3.0000
5    10.0000
6     0.0599
7     8.0000
dtype: float64
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
ser = pd.Series(data=values, index=labels)
print ser
A     2.00
B     1.00
C     5.00
D     0.97
E     3.00
F    10.00
G     0.06
H     8.00
dtype: float64
movie_rating = {
'age': 1,
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'occupation': 10,
'rating': 5,
'timestamp': 978300760,
'title': "One Flew Over the Cuckoo's Nest (1975)",
'user_id': 1,
'zip': '48067'
}
ser = pd.Series(movie_rating)
print ser
age                                                1
gender                                             F
genres                                         Drama
movie_id                                        1193
occupation                                        10
rating                                             5
timestamp                                  978300760
title         One Flew Over the Cuckoo's Nest (1975)
user_id                                            1
zip                                            48067
dtype: object
ser.index
Index([u'age', u'gender', u'genres', u'movie_id', u'occupation', u'rating', u'timestamp', u'title', u'user_id', u'zip'], dtype=object)
ser.values
array([1, F, Drama, 1193, 10, 5, 978300760, One Flew Over the Cuckoo's Nest (1975), 1, 48067], dtype=object)
ser[0]
1
ser['gender']
'F'
ser.get_value('gender')
'F'
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])
print ser_1 + ser_2
A     6
B   NaN
C     9
G   NaN
dtype: float64
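If you'd rather treat a missing label as zero instead of propagating NaN, the Series add method accepts a fill_value:

```python
import pandas as pd

ser_1 = pd.Series(data=[1, 3, 4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5, 5, 5], index=['A', 'G', 'C'])

# labels missing from either side are treated as 0 before adding
total = ser_1.add(ser_2, fill_value=0)
print(total)
```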
# build from a dict of equal-length lists or ndarrays
pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})
|  | col_1 | col_2 |
|---|---|---|
| 0 | 0.12 | 0.9 |
| 1 | 7.00 | 9.0 |
| 2 | 45.00 | 34.0 |
| 3 | 10.00 | 11.0 |
You can explicitly set the column names and index values as well.
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'])
|  | col_1 | col_2 | col_3 |
|---|---|---|---|
| 0 | 0.12 | 0.9 | NaN |
| 1 | 7.00 | 9.0 | NaN |
| 2 | 45.00 | 34.0 | NaN |
| 3 | 10.00 | 11.0 | NaN |
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
|  | col_1 | col_2 | col_3 |
|---|---|---|---|
| obs1 | 0.12 | 0.9 | NaN |
| obs2 | 7.00 | 9.0 | NaN |
| obs3 | 45.00 | 34.0 | NaN |
| obs4 | 10.00 | 11.0 | NaN |
You can also think of it as a dictionary of Series objects.
movie_rating = {
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'rating': 5,
'timestamp': 978300760,
'user_id': 1,
}
ser_1 = pd.Series(movie_rating)
ser_2 = pd.Series(movie_rating)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.columns.name = 'rating_events'
df.index.name = 'rating_data'
df
| rating_events | r_1 | r_2 |
|---|---|---|
| rating_data |  |  |
| gender | F | F |
| genres | Drama | Drama |
| movie_id | 1193 | 1193 |
| rating | 5 | 5 |
| timestamp | 978300760 | 978300760 |
| user_id | 1 | 1 |
df = df.T
df
| rating_data | gender | genres | movie_id | rating | timestamp | user_id |
|---|---|---|---|---|---|---|
| rating_events |  |  |  |  |  |  |
| r_1 | F | Drama | 1193 | 5 | 978300760 | 1 |
| r_2 | F | Drama | 1193 | 5 | 978300760 | 1 |
df.columns
Index([u'gender', u'genres', u'movie_id', u'rating', u'timestamp', u'user_id'], dtype=object)
df.index
Index([u'r_1', u'r_2'], dtype=object)
df.values
array([[F, Drama, 1193, 5, 978300760, 1], [F, Drama, 1193, 5, 978300760, 1]], dtype=object)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.drop('genres', axis=0)
| rating_data | r_1 | r_2 |
|---|---|---|
| gender | F | F |
| movie_id | 1193 | 1193 |
| rating | 5 | 5 |
| timestamp | 978300760 | 978300760 |
| user_id | 1 | 1 |
df.drop('r_1', axis=1)
| rating_data | r_2 |
|---|---|
| gender | F |
| genres | Drama |
| movie_id | 1193 |
| rating | 5 |
| timestamp | 978300760 |
| user_id | 1 |
# careful with the order here
df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]
df
| rating_data | r_1 | r_2 | r_3 |
|---|---|---|---|
| gender | F | F | F |
| genres | Drama | Drama | Drama |
| movie_id | 1193 | 1193 | 1193 |
| rating | 5 | 5 | 5 |
| timestamp | 978300760 | 978300760 | 978300760 |
| user_id | 1 | 1 | 1 |
You can index into a column using its label, or with dot notation:
df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
df['col_1']
obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64
df.col_1
obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64
You can also use multiple columns to select a subset of them:
df[['col_2', 'col_1']]
|  | col_2 | col_1 |
|---|---|---|
| obs1 | 0.9 | 0.12 |
| obs2 | 9.0 | 7.00 |
| obs3 | 34.0 | 45.00 |
| obs4 | 11.0 | 10.00 |
The .ix method gives you the most flexibility to index into certain rows, or even rows and columns:
df.ix['obs3']
col_1     45
col_2     34
col_3    NaN
Name: obs3, dtype: object
df.ix[0]
col_1    0.12
col_2     0.9
col_3     NaN
Name: obs1, dtype: object
df.ix[:2]
|  | col_1 | col_2 | col_3 |
|---|---|---|---|
| obs1 | 0.12 | 0.9 | NaN |
| obs2 | 7.00 | 9.0 | NaN |
df.ix[:2, 'col_2']
obs1    0.9
obs2    9.0
Name: col_2, dtype: float64
df.ix[:2, ['col_1', 'col_2']]
|  | col_1 | col_2 |
|---|---|---|
| obs1 | 0.12 | 0.9 |
| obs2 | 7.00 | 9.0 |
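Note that .ix was later deprecated and removed from pandas; on current versions the same selections are written with .loc (label-based) and .iloc (position-based):

```python
import pandas as pd

df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
                  columns=['col_1', 'col_2', 'col_3'],
                  index=['obs1', 'obs2', 'obs3', 'obs4'])

# label-based: row 'obs3', column 'col_1'
by_label = df.loc['obs3', 'col_1']
# position-based: the first row
first_row = df.iloc[0]
# label slices include both endpoints
subset = df.loc[:'obs2', ['col_1', 'col_2']]
print(by_label)  # 45.0
```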
Loading of the MovieLens dataset here is based on the intro chapter of 'Python for Data Analysis'.
The MovieLens data is spread across three files. Using the pd.read_table function, we load each file:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../data/ml-1m/users.dat',
sep='::', header=None, names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../data/ml-1m/ratings.dat',
sep='::', header=None, names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../data/ml-1m/movies.dat',
sep='::', header=None, names=mnames)
# show how one of them looks
ratings.head(5)
|  | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |