A Beginner's Introduction to Pandas

Welcome!

About this tutorial

We've put this together from our experience and a number of sources; please check the references at the bottom of this document.

What this tutorial is

The goal of this tutorial is to provide you with a hands-on overview of two of the main libraries from the scientific and data analysis communities. We're going to use:

  • NumPy (Numerical Python)
  • pandas (Python Data Analysis Library)

What this tutorial is not

  • A machine learning course
  • A Python course
  • An advanced Big Data course
  • An exhaustive overview of the recommendation literature
  • A set of recipes that will win you the next Netflix/Kaggle/? challenge.

Roadmap

What exactly are we going to do? Here's a high-level overview:

  • learn about NumPy arrays
  • learn about pandas DataFrames
  • load and take a first look at the MovieLens dataset

NumPy: Numerical Python (30 min)

What is it?

It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

In [2]:
import numpy as np

# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)

# init random gen
np.random.seed(2)

NumPy's basic data structure: the ndarray

Think of ndarrays as the building blocks of the pydata ecosystem. The ndarray is a multidimensional array object that acts as a container for data to be passed between algorithms. In addition, libraries written in a lower-level language, such as C or Fortran, can operate on the data stored in a NumPy array without copying any of it.

In [7]:
import numpy as np

# build an array using the array function
arr = np.array([0, 9, 5, 4, 3])
arr
Out[7]:
array([0, 9, 5, 4, 3])

Array creation examples

There are several functions that are used to create new arrays:

  • np.array
  • np.asarray
  • np.arange
  • np.ones
  • np.ones_like
  • np.zeros
  • np.zeros_like
  • np.empty
  • np.random.randn and other funcs from the random module
In [9]:
np.zeros(4)
Out[9]:
array([ 0.,  0.,  0.,  0.])
In [10]:
np.ones(4)
Out[10]:
array([ 1.,  1.,  1.,  1.])
In [11]:
# np.empty does not initialise its memory, so the contents are arbitrary
np.empty(4)
Out[11]:
array([ 0.,  0.,  0.,  0.])
In [12]:
np.arange(4)
Out[12]:
array([0, 1, 2, 3])
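
The remaining functions from the list above work the same way. A minimal, illustrative sketch (not part of the original notebook) showing np.ones_like, np.zeros_like, np.asarray and np.random.randn:

template = np.arange(4)

# ones_like / zeros_like copy the shape and dtype of an existing array
np.ones_like(template)     # array([1, 1, 1, 1])
np.zeros_like(template)    # array([0, 0, 0, 0])

# asarray converts a sequence to an ndarray, avoiding a copy if it already is one
np.asarray([1.5, 2.5])

# samples from the standard normal distribution
np.random.randn(4)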

dtype and shape

NumPy's arrays are containers of homogeneous data, which means all elements are of the same type. The 'dtype' property is an object that specifies the data type of each element. The 'shape' property is a tuple that indicates the size of each dimension.

In [13]:
arr = np.random.randn(5)
arr
Out[13]:
array([-0.4168, -0.0563, -2.1362,  1.6403, -1.7934])
In [14]:
arr.dtype
Out[14]:
dtype('float64')
In [15]:
arr.shape
Out[15]:
(5,)
In [16]:
# you can be explicit about the data type that you want
np.empty(4, dtype=np.int32)
Out[16]:
array([     0,      0,      0, 131072])
In [17]:
np.array(['numpy','pandas','pytables'], dtype=np.string_)
Out[17]:
array(['numpy', 'pandas', 'pytables'], 
      dtype='|S8')
In [18]:
float_arr = np.array([4.4, 5.52425, -0.1234, 98.1], dtype=np.float64)
# truncate the decimal part
float_arr.astype(np.int32)
Out[18]:
array([ 4,  5,  0, 98])
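
For multidimensional arrays, the shape tuple has one entry per dimension. A quick sketch (the sizes are illustrative):

arr2 = np.zeros((3, 4))
arr2.shape    # (3, 4)
arr2.dtype    # dtype('float64') -- the default for np.zeros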

Indexing and slicing

Just what you would expect from Python

In [20]:
arr = np.array([0, 9, 1, 4, 64])
arr[3]
Out[20]:
4
In [21]:
arr[1:3]
Out[21]:
array([9, 1])
In [22]:
arr[:2]
Out[22]:
array([0, 9])
In [23]:
# set the last two elements to 55
arr[-2:] = 55
arr
Out[23]:
array([ 0,  9,  1, 55, 55])

Indexing behaviour for multidimensional arrays

A good way to think about indexing in multidimensional arrays is that you are moving along the values of the shape property. So a 4-D array arr_4d with shape (w, x, y, z) will produce indexed views such that (see the sketch after this list):

  • arr_4d[i].shape == (x,y,z)
  • arr_4d[i,j].shape == (y,z)
  • arr_4d[i,j,k].shape == (z,)
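
A minimal sketch of this rule (the values don't matter, only the shapes):

arr_4d = np.zeros((2, 3, 4, 5))   # w=2, x=3, y=4, z=5
arr_4d[0].shape          # (3, 4, 5)
arr_4d[0, 1].shape       # (4, 5)
arr_4d[0, 1, 2].shape    # (5,)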

For the case of slices, what you are doing is selecting a range of elements along a particular axis:

In [24]:
arr_2d = np.array([[5,3,4],[0,1,2],[1,1,10],[0,0,0.1]])
arr_2d
Out[24]:
array([[  5. ,   3. ,   4. ],
       [  0. ,   1. ,   2. ],
       [  1. ,   1. ,  10. ],
       [  0. ,   0. ,   0.1]])
In [25]:
# get the first row
arr_2d[0]
Out[25]:
array([ 5.,  3.,  4.])
In [26]:
# get the first column
arr_2d[:,0]
Out[26]:
array([ 5.,  0.,  1.,  0.])
In [27]:
# get the first two rows
arr_2d[:2]
Out[27]:
array([[ 5.,  3.,  4.],
       [ 0.,  1.,  2.]])

Careful, it's a view!

A slice does not return a copy, which means that any modifications will be reflected in the source array. This is a deliberate design decision in NumPy that avoids copying data when working with large arrays; if you need an independent copy, you have to ask for one explicitly (see below).

In [28]:
arr = np.array([0, 3, 1, 4, 64])
arr
Out[28]:
array([ 0,  3,  1,  4, 64])
In [29]:
subarr = arr[2:4]
subarr[1] = 99
arr
Out[29]:
array([ 0,  3,  1, 99, 64])
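
If you need an independent array, ask for a copy explicitly. A short sketch:

arr = np.array([0, 3, 1, 4, 64])
subcopy = arr[2:4].copy()
subcopy[1] = 99    # arr is left untouched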

(Fancy) Boolean indexing

Boolean indexing allows you to select data subsets of an array that satisfy a given condition.

In [31]:
arr = np.array([10, 20])
idx = np.array([True, False])
arr[idx]
Out[31]:
array([10])
In [32]:
# note: despite its name, this is a 1-D array of 5 random values
arr_2d = np.random.randn(5)
arr_2d
Out[32]:
array([-0.8417,  0.5029, -1.2453, -1.058 , -0.909 ])
In [33]:
arr_2d < 0
Out[33]:
array([ True, False,  True,  True,  True], dtype=bool)
In [34]:
arr_2d[arr_2d < 0]
Out[34]:
array([-0.8417, -1.2453, -1.058 , -0.909 ])
In [35]:
arr_2d[(arr_2d > -0.5) & (arr_2d < 0)]
Out[35]:
array([], dtype=float64)
In [36]:
arr_2d[arr_2d < 0] = 0
arr_2d
Out[36]:
array([ 0.    ,  0.5029,  0.    ,  0.    ,  0.    ])
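
Conditions can be combined with | (or) and negated with ~ (not); note that NumPy needs these operators rather than Python's and/or/not. An illustrative sketch:

vals = np.array([-2.0, -0.3, 0.5, 1.2])
vals[(vals < -1) | (vals > 1)]    # array([-2. ,  1.2])
vals[~(vals < 0)]                 # array([ 0.5,  1.2])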

(Fancy) list-of-locations indexing

Fancy indexing is indexing with integer arrays.

In [3]:
arr = np.arange(18).reshape(6,3)
arr
Out[3]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17]])
In [39]:
# fancy selection of rows in a particular order
arr[[0,4,4]]
Out[39]:
array([[ 0,  1,  2],
       [12, 13, 14],
       [12, 13, 14]])
In [9]:
# select rows 5, 3 and 1, in that order
arr[[5,3,1]]
Out[9]:
array([[15, 16, 17],
       [ 9, 10, 11],
       [ 3,  4,  5]])
In [10]:
# pair up row and column indices to pick out individual elements (the result is 1-D)
arr[[5,3,1],[2,1,0]]
Out[10]:
array([17, 10,  3])
In [41]:
# select a submatrix
arr[np.ix_([5,3,1],[2,1])]
Out[41]:
array([[17, 16],
       [11, 10],
       [ 5,  4]])
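
Unlike slicing, fancy indexing always copies the data into a new array, so modifying the result does not touch the original. A quick sketch using the 6x3 array above:

sub = arr[[0, 1]]    # a copy of rows 0 and 1
sub[0, 0] = 999      # arr[0, 0] is still 0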

Vectorization

Vectorization is at the heart of NumPy and it enables us to express operations without writing any for loops. Operations between arrays with equal shapes are performed element-wise.

In [44]:
arr = np.array([0, 9, 1.02, 4, 32])
arr - arr
Out[44]:
array([ 0.,  0.,  0.,  0.,  0.])
In [45]:
arr * arr
Out[45]:
array([    0.    ,    81.    ,     1.0404,    16.    ,  1024.    ])
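
To make the point concrete, here is a small sketch (names are illustrative) comparing an explicit Python loop with the equivalent vectorized expression:

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# loop version
prod = np.empty_like(a)
for i in range(len(a)):
    prod[i] = a[i] * b[i]

# vectorized version: same result, no explicit loop
prod_vec = a * b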

Broadcasting Rules

Vectorized operations between arrays of different sizes and between arrays and scalars are subject to the rules of broadcasting. The idea is quite simple in many cases:

In [47]:
arr = np.array([0, 9, 1.02, 4, 64])
5 * arr 
Out[47]:
array([   0. ,   45. ,    5.1,   20. ,  320. ])
In [48]:
10 + arr
Out[48]:
array([ 10.  ,  19.  ,  11.02,  14.  ,  74.  ])
In [49]:
arr ** .5
Out[49]:
array([ 0.  ,  3.  ,  1.01,  2.  ,  8.  ])

The case of arrays of different shapes is slightly more complicated. The gist of it is that the shapes of the operands need to conform to certain rules: comparing dimensions from the trailing end, each pair must either be equal or one of them must be 1 (a missing dimension counts as 1). Don't worry if this does not make sense right away.

In [51]:
arr = np.random.randn(4,2)
arr
Out[51]:
array([[ 0.5515,  2.2922],
       [ 0.0415, -1.1179],
       [ 0.5391, -0.5962],
       [-0.0191,  1.175 ]])
In [52]:
mean_row = np.mean(arr, axis=0)
mean_row
Out[52]:
array([ 0.2782,  0.4383])
In [53]:
centered_rows = arr - mean_row
centered_rows
Out[53]:
array([[ 0.2732,  1.8539],
       [-0.2367, -1.5562],
       [ 0.2608, -1.0344],
       [-0.2974,  0.7367]])
In [54]:
np.mean(centered_rows, axis=0)
Out[54]:
array([-0.,  0.])
In [55]:
mean_col = np.mean(arr, axis=1)
mean_col
Out[55]:
array([ 1.4218, -0.5382, -0.0286,  0.5779])
In [56]:
centered_cols = arr - mean_col
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-56-bd5236897883> in <module>()
----> 1 centered_cols = arr - mean_col

ValueError: operands could not be broadcast together with shapes (4,2) (4) 
In [57]:
# make the 1-D array a column vector
mean_col.reshape((4,1))
Out[57]:
array([[ 1.4218],
       [-0.5382],
       [-0.0286],
       [ 0.5779]])
In [58]:
centered_cols = arr - mean_col.reshape((4,1))
centered_rows
Out[58]:
array([[ 0.2732,  1.8539],
       [-0.2367, -1.5562],
       [ 0.2608, -1.0344],
       [-0.2974,  0.7367]])
In [59]:
centered_cols.mean(axis=1)
Out[59]:
array([-0.,  0.,  0., -0.])
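
An equivalent and quite common alternative to reshape is indexing with np.newaxis to turn the 1-D array into a column vector (a sketch, not part of the original notebook):

centered_cols = arr - mean_col[:, np.newaxis]    # same as mean_col.reshape((4, 1))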

A note about NaNs

Per the IEEE 754 floating point standard, NaN is a floating point value that, by definition, is not equal to any other floating point value, including itself.

In [60]:
np.nan != np.nan
Out[60]:
True
In [61]:
np.array([10,5,4,np.nan,1,np.nan]) == np.nan
Out[61]:
array([False, False, False, False, False, False], dtype=bool)
In [62]:
np.isnan(np.array([10,5,4,np.nan,1,np.nan]))
Out[62]:
array([False, False, False,  True, False,  True], dtype=bool)
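
Since == cannot be used to find NaNs, np.isnan (or its negation) is the tool for filtering them out. A short sketch:

vals = np.array([10, 5, 4, np.nan, 1, np.nan])
vals[~np.isnan(vals)]    # array([ 10.,   5.,   4.,   1.])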

pandas: Python Data Analysis Library (30 min)

What is it?

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

The heart of pandas is the DataFrame object for data manipulation. It features:

  • a powerful index object
  • data alignment
  • handling of missing data
  • aggregation with groupby
  • data manipulation via reshape, pivot, slice, merge, join
In [68]:
import pandas as pd

pd.set_printoptions(precision=3, notebook_repr_html=True)

Series: labelled arrays

The pandas Series is the simplest data structure to start with. It is a subclass of ndarray that supports more meaningful indices.

Let's look at some creation examples for Series

In [11]:
import pandas as pd

values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
ser = pd.Series(values)
print ser
0     2.0000
1     1.0000
2     5.0000
3     0.9700
4     3.0000
5    10.0000
6     0.0599
7     8.0000
dtype: float64

In [71]:
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
ser = pd.Series(data=values, index=labels)
print ser
A     2.00
B     1.00
C     5.00
D     0.97
E     3.00
F    10.00
G     0.06
H     8.00
dtype: float64

In [5]:
movie_rating = {
    'age': 1,
    'gender': 'F',
    'genres': 'Drama',
    'movie_id': 1193,
    'occupation': 10,
    'rating': 5,
    'timestamp': 978300760,
    'title': "One Flew Over the Cuckoo's Nest (1975)",
    'user_id': 1,
    'zip': '48067'
    }
ser = pd.Series(movie_rating)
print ser
age                                                1
gender                                             F
genres                                         Drama
movie_id                                        1193
occupation                                        10
rating                                             5
timestamp                                  978300760
title         One Flew Over the Cuckoo's Nest (1975)
user_id                                            1
zip                                            48067
dtype: object

In [6]:
ser.index
Out[6]:
Index([u'age', u'gender', u'genres', u'movie_id', u'occupation', u'rating', u'timestamp', u'title', u'user_id', u'zip'], dtype=object)
In [7]:
ser.values
Out[7]:
array([1, F, Drama, 1193, 10, 5, 978300760,
       One Flew Over the Cuckoo's Nest (1975), 1, 48067], dtype=object)

Series indexing

In [8]:
ser[0]
Out[8]:
1
In [9]:
ser['gender']
Out[9]:
'F'
In [10]:
ser.get_value('gender')
Out[10]:
'F'
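
You can also select several entries at once by passing a list of labels; a minimal sketch using the Series above:

ser[['age', 'rating']]    # returns a new Series holding just those two entries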

Operations between Series with different index objects

In [12]:
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])
print ser_1 + ser_2
A     6
B   NaN
C     9
G   NaN
dtype: float64
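
If the NaNs produced by alignment are not what you want, the arithmetic methods accept a fill_value. A sketch of the idea:

print ser_1.add(ser_2, fill_value=0)
# missing labels are treated as 0, so B becomes 3 and G becomes 5 instead of NaN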

DataFrame

The DataFrame is the 2-dimensional version of a Series. You can think of it as a spreadsheet whose columns are Series objects.

Let's look at some creation examples for DataFrame.

In [13]:
# build from a dict of equal-length lists or ndarrays
pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})
Out[13]:
col_1 col_2
0 0.12 0.9
1 7.00 9.0
2 45.00 34.0
3 10.00 11.0

You can explicitly set the column names and index values as well.

In [15]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
             columns=['col_1', 'col_2', 'col_3'])
Out[15]:
col_1 col_2 col_3
0 0.12 0.9 NaN
1 7.00 9.0 NaN
2 45.00 34.0 NaN
3 10.00 11.0 NaN
In [16]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
             columns=['col_1', 'col_2', 'col_3'],
             index=['obs1', 'obs2', 'obs3', 'obs4'])
Out[16]:
col_1 col_2 col_3
obs1 0.12 0.9 NaN
obs2 7.00 9.0 NaN
obs3 45.00 34.0 NaN
obs4 10.00 11.0 NaN

You can also think of it as a dictionary of Series objects.

In [17]:
movie_rating = {
    'gender': 'F',
    'genres': 'Drama',
    'movie_id': 1193,
    'rating': 5,
    'timestamp': 978300760,
    'user_id': 1,
    }
ser_1 = pd.Series(movie_rating)
ser_2 = pd.Series(movie_rating)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.columns.name = 'rating_events'
df.index.name = 'rating_data'
df
Out[17]:
rating_events r_1 r_2
rating_data
gender F F
genres Drama Drama
movie_id 1193 1193
rating 5 5
timestamp 978300760 978300760
user_id 1 1
In [18]:
df = df.T
df
Out[18]:
rating_data gender genres movie_id rating timestamp user_id
rating_events
r_1 F Drama 1193 5 978300760 1
r_2 F Drama 1193 5 978300760 1
In [19]:
df.columns 
Out[19]:
Index([u'gender', u'genres', u'movie_id', u'rating', u'timestamp', u'user_id'], dtype=object)
In [20]:
df.index
Out[20]:
Index([u'r_1', u'r_2'], dtype=object)
In [21]:
df.values
Out[21]:
array([[F, Drama, 1193, 5, 978300760, 1],
       [F, Drama, 1193, 5, 978300760, 1]], dtype=object)

Adding/Deleting entries

In [22]:
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.drop('genres', axis=0)
Out[22]:
r_1 r_2
rating_data
gender F F
movie_id 1193 1193
rating 5 5
timestamp 978300760 978300760
user_id 1 1
In [23]:
df.drop('r_1', axis=1)
Out[23]:
r_2
rating_data
gender F
genres Drama
movie_id 1193
rating 5
timestamp 978300760
user_id 1
In [24]:
# careful: the values must be given in the same order as the DataFrame's index
df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]
df
Out[24]:
r_1 r_2 r_3
rating_data
gender F F F
genres Drama Drama Drama
movie_id 1193 1193 1193
rating 5 5 5
timestamp 978300760 978300760 978300760
user_id 1 1 1
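
Note that drop returns a new DataFrame and leaves the original untouched. To remove a column in place you can use del, as sketched below:

del df['r_3']    # df is back to just r_1 and r_2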

DataFrame indexing

You can index into a column using its label, or with dot notation

In [26]:
df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
                  columns=['col_1', 'col_2', 'col_3'],
                  index=['obs1', 'obs2', 'obs3', 'obs4'])
df['col_1']
Out[26]:
obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64
In [27]:
df.col_1
Out[27]:
obs1     0.12
obs2     7.00
obs3    45.00
obs4    10.00
Name: col_1, dtype: float64

You can also pass a list of column names to select a subset of columns:

In [28]:
df[['col_2', 'col_1']]
Out[28]:
col_2 col_1
obs1 0.9 0.12
obs2 9.0 7.00
obs3 34.0 45.00
obs4 11.0 10.00

The .ix indexer gives you the most flexibility, letting you index into specific rows, or into rows and columns at the same time:

In [29]:
df.ix['obs3']
Out[29]:
col_1     45
col_2     34
col_3    NaN
Name: obs3, dtype: object
In [30]:
df.ix[0]
Out[30]:
col_1    0.12
col_2     0.9
col_3     NaN
Name: obs1, dtype: object
In [31]:
df.ix[:2]
Out[31]:
col_1 col_2 col_3
obs1 0.12 0.9 NaN
obs2 7.00 9.0 NaN
In [32]:
df.ix[:2, 'col_2']
Out[32]:
obs1    0.9
obs2    9.0
Name: col_2, dtype: float64
In [33]:
df.ix[:2, ['col_1', 'col_2']]
Out[33]:
col_1 col_2
obs1 0.12 0.9
obs2 7.00 9.0
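
Boolean arrays work for row selection on DataFrames too, just as they do on NumPy arrays. A short sketch using the df defined above:

df[df['col_1'] > 5]    # keeps obs2, obs3 and obs4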

The MovieLens dataset: loading and first look

Loading of the MovieLens dataset here is based on the intro chapter of Python for Data Analysis.

The MovieLens data is spread across three files. Using the pd.read_table function, we load each one:

In [37]:
import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../data/ml-1m/users.dat',
                      sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../data/ml-1m/ratings.dat',
                        sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../data/ml-1m/movies.dat',
                       sep='::', header=None, names=mnames)

# show how one of them looks
ratings.head(5)
Out[37]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
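
A few quick ways to get a first feel for the loaded tables (a sketch of typical inspection calls, not output from the original notebook):

print len(users), len(ratings), len(movies)    # number of rows in each table
users.head()                     # first few user records
ratings['rating'].describe()     # summary statistics for the rating column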

What are we going to do next?

  • Playing with Recommender Systems

References and further reading

  • William Wesley McKinney. Python for Data Analysis. O'Reilly, 2012.