Extracts from an early draft of a series of notebooks developed for the forthcoming Open University course TM351, codename "The Data Course", due for release in October 2015.
# Warning suppression - for python 3 users...
# pandas is throwing quite a lot of warnings at the moment becuase not everything currently works in python 3.4
# Uncomment the following two lines and run this block to suppress warnings
#import warnings
#warnings.filterwarnings('ignore')
Whilst Python is a general purpose scripting language, when working in a particular domain - such as the world of data - it can be useful to make use of code libraries that provide higher level data structures and operations that are suited to that domain.
A library we will be drawing on heavily in this course is pandas. If data is represented in an appropriate format, it can make it much easier to work with. panads provides just such a medium. In this notebook, you will learn about two key pandas data structures: lists and dataframes.
To start with we need to load in the pandas library. By convention we associate this with the convenience label pd.
import pandas as pd
Python lists are flexible, mutable data structures that can be used to repreent an ordered list of objects.
simple_list=['apples', 'oranges', 'bananas','pears']
Associated with each list is a numerical index value, with a count starting at zero, that identifies the position of each list member. The N'th list member has index value N-1.
print("First list item (index value 1-1 = 0): simple_list[0] -> {0}".format(simple_list[0]))
print("Third list item (index value 3-1 = 2): simple_list[2] -> {0}".format(simple_list[2]))
Python also supports unordered associative arrays in the form of dicts that allow you to index a value by name:
simple_dict={1:'one', 2:'two', 4:'four', 3:'three'}
print( "Item with key (index) '4' has value: {0}".format(simple_dict[4]))
We can inspect the keys and the values contained within a dict directly:
print( list(simple_dict.keys()) )
print( list(simple_dict.values()) )
With a simple list or dict, we can use a list or dict comprehension to filter the contents of an object according to a test condition. For example, we can test against the value of the keys in a dict and generate a list containing associated dict values.
[simple_dict[k] for k in simple_dict if k>2]
#The following statement swaps the keys and values contained in simple_dict
alternative_dict= dict(zip(simple_dict.values(),simple_dict.keys()))
#How would you generate a list containing the keys of alternative_dict for dict values < 3?
##YOUR CODE HERE
In the following sections, you will be introduced to two powerful data representations supported by the pandas library: Series and DataFrames. The introduction provides a quick overview of some of the operations that are possible using these dataframes. We will be revisiting many of the operations in more depth later in the course, so for now just try to get a feel for what's possible...
A pandas Series combines the idea of a list with an additional index column:
pd.Series( ['one', 'two', 'three', 'four'] )
We can simplify the calling of Series by loading the Series method in from the pandas library explicitly.
from pandas import Series
simple_series = Series( [1, 2, 3, 4] )
simple_series
We can index into a Series using the corresponding index value:
simple_series[1]
We can also grab several values at once if we pass the desired index values in as a list in the order we want them to be displayed:
simple_series[ [1,0,3] ]
In much the same way that we can inspect the keys used in a Python dict, we can inspect the index values used within a Series.
simple_series.index
We can also define our own index values:
myindex_series = Series( [1, 2, 3, 4], index=['one', 'two', 'three', 'four'] )
print( myindex_series.index )
myindex_series
Again, it's easy enough to pull out several values from the Series by providing several of our own index values in a list:
myindex_series[ ['two','four'] ]
Series can also be created from a simple dict, where the unique key values become index values:
Series( {'Q1':'Spring', 'Q2':'Summer','Q3':'Autumn', 'Q4':'Winter'} )
Series( {'Q1':'Spring', 'Q2':'Summer','Q3':'Autumn', 'Q4':'Winter'}, index=['Q4','Q3','Q2','Q1'] )
Construct a series containing the names of the days of the week and a day number index, ordering the days Monday to Sunday with day count 1 to 7.
#YOUR CODE HERE
In certain respects, we can think of a Series as a vector. For example, if we multiply a Series by a scalar value, each member of the series is multiplied by that value:
5 * Series([1,1,2,3,4])
Series([1,1,2,3,4]) - 7.1
Series([1,1,2,3,4]) + Series([10,10,15,-15,-20])
A very useful feature of Series is that we can filter their values by value. The values in the Series (not the index values) are tested against the condition and the series elements that pass the test are returned, along with their index values:
myindex_series[ myindex_series>2 ]
This takes a little bit of thinking about... Let's see what the index expression is returning:
myindex_series>2
Graphically, we might think of this operation in the following terms:
Assign your "days of the week" Series to a variable daysOfWeek. Find a way of creating a new Series containing just the weekdays by filtering daysOfWeek.
#YOUR CODE HERE
In many situtations, it can be useful to be able add Series together that share some common index values, even if the Series are presented in a different order. For example, imagine you have one series that represents the total amount of expenditure spent with a range of companies across several projects, and separate series containing the value of payments made to just those companies who worked on a particular project:
totalSpend=Series({"Company A":0,"Company B":0, "Company C":0, "Company D":0})
totalSpend
project1=Series([1000, 2000, 500], index=["Company A","Company B", "Company C"])
project2=Series([800, 2000 ], index=["Company D", "Company A"])
project2
What happens if we try adding various combinations of totalSpend, project1 and project2?
totalSpend+project1
project1+project2
If any index values aren't matched across the Series, the sum for that index value returns NaN - not a number. However, using the Series add method we can force missing values to be treated as a particular value, such as 0, using a fill_value parameter:
totalSpend.add(project1,fill_value=0)
As the result is itself a Series, we can also chain the add expression:
totalSpend.add(project1,fill_value=0).add(project2,fill_value=0)
DataFrames are two dimensional data tables in which rows of data have values spread across one or more columns, much like a sheet in a spreadsheet. Each column behaves as if it is a Series; a DataFrame can thus be thought of as a dict of Series, where dict keys correspond to column names. That said, the spreadsheet idea is probably easier to desrcibe to non-Python programmers!
from pandas import DataFrame
courseData= { 'courseCode': ['TM351','TU100','M269'],
'points':[30,60,30],
'level':['3','1','2']
}
DataFrame( courseData )
We can also create a DataFrame from a list of tuples.
DataFrame([('a',1,2),('b',2,3)], columns=["alpha",'num1','num2'])
We can pull out any required column as a Series by using the column name as a key value:
course_df=DataFrame( courseData )
course_df['courseCode']
course_df.courseCode
To pull out several columns, provide a list of the column names you want to extract:
course_df[ ['courseCode','level'] ]
We can force the ordering of columns in the DataFrame by means of the columns variable. If a column name is specified that does not have a match in the keys of the source data dict, an 'empty' column is created with all the values set to NaN.
course_df=DataFrame( courseData, columns=['level','courseCode','title','points'] )
course_df
We can populate a column with the same value in each cell using a 'scalar' assignment:
course_df.title='Unknown'
course_df
We can also set column values from a Series:
course_df.title = Series( ['The data course','The foundation course','The algorithms course'] )
course_df
We can pull out the values of a column in a similar way:
course_df.title
EXERCISE: What problems can you foresee with setting the values within a column from a Series? Does the original assignment of data to the dataframe from the coursedata dict suffer from similar problems and if so how? How might you get around such problems?
Here's what I thought...
If the Series elements are in a different order to the DataFrame index order, the titles will be incorrectly assigned to the course code indexed rows. A similar problem exists in the original dict - we rely on the order of items in each independent list to be in synch with the order of all the other lists. One way round the problem would be to put the values for each row in a tuple and label the columns explictly.
One of the reasons for using a library such as pandas is that it offers data structures that make working with data as easy as possible.
Sometimes, this requires a little careful thinking about how best to organise the data within a DataFrame.
For example, for the course information dataframe, it may be most useful to use the course codes as the index values.
course_df=course_df.set_index('courseCode')
course_df
The visual presentation of this is slightly misleading - it looks as if there may be a blank row with index value courseCode, but the size of the trasnformed table is correctly shown; one of the columns in the original DataFrame has moved from being a column in its own right to the index column.
We can pull out one or more rows by referencing the appropriate index element(s) and the columns we wish to extract:
course_df.ix[ ['TU100','TM351'], ['title','level'] ]
If we identify a particular column, we can use an index value to pull out the value from the correspondingly indexed row and the chosen column:
course_df.level['TM351']
course_df[ 'title' ]['TM351']
We can also filter the rows of a DataFrame based on the vlaues of one or more columns. We will cover this powerful feature in more depth later in the course.
course_df[course_df.points==30]
You may recall that passing a column name in as the DataFrame key returns a Series containing the values of the column.
course_df['title']
Paraphrasing the author of pandas, Wes McKinney, writing in his book Python for Data Analysis, "[this apparent inconsistency] in syntax arose out of practicalityand nothing more". Pragmatic programming, as you might, FTW! [FTW is slang for "for the win", a positive exclamation in gamer culture.]
We can pull out the unique values contained with a column very easily by applying the unique()
function to the appropriate column:
course_df['points'].unique()
We can easily iterate through the items:
for item in course_df['points'].unique():
print(item)
To get a list rather than an array, make use of the tolist()
helper function:
course_df['points'].unique().tolist()
In order to perform certain operations on a column, the column needs to be correctly typed - that is, all elements need to be of a particular type.
You can check the type of each column in a dataframe using the .dtypes
attribute:
course_df.dtypes
Here we see that the level and title are classed as object types, but the points values have been identified as integers.
pandas' Series and DataFrame objects provide several more useful, and powerful, data manipulation methods than have been described here. You will meet many of them later in the course, as and when they are required.
If you would like to learn about pandas in more depth, a copy of the book written by pandas' original developer, Wes McKinney, is available to OU students as an ebook via the OU Library: Python for Data Analysis