In [1]:
cd ~/swc/pandas-data/
/home/zonca/swc/pandas-data
In [2]:
pwd
Out[2]:
u'/home/zonca/swc/pandas-data'

Introduction

Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occuring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem comes from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:

year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166

Objectives

  • Conduct variable assignment, looping, and conditionals in Python
  • Use an external Python library
  • Read tabular data from a file
  • Subset and perform analysis on data
  • Display simple graphs

Loading Data

In order to load the data, we need to import a library called Pandas that knows how to operate on tables of data.

In [3]:
import pandas

We can now use Pandas to read our data file.

In [4]:
pandas.read_csv('A1_mosquito_data.csv', sep=',')
Out[4]:
year temperature rainfall mosquitos
0 2001 80 157 150
1 2002 85 252 217
2 2003 86 154 153
3 2004 87 159 158
4 2005 74 292 243
5 2006 75 283 237
6 2007 80 214 190
7 2008 85 197 181
8 2009 74 231 200
9 2010 74 207 184

The read_csv() function belongs to the pandas library. In order to run it we need to tell Python that it is part of pandas and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.

When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.

The first column on the left is the index column, a way for pandas to label each row, by default it just uses integers. However in this case we have a better way of indexing the data, we can use the year column. We can use the index_col keyword argument to read_csv() to specify the index column:

In [5]:
pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')
Out[5]:
temperature rainfall mosquitos
year
2001 80 157 150
2002 85 252 217
2003 86 154 153
2004 87 159 158
2005 74 292 243
2006 75 283 237
2007 80 214 190
2008 85 197 181
2009 74 231 200
2010 74 207 184

Our call to pandas.read_csv() read data into memory, but didn't save it anywhere. To do that, we need to assign the array to a variable.

In [6]:
data = pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')

This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:

In [7]:
print(data)
      temperature  rainfall  mosquitos
year                                  
2001           80       157        150
2002           85       252        217
2003           86       154        153
2004           87       159        158
2005           74       292        243
2006           75       283        237
2007           80       214        190
2008           85       197        181
2009           74       231        200
2010           74       207        184

print data tells Python to display the text. Alternatively we could just include data as the last value in a code cell:

In [8]:
data
Out[8]:
temperature rainfall mosquitos
year
2001 80 157 150
2002 85 252 217
2003 86 154 153
2004 87 159 158
2005 74 292 243
2006 75 283 237
2007 80 214 190
2008 85 197 181
2009 74 231 200
2010 74 207 184

This tells the IPython Notebook to display the data object, which is why we see a pretty formated table.

Manipulating data

Once we have imported the data we can start doing things with it. First, let's ask what type of thing data refers to:

In [9]:
print(type(data))
<class 'pandas.core.frame.DataFrame'>

The data is stored in a data structure called a DataFrame. There are other kinds of data structures that are also commonly used in scientific computing including Numpy arrays, and Numpy matrices, which can be used for doing linear algebra.

We can select an individual column of data using its name:

In [10]:
print(data['temperature'])
year
2001    80
2002    85
2003    86
2004    87
2005    74
2006    75
2007    80
2008    85
2009    74
2010    74
Name: temperature, dtype: int64

Or we can select several columns of data at once:

In [11]:
print(data[['rainfall', 'temperature']])
      rainfall  temperature
year                       
2001       157           80
2002       252           85
2003       154           86
2004       159           87
2005       292           74
2006       283           75
2007       214           80
2008       197           85
2009       231           74
2010       207           74
In [12]:
data.index
Out[12]:
Int64Index([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010], dtype='int64')

We can also select subsets of rows using slicing. Say we just want the first two rows of data:

In [13]:
print(data[:2])
      temperature  rainfall  mosquitos
year                                  
2001           80       157        150
2002           85       252        217

However most of the time we want to index using the index!

In [14]:
data.ix[2004:2008]
Out[14]:
temperature rainfall mosquitos
year
2004 87 159 158
2005 74 292 243
2006 75 283 237
2007 80 214 190
2008 85 197 181
In [15]:
data.ix[2004:2008]["temperature"]
Out[15]:
year
2004    87
2005    74
2006    75
2007    80
2008    85
Name: temperature, dtype: int64
In [16]:
data.ix[2004]
Out[16]:
temperature     87
rainfall       159
mosquitos      158
Name: 2004, dtype: int64

We can also subset the data based on the value of other rows:

In [17]:
data['temperature'][data['mosquitos'] > 200]
Out[17]:
year
2002    85
2005    74
2006    75
Name: temperature, dtype: int64

Challenge

Print the mosquitos number when temperature is more than 75 degrees between 2005 and 2008

In [18]:
data['mosquitos'][data['temperature'] > 75].ix[2005:2008]
Out[18]:
year
2007    190
2008    181
Name: mosquitos, dtype: int64
In [19]:
data['mosquitos'].ix[2005:2008][data['temperature'] > 75]
Out[19]:
year
2007    190
2008    181
Name: mosquitos, dtype: int64

Data frames also know how to perform common mathematical operations on their values. If we want to find the average value for each variable, we can just ask the data frame for its mean values

In [20]:
print(data.mean())
temperature     80.0
rainfall       214.6
mosquitos      191.3
dtype: float64

Data frames have lots of useful methods:

In [21]:
print(data.max())
temperature     87
rainfall       292
mosquitos      243
dtype: int64
In [22]:
print(data['temperature'].min())
74
In [23]:
print(data['mosquitos'][1:3].std())
45.2548339959

Summary on how to use parentheses

In [24]:
abs(-1) # call function
data.max() # call method on an object

data[:3] # slicing by row
data["temperature"] # get a column of a data frame "slice a column"
dataslice = data.ix[2004:2008] # this is a special case, we are slicing ix

Challenge

Import the data from A2_mosquito_data.csv, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables.

Plotting

In [25]:
%matplotlib inline
In [26]:
from matplotlib import pyplot as plt

Now, let's make a simple plot showing how the number of mosquitos varies over time. We'll use the site you've been doing exercises with since it has a longer time-series.

In [27]:
data = pandas.read_csv('A2_mosquito_data.csv', index_col="year")
data['mosquitos'].plot()
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f76348b1190>

Pandas can automatically plot a complete table:

In [28]:
data.plot(figsize=(13,8), marker='s')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7634811210>

Fitting a line through the data

In [29]:
import statsmodels.api as sm

regr_results = sm.OLS.from_formula('mosquitos ~ temperature', data).fit()
regr_results.summary()
Out[29]:
OLS Regression Results
Dep. Variable: mosquitos R-squared: 0.004
Model: OLS Adj. R-squared: -0.017
Method: Least Squares F-statistic: 0.1735
Date: Sat, 21 Feb 2015 Prob (F-statistic): 0.679
Time: 21:01:26 Log-Likelihood: -259.30
No. Observations: 51 AIC: 522.6
Df Residuals: 49 BIC: 526.5
Df Model: 1
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 154.4695 74.077 2.085 0.042 5.605 303.334
temperature 0.3827 0.919 0.417 0.679 -1.464 2.229
Omnibus: 8.529 Durbin-Watson: 1.854
Prob(Omnibus): 0.014 Jarque-Bera (JB): 2.660
Skew: -0.072 Prob(JB): 0.265
Kurtosis: 1.891 Cond. No. 1.07e+03
In [30]:
line_fit = regr_results.params['Intercept'] + regr_results.params['temperature'] * data['temperature']
plt.plot(data['temperature'], data['mosquitos'], '.', label="data")
plt.plot(data['temperature'], line_fit, 'red', label="fit")
plt.xlabel('temperature')
plt.ylabel('mosquitos')
plt.legend()
Out[30]:
<matplotlib.legend.Legend at 0x7f76208a0c90>

Challenge

Perform the line fit and the plot for rainfall instead.

In [30]: