In [1]:

cd ~/swc/pandas-data/

/home/zonca/swc/pandas-data

In [2]:

pwd

Out[2]:

u'/home/zonca/swc/pandas-data'

Introduction¶

Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occuring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem comes from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:

year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166

Objectives¶

Conduct variable assignment, looping, and conditionals in Python
Use an external Python library
Read tabular data from a file
Subset and perform analysis on data
Display simple graphs

Loading Data¶

In order to load the data, we need to import a library called Pandas that knows how to operate on tables of data.

In [3]:

import pandas

We can now use Pandas to read our data file.

In [4]:

pandas.read_csv('A1_mosquito_data.csv', sep=',')

Out[4]:

	year	temperature	rainfall	mosquitos
0	2001	80	157	150
1	2002	85	252	217
2	2003	86	154	153
3	2004	87	159	158
4	2005	74	292	243
5	2006	75	283	237
6	2007	80	214	190
7	2008	85	197	181
8	2009	74	231	200
9	2010	74	207	184

The read_csv() function belongs to the pandas library. In order to run it we need to tell Python that it is part of pandas and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.

When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.

The first column on the left is the index column, a way for pandas to label each row, by default it just uses integers. However in this case we have a better way of indexing the data, we can use the year column. We can use the index_col keyword argument to read_csv() to specify the index column:

In [5]:

pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')

Out[5]:

	temperature	rainfall	mosquitos
year
2001	80	157	150
2002	85	252	217
2003	86	154	153
2004	87	159	158
2005	74	292	243
2006	75	283	237
2007	80	214	190
2008	85	197	181
2009	74	231	200
2010	74	207	184

Our call to pandas.read_csv() read data into memory, but didn't save it anywhere. To do that, we need to assign the array to a variable.

In [6]:

data = pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')

This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:

In [7]:

print(data)

      temperature  rainfall  mosquitos
year                                  
2001           80       157        150
2002           85       252        217
2003           86       154        153
2004           87       159        158
2005           74       292        243
2006           75       283        237
2007           80       214        190
2008           85       197        181
2009           74       231        200
2010           74       207        184

print data tells Python to display the text. Alternatively we could just include data as the last value in a code cell:

In [8]:

data

Out[8]:

	temperature	rainfall	mosquitos
year
2001	80	157	150
2002	85	252	217
2003	86	154	153
2004	87	159	158
2005	74	292	243
2006	75	283	237
2007	80	214	190
2008	85	197	181
2009	74	231	200
2010	74	207	184

This tells the IPython Notebook to display the data object, which is why we see a pretty formated table.

Manipulating data¶

Once we have imported the data we can start doing things with it. First, let's ask what type of thing data refers to:

In [9]:

print(type(data))

<class 'pandas.core.frame.DataFrame'>

The data is stored in a data structure called a DataFrame. There are other kinds of data structures that are also commonly used in scientific computing including Numpy arrays, and Numpy matrices, which can be used for doing linear algebra.

We can select an individual column of data using its name:

In [10]:

print(data['temperature'])

year
2001    80
2002    85
2003    86
2004    87
2005    74
2006    75
2007    80
2008    85
2009    74
2010    74
Name: temperature, dtype: int64

Or we can select several columns of data at once:

In [11]:

print(data[['rainfall', 'temperature']])

      rainfall  temperature
year                       
2001       157           80
2002       252           85
2003       154           86
2004       159           87
2005       292           74
2006       283           75
2007       214           80
2008       197           85
2009       231           74
2010       207           74

In [12]:

data.index

Out[12]:

Int64Index([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010], dtype='int64')

We can also select subsets of rows using slicing. Say we just want the first two rows of data:

In [13]:

print(data[:2])

      temperature  rainfall  mosquitos
year                                  
2001           80       157        150
2002           85       252        217

However most of the time we want to index using the index!

In [14]:

data.ix[2004:2008]

Out[14]:

	temperature	rainfall	mosquitos
year
2004	87	159	158
2005	74	292	243
2006	75	283	237
2007	80	214	190
2008	85	197	181

In [15]:

data.ix[2004:2008]["temperature"]

Out[15]:

year
2004    87
2005    74
2006    75
2007    80
2008    85
Name: temperature, dtype: int64

In [16]:

data.ix[2004]

Out[16]:

temperature     87
rainfall       159
mosquitos      158
Name: 2004, dtype: int64

We can also subset the data based on the value of other rows:

In [17]:

data['temperature'][data['mosquitos'] > 200]

Out[17]:

year
2002    85
2005    74
2006    75
Name: temperature, dtype: int64

Challenge¶

Print the mosquitos number when temperature is more than 75 degrees between 2005 and 2008

In [18]:

data['mosquitos'][data['temperature'] > 75].ix[2005:2008]

Out[18]:

year
2007    190
2008    181
Name: mosquitos, dtype: int64

In [19]:

data['mosquitos'].ix[2005:2008][data['temperature'] > 75]

Out[19]:

year
2007    190
2008    181
Name: mosquitos, dtype: int64

Data frames also know how to perform common mathematical operations on their values. If we want to find the average value for each variable, we can just ask the data frame for its mean values

In [20]:

print(data.mean())

temperature     80.0
rainfall       214.6
mosquitos      191.3
dtype: float64

Data frames have lots of useful methods:

In [21]:

print(data.max())

temperature     87
rainfall       292
mosquitos      243
dtype: int64

In [22]:

print(data['temperature'].min())

In [23]:

print(data['mosquitos'][1:3].std())

45.2548339959

Summary on how to use parentheses¶

In [24]:

abs(-1) # call function
data.max() # call method on an object

data[:3] # slicing by row
data["temperature"] # get a column of a data frame "slice a column"
dataslice = data.ix[2004:2008] # this is a special case, we are slicing ix

Challenge¶

Import the data from A2_mosquito_data.csv, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables.

Plotting¶

In [25]:

%matplotlib inline

In [26]:

from matplotlib import pyplot as plt

Now, let's make a simple plot showing how the number of mosquitos varies over time. We'll use the site you've been doing exercises with since it has a longer time-series.

In [27]:

data = pandas.read_csv('A2_mosquito_data.csv', index_col="year")
data['mosquitos'].plot()

Out[27]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f76348b1190>

Pandas can automatically plot a complete table:

In [28]:

data.plot(figsize=(13,8), marker='s')

Out[28]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7634811210>

Fitting a line through the data¶

In [29]:

import statsmodels.api as sm

regr_results = sm.OLS.from_formula('mosquitos ~ temperature', data).fit()
regr_results.summary()

Out[29]:

OLS Regression Results
Dep. Variable:	mosquitos	R-squared:	0.004
Model:	OLS	Adj. R-squared:	-0.017
Method:	Least Squares	F-statistic:	0.1735
Date:	Sat, 21 Feb 2015	Prob (F-statistic):	0.679
Time:	21:01:26	Log-Likelihood:	-259.30
No. Observations:	51	AIC:	522.6
Df Residuals:	49	BIC:	526.5
Df Model:	1

	coef	std err	t	P>\|t\|	[95.0% Conf. Int.]
Intercept	154.4695	74.077	2.085	0.042	5.605 303.334
temperature	0.3827	0.919	0.417	0.679	-1.464 2.229

Omnibus:	8.529	Durbin-Watson:	1.854
Prob(Omnibus):	0.014	Jarque-Bera (JB):	2.660
Skew:	-0.072	Prob(JB):	0.265
Kurtosis:	1.891	Cond. No.	1.07e+03

In [30]:

line_fit = regr_results.params['Intercept'] + regr_results.params['temperature'] * data['temperature']
plt.plot(data['temperature'], data['mosquitos'], '.', label="data")
plt.plot(data['temperature'], line_fit, 'red', label="fit")
plt.xlabel('temperature')
plt.ylabel('mosquitos')
plt.legend()

Out[30]:

<matplotlib.legend.Legend at 0x7f76208a0c90>

Challenge¶

Perform the line fit and the plot for rainfall instead.

In [30]: