cd ~/swc/pandas-data/
/home/zonca/swc/pandas-data
pwd
u'/home/zonca/swc/pandas-data'
Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occuring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem comes from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166
In order to load the data, we need to import a library called Pandas that knows how to operate on tables of data.
import pandas
We can now use Pandas to read our data file.
pandas.read_csv('A1_mosquito_data.csv', sep=',')
year | temperature | rainfall | mosquitos | |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
The read_csv()
function belongs to the pandas
library. In order to run it we need to tell Python that it is part of pandas
and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.
When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.
The first column on the left is the index
column, a way for pandas
to label each row, by default it just uses integers.
However in this case we have a better way of indexing the data, we can use the year
column. We can use the index_col
keyword argument to read_csv()
to specify the index column:
pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')
temperature | rainfall | mosquitos | |
---|---|---|---|
year | |||
2001 | 80 | 157 | 150 |
2002 | 85 | 252 | 217 |
2003 | 86 | 154 | 153 |
2004 | 87 | 159 | 158 |
2005 | 74 | 292 | 243 |
2006 | 75 | 283 | 237 |
2007 | 80 | 214 | 190 |
2008 | 85 | 197 | 181 |
2009 | 74 | 231 | 200 |
2010 | 74 | 207 | 184 |
Our call to pandas.read_csv()
read data into memory, but didn't save it anywhere. To do that, we need to assign the array to a variable.
data = pandas.read_csv('A1_mosquito_data.csv', sep=',', index_col='year')
This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:
print(data)
temperature rainfall mosquitos year 2001 80 157 150 2002 85 252 217 2003 86 154 153 2004 87 159 158 2005 74 292 243 2006 75 283 237 2007 80 214 190 2008 85 197 181 2009 74 231 200 2010 74 207 184
print data
tells Python to display the text. Alternatively we could just include data
as the last value in a code cell:
data
temperature | rainfall | mosquitos | |
---|---|---|---|
year | |||
2001 | 80 | 157 | 150 |
2002 | 85 | 252 | 217 |
2003 | 86 | 154 | 153 |
2004 | 87 | 159 | 158 |
2005 | 74 | 292 | 243 |
2006 | 75 | 283 | 237 |
2007 | 80 | 214 | 190 |
2008 | 85 | 197 | 181 |
2009 | 74 | 231 | 200 |
2010 | 74 | 207 | 184 |
This tells the IPython Notebook to display the data
object, which is why we see a pretty formated table.
Once we have imported the data we can start doing things with it. First, let's ask what type of thing data
refers to:
print(type(data))
<class 'pandas.core.frame.DataFrame'>
The data is stored in a data structure called a DataFrame. There are other kinds of data structures that are also commonly used in scientific computing including Numpy arrays, and Numpy matrices, which can be used for doing linear algebra.
We can select an individual column of data using its name:
print(data['temperature'])
year 2001 80 2002 85 2003 86 2004 87 2005 74 2006 75 2007 80 2008 85 2009 74 2010 74 Name: temperature, dtype: int64
Or we can select several columns of data at once:
print(data[['rainfall', 'temperature']])
rainfall temperature year 2001 157 80 2002 252 85 2003 154 86 2004 159 87 2005 292 74 2006 283 75 2007 214 80 2008 197 85 2009 231 74 2010 207 74
data.index
Int64Index([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010], dtype='int64')
We can also select subsets of rows using slicing. Say we just want the first two rows of data:
print(data[:2])
temperature rainfall mosquitos year 2001 80 157 150 2002 85 252 217
However most of the time we want to index using the index
!
data.ix[2004:2008]
temperature | rainfall | mosquitos | |
---|---|---|---|
year | |||
2004 | 87 | 159 | 158 |
2005 | 74 | 292 | 243 |
2006 | 75 | 283 | 237 |
2007 | 80 | 214 | 190 |
2008 | 85 | 197 | 181 |
data.ix[2004:2008]["temperature"]
year 2004 87 2005 74 2006 75 2007 80 2008 85 Name: temperature, dtype: int64
data.ix[2004]
temperature 87 rainfall 159 mosquitos 158 Name: 2004, dtype: int64
We can also subset the data based on the value of other rows:
data['temperature'][data['mosquitos'] > 200]
year 2002 85 2005 74 2006 75 Name: temperature, dtype: int64
Print the mosquitos number when temperature is more than 75 degrees between 2005 and 2008
data['mosquitos'][data['temperature'] > 75].ix[2005:2008]
year 2007 190 2008 181 Name: mosquitos, dtype: int64
data['mosquitos'].ix[2005:2008][data['temperature'] > 75]
year 2007 190 2008 181 Name: mosquitos, dtype: int64
Data frames also know how to perform common mathematical operations on their values. If we want to find the average value for each variable, we can just ask the data frame for its mean values
print(data.mean())
temperature 80.0 rainfall 214.6 mosquitos 191.3 dtype: float64
Data frames have lots of useful methods:
print(data.max())
temperature 87 rainfall 292 mosquitos 243 dtype: int64
print(data['temperature'].min())
74
print(data['mosquitos'][1:3].std())
45.2548339959
abs(-1) # call function
data.max() # call method on an object
data[:3] # slicing by row
data["temperature"] # get a column of a data frame "slice a column"
dataslice = data.ix[2004:2008] # this is a special case, we are slicing ix
Import the data from A2_mosquito_data.csv
, create a new variable that holds a data frame with only the weather data, and print the means and standard deviations for the weather variables.
%matplotlib inline
from matplotlib import pyplot as plt
Now, let's make a simple plot showing how the number of mosquitos varies over time. We'll use the site you've been doing exercises with since it has a longer time-series.
data = pandas.read_csv('A2_mosquito_data.csv', index_col="year")
data['mosquitos'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f76348b1190>
Pandas can automatically plot a complete table:
data.plot(figsize=(13,8), marker='s')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7634811210>
import statsmodels.api as sm
regr_results = sm.OLS.from_formula('mosquitos ~ temperature', data).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.004 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.017 |
Method: | Least Squares | F-statistic: | 0.1735 |
Date: | Sat, 21 Feb 2015 | Prob (F-statistic): | 0.679 |
Time: | 21:01:26 | Log-Likelihood: | -259.30 |
No. Observations: | 51 | AIC: | 522.6 |
Df Residuals: | 49 | BIC: | 526.5 |
Df Model: | 1 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 154.4695 | 74.077 | 2.085 | 0.042 | 5.605 303.334 |
temperature | 0.3827 | 0.919 | 0.417 | 0.679 | -1.464 2.229 |
Omnibus: | 8.529 | Durbin-Watson: | 1.854 |
---|---|---|---|
Prob(Omnibus): | 0.014 | Jarque-Bera (JB): | 2.660 |
Skew: | -0.072 | Prob(JB): | 0.265 |
Kurtosis: | 1.891 | Cond. No. | 1.07e+03 |
line_fit = regr_results.params['Intercept'] + regr_results.params['temperature'] * data['temperature']
plt.plot(data['temperature'], data['mosquitos'], '.', label="data")
plt.plot(data['temperature'], line_fit, 'red', label="fit")
plt.xlabel('temperature')
plt.ylabel('mosquitos')
plt.legend()
<matplotlib.legend.Legend at 0x7f76208a0c90>
Perform the line fit and the plot for rainfall
instead.