This material assumes that you have programmed before. This first lecture provides a quick introduction to programming in Python for those who either haven't used Python before or need a quick refresher.
Let's start with a hypothetical problem we want to solve. We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites, we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166
There are many ways to read in data. The most basic reads the whole file in as a single string.
ofile = open('mosquito_data_A1.csv', 'r')
print ofile.read()
year,temperature,rainfall,mosquitos
2001,87,222,198
2002,72,103,105
2003,77,176,166
2004,89,236,210
2005,88,283,242
2006,89,151,147
2007,71,121,117
2008,88,267,232
2009,85,211,191
2010,75,101,106
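Since read() hands back the whole file as one string, a common next step is to split it into lines and fields by hand. The following is a minimal sketch, not part of the original lecture. It writes the first few rows shown above to a temporary file (the name mosquito_sample.csv is made up for this example) so that it can run on its own, and it writes print with parentheses so that it also runs on Python 3.

```python
import os
import tempfile

# Hypothetical sample file holding the first rows shown above.
sample_text = (
    "year,temperature,rainfall,mosquitos\n"
    "2001,87,222,198\n"
    "2002,72,103,105\n"
    "2003,77,176,166\n"
)
sample_path = os.path.join(tempfile.mkdtemp(), "mosquito_sample.csv")
with open(sample_path, "w") as outfile:
    outfile.write(sample_text)

# Read the file line by line; "with" closes the file for us.
rows = []
with open(sample_path, "r") as infile:
    # The first line holds the column names.
    header = infile.readline().strip().split(",")
    # Every remaining line holds one year of integer values.
    for line in infile:
        fields = line.strip().split(",")
        rows.append([int(field) for field in fields])

print(header)
print(rows)
```

This works, but every new file format needs its own splitting and converting code, which is the motivation for using a library instead.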
An easier way to read files, especially ones in a uniform format, is to use a package called numpy, which incidentally is also used for array and matrix manipulation. Python has some built-in abilities, but much of the flexibility of Python comes from the ability to load different packages. To load a package:
import numpy as np
We will now use numpy to read our file. Many text reading options exist including pandas and astropy.io.ascii.
np.genfromtxt('mosquito_data_A1.csv', unpack = True, skiprows = 1, delimiter = ',')
array([[ 2001., 2002., 2003., 2004., 2005., 2006., 2007., 2008., 2009., 2010.],
       [ 87., 72., 77., 89., 88., 89., 71., 88., 85., 75.],
       [ 222., 103., 176., 236., 283., 151., 121., 267., 211., 101.],
       [ 198., 105., 166., 210., 242., 147., 117., 232., 191., 106.]])
The genfromtxt function belongs to the numpy library. In order to run it we need to tell Python that it is part of numpy (which we imported under the short name np), and we do this using the dot notation, which is used everywhere in Python to refer to parts of larger things.
When we are finished typing and press Shift+Enter, the notebook runs our command and shows us its output. In this case, the output is the data we just loaded.
We gave the genfromtxt function a few different pieces of information.
First we gave it the name of the file to read. Notice this is the only information that does not have a word = X format. That means this is required and has no default value.
Next we said unpack = True. This means we want each column to go to a separate array, not each line.
skiprows = 1: there is text in the first row, so we want to skip it. In recent NumPy releases this argument is named skip_header.
delimiter = ',': since this is a CSV file, the values in each row are separated by commas. By default numpy.genfromtxt splits on whitespace.
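To see what unpack changes, the sketch below reads the same few rows with and without it and compares the shapes. It is an illustration rather than the lecture's own code: the rows are passed as a list of strings instead of a file name (genfromtxt accepts either), and the header is skipped with skip_header, which is the spelling that recent NumPy releases use.

```python
import numpy as np

# First rows of the file, supplied as a list of lines instead of a file.
lines = [
    "year,temperature,rainfall,mosquitos",
    "2001,87,222,198",
    "2002,72,103,105",
    "2003,77,176,166",
]

# Without unpack: one row of the result per line of the file.
by_row = np.genfromtxt(lines, delimiter=",", skip_header=1)

# unpack=True transposes the result: one row of the result per column.
by_column = np.genfromtxt(lines, delimiter=",", skip_header=1, unpack=True)

print(by_row.shape)     # (3, 4): 3 years, 4 values each
print(by_column.shape)  # (4, 3): 4 columns, 3 years each
print(by_column[1])     # the temperature column
```

With unpack = True the first index selects a column, which is what makes it possible to assign each column to its own variable later on.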
Our call to numpy.genfromtxt read data into memory, but didn't save it anywhere. To do that, we need to assign the array to a variable. In Python we use = to assign a new value to a variable like this:
d = np.genfromtxt('mosquito_data_A1.csv', skiprows = 1, delimiter = ',', unpack = True)
This statement doesn't produce any output because assignment doesn't display anything. If we want to check that our data has been loaded, we can print the variable's value:
print d
[[ 2001. 2002. 2003. 2004. 2005. 2006. 2007. 2008. 2009. 2010.]
 [ 87. 72. 77. 89. 88. 89. 71. 88. 85. 75.]
 [ 222. 103. 176. 236. 283. 151. 121. 267. 211. 101.]
 [ 198. 105. 166. 210. 242. 147. 117. 232. 191. 106.]]
print d tells Python to display the value of d. Alternatively we could just include d as the last value in a code cell.
A word on variable names. The variable name I've chosen for the data doesn't really communicate any information to anyone about what it's holding, which means that when I come back to my code next month to change something I'm going to have a more difficult time understanding what the code is actually doing. This brings us to one of our first major lessons for the morning, which is that in order to understand what our code is doing so that we can quickly make changes in the future, we need to write code for people, not computers, and an important first step is to use meaningful variable names.
data = np.genfromtxt('mosquito_data_A1.csv', skiprows = 1, delimiter = ',', unpack = True)
Let's deconstruct that statement:
Let's talk a little more about unpacking
a, b, c = [1, 2, 3]
print a
print b
1
2
Let's unpack data
year, temperature, rainfall, mosquitos = data
print year
[ 2001. 2002. 2003. 2004. 2005. 2006. 2007. 2008. 2009. 2010.]
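Unpacking only works when the number of names on the left matches the number of items on the right. A small sketch of both cases, using made-up values and print written with parentheses so that it also runs on Python 3:

```python
# Three names, three values: each name gets one value.
first, second, third = [10, 20, 30]
print(second)

# Two names, three values: Python raises a ValueError.
try:
    first, second = [10, 20, 30]
except ValueError as error:
    print("Unpacking failed: " + str(error))
    unpack_failed = True
else:
    unpack_failed = False
```

This is why the data file has to have exactly four columns for the four-name unpacking above to succeed.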
Once we have imported the data we can start doing things with it. First, let's ask what type of thing temperature refers to:
print type(temperature)
<type 'numpy.ndarray'>
type() asks the variable temperature what type it is (float, integer, list, dictionary, ...). temperature is a numpy array. What can you do to a numpy array?
print temperature * 4
[ 348. 288. 308. 356. 352. 356. 284. 352. 340. 300.]
print temperature + temperature
[ 174. 144. 154. 178. 176. 178. 142. 176. 170. 150.]
In general, operations on an array are done element-wise. There is also a matrix object if you need to do linear algebra. Another very common type is a list.
my_list = [1, 2, 3, 'a', 'c']
print type(my_list)
<type 'list'>
Let's try the same operations on a list
print my_list * 4
[1, 2, 3, 'a', 'c', 1, 2, 3, 'a', 'c', 1, 2, 3, 'a', 'c', 1, 2, 3, 'a', 'c']
SURPRISE! This repeated the list 4 times.
print my_list + [4]
[1, 2, 3, 'a', 'c', 4]
Here the + joined the two lists, so it acted as an append.
Differences between list and arrays:
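One way to see the differences is to apply the same operators to a list and to an array built from it. A short sketch with made-up values (print is written with parentheses so that it also runs on Python 3):

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

# * repeats a list but multiplies each element of an array.
print(values_list * 2)    # [1, 2, 3, 1, 2, 3]
print(values_array * 2)   # [2 4 6]

# + joins two lists but adds arrays element by element.
print(values_list + values_list)    # [1, 2, 3, 1, 2, 3]
print(values_array + values_array)  # [2 4 6]

# A list can mix types; an array converts everything to one type.
mixed_list = [1, 2, 'a']
mixed_array = np.array(mixed_list)
print(mixed_array)  # every element has become a string
```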
Notice I cheated a little. What if we just try to add a number?
print my_list + 4
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-6008aad3738f> in <module>()
----> 1 print my_list + 4

TypeError: can only concatenate list (not "int") to list
We got an error message.
Notice how helpful this error message is:
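The last line of the traceback names the kind of error (TypeError) and explains what went wrong. The sketch below, which is an addition to the lecture, triggers the same error on purpose, captures the message, and then shows two ways to get what was probably intended.

```python
my_list = [1, 2, 3, 'a', 'c']

# Adding an integer to a list is not defined, so Python raises TypeError.
try:
    my_list + 4
except TypeError as error:
    error_name = type(error).__name__
    print(error_name + ": " + str(error))

# Wrapping the number in a list makes + a list concatenation.
extended = my_list + [4]
print(extended)

# Adding 4 to each number needs an explicit loop (here a list comprehension
# over the first three items, which are the numbers).
numbers_plus_four = [item + 4 for item in my_list[:3]]
print(numbers_plus_four)
```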
So far we have printed the whole array, but what if we only want part of an array? We index:
print year
print year[0]
print year[2:5]
Important things to notice:
print the first 3 elements
print year[:3]
print from the 3rd element to the end
print year[2:]
print the last 3 elements
print year[-3:]
This is awesome because you don't have to know how long your array is.
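The slices above follow two rules: counting starts at 0, and a slice start:stop includes start but ends just before stop. A short sketch using the years from the data file (a plain list is used so that the example needs no imports; one-dimensional arrays slice the same way):

```python
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

print(years[0])    # first element: 2001
print(years[2:5])  # elements 2, 3 and 4: [2003, 2004, 2005]
print(years[:3])   # first three: [2001, 2002, 2003]
print(years[2:])   # third element to the end
print(years[-3:])  # last three: [2008, 2009, 2010]

# The length of a slice is stop - start.
print(len(years[2:5]))  # 3
```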
Python is an object oriented language which means that everything is an object. Objects know things about themselves and can sometimes perform simple operations on themselves. What does that mean in practice?
x = 2
dir(x)
Here x is just an integer, but it knows how to add itself to other objects (__add__), and you can ask it for its bit_length, conjugate, etc. So it is not just an integer, it is an object.
print x.bit_length()
You will notice that some of these methods and attributes have __ or _ and some do not. In general the ones with nothing are for common use and the ones with _ and __ you won't need unless you are trying to do something really funky - like redefine what integer addition means.
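To separate the everyday names from the underscore ones, the output of dir() can be filtered. A small sketch, not from the original lecture, with print written with parentheses so that it also runs on Python 3:

```python
x = 2

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(x) if not name.startswith('_')]
print(public_names)

# Methods are called with parentheses; attributes are read without them.
print(x.bit_length())  # 2, because the number 2 is 10 in binary
print(x.real)          # the real part of an integer is the integer itself

# The double-underscore method __add__ is what runs when we write x + 3.
print(x.__add__(3) == x + 3)  # True
```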
So what can you do to arrays?
print temperature.mean()
Arrays have lots of useful methods. In addition to the dir() function, in IPython you can also type the variable name followed by a period and press Tab to list them:
data.<Tab>
print year.min()
You can even operate on a piece of an array
print mosquitos[1:3].std()
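Here are a few of these methods applied to the temperature and mosquito columns printed earlier. This is a sketch that types the values in directly so that it runs without the data file. One detail worth knowing is that std() divides by the number of values by default (the population standard deviation); NumPy's ddof argument switches to the sample version.

```python
import numpy as np

temperature = np.array([87., 72., 77., 89., 88., 89., 71., 88., 85., 75.])
mosquitos = np.array([198., 105., 166., 210., 242., 147., 117., 232., 191., 106.])

print(temperature.mean())  # average temperature
print(temperature.min())   # lowest temperature
print(temperature.max())   # highest temperature

# Methods also work on a slice: elements 1 and 2 of mosquitos.
print(mosquitos[1:3].std())        # divides by n
print(mosquitos[1:3].std(ddof=1))  # divides by n - 1
```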
Import the data from mosquito_data_A2.csv, create new variables that hold the arrays, and print the means and standard deviations for the weather variables (rainfall and temperature).
year, temperature, rainfall, mosquitos = np.genfromtxt('mosquito_data_A2.csv', skiprows = 1, delimiter = ',', unpack = True)
print temperature.mean()
print temperature.std()
print rainfall.mean()
print rainfall.std()
Once we have some data we often want to be able to loop over it to perform the same operation repeatedly.
A for loop in Python takes the general form
for item in list:
    do_something
So if we want to loop over the temperatures and print out their values in degrees Celsius (instead of Fahrenheit) we can use:
for temp_in_f in temperature:
    temp_in_c = (temp_in_f - 32) * 5 / 9.0
    print temp_in_c
That looks good, but why did we use 9.0 instead of 9? The reason is that computers store integers and numbers with decimals as different types: integers and floating point numbers (or floats). Addition, subtraction and multiplication work on both as we'd expect, but division works differently. If we divide one integer by another, we get the quotient without the remainder:
print '10/3 is:', 10 / 3
If either part of the division is a float, on the other hand, the computer creates a floating-point answer:
print '10/3.0 is:', 10 / 3.0
The computer does this for historical reasons: integer operations were much faster on early machines, and this behavior is actually useful in a lot of situations. However, it's still confusing, so in Python 3 the / operator always produces a floating-point answer, and // is used for integer division. We're still using Python 2.7 in this class, so if we want 5/9 to give us the right answer, we have to write it as 5.0/9, 5/9.0, or some other variation.
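When the numbers are stored in variables we cannot simply add .0 to them. Two tools that behave the same way in Python 2.7 and Python 3 are float(), which converts a value before dividing, and //, which asks explicitly for integer division. A short sketch (the variable names are made up for this example):

```python
numerator = 5
denominator = 9

# float() converts one operand, so / gives a floating-point answer.
ratio = float(numerator) / denominator
print(ratio)

# // always gives the quotient without the remainder.
print(10 // 3)  # 3

# % gives the remainder that // throws away.
print(10 % 3)   # 1

# The Fahrenheit-to-Celsius conversion with integer inputs.
temp_in_f = 212
temp_in_c = (temp_in_f - 32) * float(numerator) / denominator
print(temp_in_c)  # 100.0
```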
The other standard thing we need to know how to do in Python is conditionals, or if/then/else statements. In Python the basic syntax is:
if condition:
    do_something
So if we want to check whether the first temperature is greater than 80 degrees we would use:
if temperature[0] > 80:
    print "The temperature is greater than 80"
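To apply the same test to every temperature, the if statement can be placed inside a for loop. A short sketch, with the temperature values from the data file typed in as a plain list and print written with parentheses so that it also runs on Python 3:

```python
temperatures = [87, 72, 77, 89, 88, 89, 71, 88, 85, 75]

hot_temperatures = []
for temp in temperatures:
    # Only the body of the if runs for temperatures above 80.
    if temp > 80:
        print(temp)
        hot_temperatures.append(temp)

print(len(hot_temperatures))  # 6 of the 10 values are above 80
```

Note that the body of the if is indented one level further than the body of the for, which is how Python knows which statements belong to which block.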
We can also use == for equality, <= for less than or equal to, >= for greater than or equal to, and != for not equal to.
Additional conditions can be handled using elif and else:
if temperature[0] < 87:
    print "The temperature is < 87"
elif temperature[0] > 87:
    print "The temperature is > 87"
else:
    print "The temperature is equal to 87"
Import the data from mosquito_data_A2.csv and loop over the temperature values. For each temperature print out whether it is greater than the mean, less than the mean, or equal to the mean.
year, temperature, rainfall, mosquitos = np.genfromtxt('mosquito_data_A2.csv', skiprows = 1, delimiter = ',', unpack = True)
for temp in temperature:
    if temp > temperature.mean():
        print 'temp is greater than mean'
    elif temp < temperature.mean():
        print 'temp is less than mean'
    else:
        print 'temp is equal to mean'
Use import libraryname to load a library.
Use the numpy library to work with data tables in Python.
Use variable = value to assign a value to a variable.
Use print something to display the value of something.
Use array[start:stop] to select a range of elements from an array.
Use array.mean() and array.min() to calculate simple statistics.
Use for x in list: to loop over values.
Use if condition: to make conditional decisions.
Now that we've covered some of the basic syntax and libraries in Python we can start to tackle our data analysis problem. We are interested in understanding the relationship between the weather and the number of mosquitos so that we can plan mosquito control measures. Since we want to apply these mosquito control measures at a number of different sites we need to understand how the relationship varies across sites. Remember that we have a series of CSV files with each file containing the data for a single location.
When approaching computational tasks like this one it is typically best to start small, check each piece of code as you go, and make incremental changes. This helps avoid marathon debugging sessions because it's much easier to debug one small piece of the code at a time than to write 100 lines of code and then try to figure out all of the different bugs in it.
Let's start by reading in the data from a single file and conducting a simple regression analysis on it.
year, temperature, rainfall, mosquitos = np.genfromtxt('mosquito_data_A1.csv', skiprows = 1, delimiter = ',', unpack = True)
What does our data look like?
from matplotlib import pyplot
In this case pyplot is a submodule of matplotlib. I know that everything I want to do is in pyplot, so instead of typing matplotlib.pyplot for everything, I can import it this way and just type pyplot.
pyplot.plot(temperature, mosquitos)
That looks pretty ugly. Let's make points instead of lines:
pyplot.cla()
pyplot.plot(temperature, mosquitos, 'o')
Ohhh, that looks like a straight line, let's see if we can fit it using the numpy polyfit function
fit_coeff = np.polyfit(temperature, mosquitos, 1)
Here I am giving polyfit the independent variable (temperature), the dependent variable (mosquitos), and the order of the polynomial to fit (here 1 since it's a line). This returns the coefficients of the fit, starting with the highest order.
print fit_coeff
Now we can plot the fit
fit_to_data = fit_coeff[0] * temperature + fit_coeff[1]
pyplot.plot(temperature, fit_to_data)
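The multiplication and addition used to build fit_to_data can also be written with np.polyval, which evaluates a polynomial from its coefficients. The sketch below is an addition to the lecture. It types in the temperature and mosquito values printed earlier so that it runs without the data file, and it skips the plotting step.

```python
import numpy as np

temperature = np.array([87., 72., 77., 89., 88., 89., 71., 88., 85., 75.])
mosquitos = np.array([198., 105., 166., 210., 242., 147., 117., 232., 191., 106.])

# Fit a first-order polynomial (a line):
# mosquitos = slope * temperature + intercept
fit_coeff = np.polyfit(temperature, mosquitos, 1)
slope, intercept = fit_coeff

# Two equivalent ways to evaluate the fitted line at each temperature.
fit_by_hand = slope * temperature + intercept
fit_by_polyval = np.polyval(fit_coeff, temperature)
print(np.allclose(fit_by_hand, fit_by_polyval))  # True

# The residuals show how far each year is from the fitted line.
residuals = mosquitos - fit_by_polyval
print(residuals)
```

np.polyval becomes more convenient than writing the expression by hand once the order of the polynomial is larger than 1.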
Interactive plotting:
So now I've done a few things interactively:
This could all go into a single function but that has a few disadvantages
Think of breaking your program into steps, just like we did above
Functions are the paragraphs of programming and the function name is the topic sentence. Just like variable names, function names should be very descriptive. Functions take the general form:
def function_name(arguments):
    do stuff
    return result
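As a concrete instance of this form, here is a small function with one required argument and one argument that has a default value. The function name and its arguments are made up for this illustration, and print is written with parentheses so that it also runs on Python 3.

```python
def scale_values(values, factor=2):
    '''
    Return a new list with every item of values multiplied by factor.
    '''
    scaled = []
    for value in values:
        scaled.append(value * factor)
    return scaled


print(scale_values([1, 2, 3]))             # default factor: [2, 4, 6]
print(scale_values([1, 2, 3], factor=10))  # [10, 20, 30]
```

Arguments without a default must be supplied by the caller, while arguments written as name = value in the def line can be left out.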
We'll start with some pseudo code
def read_csv_file():
    '''
    This code will read in a CSV file of year, temperature, rainfall, and number of mosquitos and return 4 arrays, one for each column
    '''
    pass
def convert_fahrenheit_to_celsius():
    '''
    This code will convert an array of temperatures from fahrenheit to celsius
    '''
    pass
def plot_data():
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pass
This is pseudo code. It doesn't do anything, but I now have a skeleton of my program and I just have to fill in the information.
I used triple quotes, aka a docstring, to describe the code. There is a help function that can be called on any function, which gives you information on that function. By default the help function displays the docstring.
help(plot_data)
help(np.polyfit)
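The text that help() shows is stored on the function itself in its __doc__ attribute, which gives a quick way to check a docstring from code. A minimal sketch using one of the skeleton functions above (the undocumented function is made up for contrast):

```python
def plot_data():
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pass


# help() displays the docstring; the same text is stored in __doc__.
print(plot_data.__doc__)


# A function without a docstring has __doc__ set to None.
def undocumented():
    pass


print(undocumented.__doc__ is None)  # True
```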
My main code will look like this:
read_csv_file()
convert_fahrenheit_to_celsius()
plot_data()
Now let's fill in the details, starting with the first function
def read_csv_file(filename):
    '''
    This code will read in a CSV file of year, temperature, rainfall, and number of mosquitos and return 4 arrays, one for each column
    '''
    year, temperature, rainfall, mosquitos = np.genfromtxt(filename, skiprows = 1, delimiter = ',', unpack = True)
    return year, temperature, rainfall, mosquitos
def convert_fahrenheit_to_celsius():
    '''
    This code will convert an array of temperatures from fahrenheit to celsius
    '''
    pass
def plot_data():
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pass
Let's test our first function
read_csv_file('mosquito_data_A1.csv')
That looks good, but it is just returning arrays that are not assigned to variables outside the function, so we can't use them. Let's assign them:
year, temperature, rainfall, mosquitos = read_csv_file('mosquito_data_A1.csv')
print year
Excellent, now let's convert to Celsius
def read_csv_file(filename):
    '''
    This code will read in a CSV file of year, temperature, rainfall, and number of mosquitos and return 4 arrays, one for each column
    '''
    year, temperature, rainfall, mosquitos = np.genfromtxt(filename, skiprows = 1, delimiter = ',', unpack = True)
    return year, temperature, rainfall, mosquitos
def convert_fahrenheit_to_celsius(temp_in_f):
    '''
    This code will convert an array of temperatures from fahrenheit to celsius
    '''
    temp_in_c = (temp_in_f - 32) * 5 / 9.0
    return temp_in_c
def plot_data():
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pass
year, temperature, rainfall, mosquitos = read_csv_file('mosquito_data_A1.csv')
temp_in_c = convert_fahrenheit_to_celsius(temperature)
print temperature, temp_in_c
Write the plotting function
def plot_data(x, y, symbol = 'o'):
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pyplot.plot(x, y, symbol)
Notice this function does not have to return anything
One final command: to save a figure from the command line (i.e. not using the GUI)
pyplot.savefig('temp_vs_mosquitos.pdf')
pyplot.close()
savefig will save to the current directory unless you give it a path. It is smart enough to figure out your file format from the extension of the file you save. You can also save png files, jpg files, ps files, etc.
Let's add this to our plot_data function
def plot_data(x, y, symbol = 'o'):
    '''
    This code will plot the arrays in x and y with symbol (default "o")
    '''
    pyplot.plot(x, y, symbol)
    pyplot.savefig('temp_vs_mosquitos.pdf')
    pyplot.close()
Finally, let's run our whole program
year, temperature, rainfall, mosquitos = read_csv_file('mosquito_data_A1.csv')
temp_in_c = convert_fahrenheit_to_celsius(temperature)
plot_data(temp_in_c, mosquitos)
Show if __name__ == "__main__":
Repeat:
import
run from shell
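The if __name__ == "__main__": pattern named in the outline above lets the same file be imported as a library or run from the shell. A minimal sketch, assuming the file is saved under a hypothetical name such as mosquito_analysis.py. The conversion function is the one written earlier, applied here to a plain list so that the example needs no imports, and main is a made-up name for the function that runs the steps.

```python
def convert_fahrenheit_to_celsius(temp_in_f):
    '''
    Convert a list of temperatures from Fahrenheit to Celsius.
    '''
    return [(temp - 32) * 5 / 9.0 for temp in temp_in_f]


def main():
    '''
    Run the analysis steps in order and return the converted temperatures.
    '''
    temperatures = [87, 72, 77]
    temp_in_c = convert_fahrenheit_to_celsius(temperatures)
    print(temp_in_c)
    return temp_in_c


# True when the file is run from the shell (python mosquito_analysis.py),
# False when it is loaded with "import mosquito_analysis".
if __name__ == "__main__":
    main()
```

With this layout, importing the file only defines the functions, so they can be reused elsewhere without the analysis running as a side effect.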