Before we get into the details of flow analysis, it makes sense to get used to the tools we'll be using. This notebook is designed to introduce IPython and Pandas for general data analysis tasks. This forms a basis for doing flow analysis work in this environment.
This notebook is (very lightly) adapted from Chapters 0 through 4 of the "Pandas Cookbook" by Julia Evans; all credit for the content here goes to her (and when you read "I" here, that's Ms. Evans speaking).
This tour is designed to be run interactively in an IPython notebook. If you're not already viewing the tutorial in IPython's notebook mode, install IPython notebook and start it from a terminal by running
ipython notebook
First, we need to explain how to run cells. Try to run the cell below!
import pandas as pd
print("Hi! This is a cell. Press the ▶ button above to run it")
You can also run a cell with Ctrl+Enter or Shift+Enter. Experiment a bit with that.
One of the most useful things about IPython notebook is its tab completion.
Try this: click just after read_csv( in the cell below and press Shift+Tab 4 times, slowly
pd.read_csv(
After the first time, you should see this:
After the second time:
After the fourth time, a big help box should pop up at the bottom of the screen, with the full documentation for the read_csv function:
I find this amazingly useful. I think of this as "the more confused I am, the more times I should press Shift+Tab". Nothing bad will happen if you tab complete 12 times.
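The same documentation can also be reached from plain Python, outside the notebook's pop-up. Here is a minimal sketch; it uses the standard-library json.loads as a stand-in so that it runs without pandas installed, and the helper name summarize is hypothetical. Passing pd.read_csv instead should work the same way.

```python
import inspect
import json


def summarize(func):
    """Return (signature, first docstring line) for a callable."""
    signature = str(inspect.signature(func))
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else ""
    return signature, first_line


sig, first_line = summarize(json.loads)
print("json.loads" + sig)   # the call signature, as in the first tooltip
print(first_line)           # the opening line of the docstring
```

For the full text, the built-in help(json.loads) prints the whole docstring.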
Okay, let's try tab completion for function names!
pd.r
You should see this:
Writing code in the notebook is pretty normal.
def print_10_nums():
    for i in range(10):
        print(i)
print_10_nums()
As of the latest stable version, the notebook autosaves. You should use the latest stable version. Really.
IPython has all kinds of magic functions. Here's an example that uses the %time magic to compare sum() over a list comprehension with sum() over a generator expression.
%time sum([x for x in range(100000)])
%time sum(x for x in range(100000))
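%time only works inside IPython. For a rough equivalent in a plain Python script, the standard-library timeit module can run the same comparison. This is a sketch, not part of the original cookbook; the names N, REPEATS, list_time and gen_time are illustrative, and the timings will vary from machine to machine.

```python
import timeit

N = 100000      # same range size as the %time cells above
REPEATS = 20    # run each statement several times to smooth out noise

# Total time for REPEATS calls of each variant.
list_time = timeit.timeit(lambda: sum([x for x in range(N)]), number=REPEATS)
gen_time = timeit.timeit(lambda: sum(x for x in range(N)), number=REPEATS)

print(f"list comprehension:   {list_time / REPEATS:.6f} s per call")
print(f"generator expression: {gen_time / REPEATS:.6f} s per call")
```

Both variants compute the same sum; the difference is that the list comprehension builds the whole list in memory first, while the generator expression feeds values to sum() one at a time.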
The magics I use most are %time and %prun for profiling. You can run %magic to get a list of all of them, and %quickref for a reference sheet.
%quickref
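To give a feel for the kind of report %prun produces, here is a sketch that uses the standard-library cProfile and pstats modules directly, so it also runs outside IPython. The function sum_of_squares is a made-up example and not part of the original cookbook.

```python
import cProfile
import io
import pstats


def sum_of_squares(n):
    """Toy workload to profile (hypothetical example function)."""
    return sum(x * x for x in range(n))


# Profile a single call and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(sum_of_squares, 100000)

# Format the statistics into a string, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)   # show at most the top 5 entries
report = buffer.getvalue()
print(report)
```

Inside the notebook, running %prun sum_of_squares(100000) in a cell should give a similar table without the extra setup.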
Now let's get started with using Pandas. First, run the following code to set up the environment in IPython: