%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Following is optional: set plotting styles
import seaborn; seaborn.set()
Outline:

- Series
- DataFrame
While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.), and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), or any data visualization facilities.
In particular, Python lists are very flexible containers that can be nested arbitrarily deep and which can hold any Python object in them, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices. In contrast, much of our modern heritage of scientific computing has been built on top of libraries written in the Fortran language, which has native support for vectors and matrices as well as a library of mathematical functions that can efficiently operate on entire arrays at once.
Python lists are general-purpose containers that store sequences of values for manipulation:
L = [1, 2, 3, 4, 5]
# Zero-based Indexing
print(L[0], L[1])
1 2
# Indexing from the end
print(L[-1], L[-2])
5 4
# Slicing
L[0:3]
[1, 2, 3]
# The 0 can be left out
L[:3]
[1, 2, 3]
# Slicing by a step size
L[0:5:2]
[1, 3, 5]
# Reversing with a negative step size
L[::-1]
[5, 4, 3, 2, 1]
# Lists of multiple types
L2 = [1, 'two', 3.14]
# Adding lists together will append them:
L + L2
[1, 2, 3, 4, 5, 1, 'two', 3.14]
Despite the list's flexibility, lists are inefficient for storing large amounts of data:
import math
# make a large list of theta values
theta = [0.01 * i for i in range(1000000)]
sin_theta = [math.sin(t) for t in theta]
sin_theta[:10]
[0.0, 0.009999833334166664, 0.01999866669333308, 0.02999550020249566, 0.03998933418663416, 0.04997916927067833, 0.059964006479444595, 0.06994284733753277, 0.0799146939691727, 0.08987854919801104]
%timeit [math.sin(t) for t in theta]
10 loops, best of 3: 140 ms per loop
Let's take a look at doing essentially the same operation using NumPy. By convention, we'll import numpy under the shorthand np:
import numpy as np
theta = 0.01 * np.arange(1E6)
sin_theta = np.sin(theta)
sin_theta[:10]
array([ 0. , 0.00999983, 0.01999867, 0.0299955 , 0.03998933, 0.04997917, 0.05996401, 0.06994285, 0.07991469, 0.08987855])
%timeit np.sin(theta)
100 loops, best of 3: 14.7 ms per loop
NumPy's version of this is nearly 10x faster than the list-based Python version, and it is arguably simpler as well!
There is a lot of info out there about how to use numpy. Here I want to just briefly go over some of the key concepts as we progress to talking about using Python for real-world data.
There are many, many ways to create NumPy arrays. We'll demonstrate a few here:
# from a list
np.array([1, 2, 3, 4])
array([1, 2, 3, 4])
# range of numbers, like Python's range()
np.arange(0, 10, 0.5)
array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])
# evenly spaced numbers between two limits
np.linspace(0, 10, 5)
array([ 0. , 2.5, 5. , 7.5, 10. ])
# array of zeros
np.zeros(10)
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
# array of ones
np.ones(10)
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
# array of random values
np.random.rand(10)
array([ 0.06244294, 0.34856405, 0.83474136, 0.06727073, 0.36735175, 0.45133983, 0.01958005, 0.08204342, 0.75092178, 0.46176409])
Operations on numpy arrays are done element-wise. This means that you don't explicitly have to write for-loops in order to do these operations!
# define some arrays
x = np.arange(5)
y = np.random.random(5)
# addition – add 1 to each
x + 1
array([1, 2, 3, 4, 5])
# multiplication – multiply each by 2
y * 2
array([ 1.51538329, 0.45745721, 1.06958137, 1.83321707, 1.53266941])
# two arrays: everything is element-wise
x / y
array([ 0. , 4.37199364, 3.73978093, 3.27293483, 5.21965137])
# exponentiation
np.exp(x)
array([ 1. , 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
# trigonometric functions
np.sin(x)
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
# combining operations
np.cos(x) + np.sin(2 * np.pi * (x - y))
array([ 1.99883243, -0.45077954, -0.19928728, -0.48967621, 0.34109413])
Indexing works just like with Python lists:
x
array([0, 1, 2, 3, 4])
x[0], x[1]
(0, 1)
x[:3]
array([0, 1, 2])
x[::2]
array([0, 2, 4])
x[::-1]
array([4, 3, 2, 1, 0])
Unlike lists, NumPy arrays can have multiple dimensions, and the indexing and slicing works efficiently!
M = np.arange(20).reshape(4, 5)
M
array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]])
M[1, 2]
7
M[:2, :2]
array([[0, 1], [5, 6]])
M[:, 1:3]
array([[ 1, 2], [ 6, 7], [11, 12], [16, 17]])
Another useful way of indexing arrays is to use masks. If we do a boolean operation on some array, the result is a boolean array:
M
array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]])
M < 8
array([[ True, True, True, True, True], [ True, True, True, False, False], [False, False, False, False, False], [False, False, False, False, False]], dtype=bool)
Boolean mask arrays can be used to select portions of a larger array, and operate on them
M[M < 8] = 0
M
array([[ 0, 0, 0, 0, 0], [ 0, 0, 0, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]])
M[M == 12] *= 2
M
array([[ 0, 0, 0, 0, 0], [ 0, 0, 0, 8, 9], [10, 11, 24, 13, 14], [15, 16, 17, 18, 19]])
M[M % 2 == 0] = 999
M
array([[999, 999, 999, 999, 999], [999, 999, 999, 999, 9], [999, 11, 999, 13, 999], [ 15, 999, 17, 999, 19]])
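Boolean masks are also useful for counting and aggregating, because True counts as 1 in arithmetic. Here is a quick sketch using a fresh copy of the array from above:

```python
import numpy as np

# a fresh copy of the 4x5 array used above
M = np.arange(20).reshape(4, 5)

# since True counts as 1, summing a mask counts the matching elements
print((M < 8).sum())  # 8 elements are less than 8

# masks can also feed into aggregations: mean of only the even elements
print(M[M % 2 == 0].mean())
```

This pattern (mask, then aggregate) avoids writing any explicit loops.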
As I mentioned, there is much more to numpy arrays, but this has covered the basic pieces needed here!
For data-intensive work in Python, the Pandas library has become essential. The name Pandas derives from "panel data," though many users probably don't know that.
Pandas can be thought of as NumPy with built-in labels for rows and columns, but it's also much, much more than that.
Pandas does this through two fundamental object types, both built upon NumPy arrays: the Series object and the DataFrame object.
Series

A Series is a basic holder for one-dimensional labeled data. It can be created much as a NumPy array is created:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
The series has a built-in concept of an index, which by default is the integers 0 through N - 1:
s.index
Int64Index([0, 1, 2, 3], dtype='int64')
We can access series values via the index, just like for NumPy arrays:
s[0]
0.10000000000000001
Unlike the NumPy array, though, this index can be something other than integers:
s2 = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s2
a 0 b 1 c 2 d 3 dtype: int64
s2['c']
2
In this way, a Series object can be thought of as similar to an ordered dictionary, mapping typed keys to typed values.
In fact, it's possible to construct a series directly from a Python dictionary:
pop_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
populations = pd.Series(pop_dict)
populations
California 38332521 Florida 19552860 Illinois 12882135 New York 19651127 Texas 26448193 dtype: int64
Note that in the version of pandas used here, the dictionary keys are sorted, so the order of the resulting series does not match the order of the dictionary definition (recent versions of pandas preserve insertion order).
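If a particular order (or a subset of the keys) is needed, an explicit index argument can be passed to the Series constructor. A quick sketch:

```python
import pandas as pd

pop_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127}

# an explicit index controls both the order and the selection of keys
s = pd.Series(pop_dict, index=['Texas', 'California'])
print(s)
```

Keys not listed in the index are simply dropped; index labels missing from the dictionary would appear as NaN.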
We can index or slice the populations as expected. Note that unlike positional slices, label-based slices include the endpoint:
populations['California']
38332521
populations['California':'Illinois']
California 38332521 Florida 19552860 Illinois 12882135 dtype: int64
A DataFrame, essentially, is a two-dimensional object holding labeled data. You can think of it as multiple Series objects which share the same index.
One of the most common ways of creating a dataframe is from a dictionary of arrays or lists. Note that in the IPython notebook with the correct settings, the dataframe will display in a rich HTML view:
data = {'state': ['California', 'Texas', 'New York', 'Florida', 'Illinois'],
'population': [38332521, 26448193, 19651127, 19552860, 12882135],
'area':[423967, 695662, 141297, 170312, 149995]}
states = pd.DataFrame(data)
states
area | population | state | |
---|---|---|---|
0 | 423967 | 38332521 | California |
1 | 695662 | 26448193 | Texas |
2 | 141297 | 19651127 | New York |
3 | 170312 | 19552860 | Florida |
4 | 149995 | 12882135 | Illinois |
If we don't like the default integer index, we can set one of the columns as the index instead:
states = states.set_index('state')
states
area | population | |
---|---|---|
state | ||
California | 423967 | 38332521 |
Texas | 695662 | 26448193 |
New York | 141297 | 19651127 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
To access a Series representing a column in the data, use dictionary-style indexing:
states['area']
state California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64
To access a row, use the loc indexer, which selects rows by label:
states.loc['California']
area 423967 population 38332521 Name: California, dtype: int64
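Rows can also be selected by integer position rather than label, using the iloc indexer. A minimal sketch on a small frame of the same shape (values copied from the table above):

```python
import pandas as pd

df = pd.DataFrame({'area': [423967, 695662],
                   'population': [38332521, 26448193]},
                  index=['California', 'Texas'])

# label-based row access
print(df.loc['Texas'])

# position-based row access: the first row, regardless of its label
print(df.iloc[0])
```

Keeping loc (labels) and iloc (positions) distinct avoids ambiguity when the index itself is made of integers.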
As you play around with DataFrames, you'll notice that many operations which work on NumPy arrays will also work on dataframes.
For example, there's arithmetic. Let's compute the population density and add it as a column to the data:
states['density'] = states['population'] / states['area']
states
area | population | density | |
---|---|---|---|
state | |||
California | 423967 | 38332521 | 90.413926 |
Texas | 695662 | 26448193 | 38.018740 |
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
We can even use masking the way we did in NumPy:
states[states['density'] > 100]
area | population | density | |
---|---|---|---|
state | |||
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
And we can do things like sorting by a column, then slicing to take the first three rows:
states.sort_values('density', ascending=False)[:3]
area | population | density | |
---|---|---|---|
state | |||
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
California | 423967 | 38332521 | 90.413926 |
One useful method is describe, which computes summary statistics for each column:
states.describe()
area | population | density | |
---|---|---|---|
count | 5.000000 | 5.000000 | 5.000000 |
mean | 316246.600000 | 23373367.200000 | 93.639859 |
std | 242437.411951 | 9640385.580443 | 37.672251 |
min | 141297.000000 | 12882135.000000 | 38.018740 |
25% | 149995.000000 | 19552860.000000 | 85.883763 |
50% | 170312.000000 | 19651127.000000 | 90.413926 |
75% | 423967.000000 | 26448193.000000 | 114.806121 |
max | 695662.000000 | 38332521.000000 | 139.076746 |
There are many, many more interesting operations that can be done on Series and DataFrame objects, but rather than continue using this toy data, we'll instead move to a real-world example, and illustrate some of the advanced concepts along the way.
This example is drawn from Wes McKinney's excellent book on the Pandas library, O'Reilly's Python for Data Analysis.
We'll be taking a look at a freely available dataset: the database of names given to babies in the United States over the last century.
First things first, we need to download the data, which can be found at http://www.ssa.gov/oact/babynames/limits.html. If you uncomment the following commands, they will do this automatically (note that these are Unix shell commands; they will not work on Windows):
# !curl -O http://www.ssa.gov/oact/babynames/names.zip
# !mkdir -p data/names
# !mv names.zip data/names/
# !cd data/names/ && unzip names.zip
Now we should have a data/names directory which contains a number of text files, one for each year of data:
!ls data/names
NationalReadMe.pdf yob1913.txt yob1947.txt yob1981.txt yob1880.txt yob1914.txt yob1948.txt yob1982.txt yob1881.txt yob1915.txt yob1949.txt yob1983.txt yob1882.txt yob1916.txt yob1950.txt yob1984.txt yob1883.txt yob1917.txt yob1951.txt yob1985.txt yob1884.txt yob1918.txt yob1952.txt yob1986.txt yob1885.txt yob1919.txt yob1953.txt yob1987.txt yob1886.txt yob1920.txt yob1954.txt yob1988.txt yob1887.txt yob1921.txt yob1955.txt yob1989.txt yob1888.txt yob1922.txt yob1956.txt yob1990.txt yob1889.txt yob1923.txt yob1957.txt yob1991.txt yob1890.txt yob1924.txt yob1958.txt yob1992.txt yob1891.txt yob1925.txt yob1959.txt yob1993.txt yob1892.txt yob1926.txt yob1960.txt yob1994.txt yob1893.txt yob1927.txt yob1961.txt yob1995.txt yob1894.txt yob1928.txt yob1962.txt yob1996.txt yob1895.txt yob1929.txt yob1963.txt yob1997.txt yob1896.txt yob1930.txt yob1964.txt yob1998.txt yob1897.txt yob1931.txt yob1965.txt yob1999.txt yob1898.txt yob1932.txt yob1966.txt yob2000.txt yob1899.txt yob1933.txt yob1967.txt yob2001.txt yob1900.txt yob1934.txt yob1968.txt yob2002.txt yob1901.txt yob1935.txt yob1969.txt yob2003.txt yob1902.txt yob1936.txt yob1970.txt yob2004.txt yob1903.txt yob1937.txt yob1971.txt yob2005.txt yob1904.txt yob1938.txt yob1972.txt yob2006.txt yob1905.txt yob1939.txt yob1973.txt yob2007.txt yob1906.txt yob1940.txt yob1974.txt yob2008.txt yob1907.txt yob1941.txt yob1975.txt yob2009.txt yob1908.txt yob1942.txt yob1976.txt yob2010.txt yob1909.txt yob1943.txt yob1977.txt yob2011.txt yob1910.txt yob1944.txt yob1978.txt yob2012.txt yob1911.txt yob1945.txt yob1979.txt yob2013.txt yob1912.txt yob1946.txt yob1980.txt
Let's take a quick look at one of these files:
!head data/names/yob1880.txt
Each file is just a comma-separated list of names, genders, and counts of babies given that name in that year. We can load these files using pd.read_csv, which is specifically designed for this:
names1880 = pd.read_csv('data/names/yob1880.txt')
names1880.head()
Mary | F | 7065 | |
---|---|---|---|
0 | Anna | F | 2604 |
1 | Emma | F | 2003 |
2 | Elizabeth | F | 1939 |
3 | Minnie | F | 1746 |
4 | Margaret | F | 1578 |
Oops! Something went wrong: read_csv tried to be smart and used the first line of the file as column labels. Let's fix this by specifying the column names manually:
names1880 = pd.read_csv('data/names/yob1880.txt',
names=['name', 'gender', 'births'])
names1880.head()
name | gender | births | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
That looks better. Now we can start playing with the data a bit.
First let's think about how we might count the total number of females and males born in the US in 1880.
If you're used to NumPy, you might be tempted to use masking like this:
First, we can get a mask over all females & males, and then use it to select a subset of the data:
males = names1880[names1880.gender == 'M']
females = names1880[names1880.gender == 'F']
Now we can take the sum of the births for each of these:
males.births.sum(), females.births.sum()
(110491, 90993)
But there's an easier way to do this, using one of Pandas' very powerful features: groupby:
grouped = names1880.groupby('gender')
grouped
<pandas.core.groupby.DataFrameGroupBy object at 0x10eefb890>
This grouped object is now an abstract representation of the data, split on the given column. In order to actually do something with this data, we need to specify an aggregation to apply across each group. In this case, what we want is the sum:
grouped.sum()
births | |
---|---|
gender | |
F | 90993 |
M | 110491 |
We can do other aggregations as well:
grouped.size()
gender F 942 M 1058 dtype: int64
grouped.mean()
births | |
---|---|
gender | |
F | 96.595541 |
M | 104.433837 |
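Several aggregations can also be computed in a single pass with the agg method. A quick sketch on a small made-up frame (not the 1880 data):

```python
import pandas as pd

# toy data with the same column names as the names dataset
df = pd.DataFrame({'gender': ['F', 'F', 'M', 'M', 'M'],
                   'births': [10, 30, 20, 20, 50]})

# sum, mean, and count for each group in one call
summary = df.groupby('gender').births.agg(['sum', 'mean', 'count'])
print(summary)
```

The result is a DataFrame with one column per aggregation, which is often more convenient than calling sum(), mean(), and size() separately.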
Or, if we wish, we can get a description of the grouping:
grouped.describe()
births | ||
---|---|---|
gender | ||
F | count | 942.000000 |
mean | 96.595541 | |
std | 328.152904 | |
min | 5.000000 | |
25% | 7.000000 | |
50% | 13.000000 | |
75% | 43.750000 | |
max | 7065.000000 | |
M | count | 1058.000000 |
mean | 104.433837 | |
std | 561.232488 | |
min | 5.000000 | |
25% | 7.000000 | |
50% | 12.000000 | |
75% | 41.000000 | |
max | 9655.000000 |
But here we've just been looking at a single year. Let's try to put together all the data in all the years.
To do this, we'll use pandas' concat function to concatenate all the data together.
First we'll create a function which loads a single year of data, as we did above:
def load_year(year):
data = pd.read_csv('data/names/yob{0}.txt'.format(year),
names=['name', 'gender', 'births'])
data['year'] = year
return data
Now let's load all the data into a list, and call pd.concat on that list:
names = pd.concat([load_year(year) for year in range(1880, 2014)])
names.head()
name | gender | births | year | |
---|---|---|---|---|
0 | Mary | F | 7065 | 1880 |
1 | Anna | F | 2604 | 1880 |
2 | Emma | F | 2003 | 1880 |
3 | Elizabeth | F | 1939 | 1880 |
4 | Minnie | F | 1746 | 1880 |
It looks like we've done it!
Let's start with something easy: we'll use groupby again to see the total number of births per year:
births = names.groupby('year').births.sum()
births.head()
year 1880 201484 1881 192700 1882 221537 1883 216952 1884 243468 Name: births, dtype: int64
We can use the plot() method to see a quick plot of these (note that because we used the %matplotlib inline magic at the start of the notebook, the resulting plot will be shown inline within the notebook).
births.plot();
The so-called "baby boom" generation after the Second World War is abundantly clear!
We can also use other aggregates: let's see how many names are used each year:
names.groupby('year').births.count().plot();
Apparently there's been a huge increase in the diversity of names over time!
groupby can also be used to add columns to the data: think of it as a view of the data that you're modifying. Let's add a column giving the frequency of each name within each year and gender:
def add_frequency(group):
group['birth_freq'] = group.births / group.births.sum()
return group
names = names.groupby(['year', 'gender']).apply(add_frequency)
names.head()
name | gender | births | year | birth_freq | |
---|---|---|---|---|---|
0 | Mary | F | 7065 | 1880 | 0.077643 |
1 | Anna | F | 2604 | 1880 | 0.028618 |
2 | Emma | F | 2003 | 1880 | 0.022013 |
3 | Elizabeth | F | 1939 | 1880 | 0.021309 |
4 | Minnie | F | 1746 | 1880 | 0.019188 |
Notice that the apply() function iterates over each group, and calls a function which modifies the group. The result is then reassembled into a container which looks like the original dataframe.
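A common alternative to apply() for this kind of per-group normalization (not used in the original notebook) is transform, which returns a result aligned to the original rows. A sketch on toy data:

```python
import pandas as pd

# toy data shaped like the names dataset
df = pd.DataFrame({'year': [1880, 1880, 1881, 1881],
                   'gender': ['F', 'F', 'F', 'F'],
                   'births': [30, 70, 20, 80]})

# divide each row's births by the total births in its (year, gender) group;
# transform('sum') broadcasts each group's sum back to that group's rows
df['birth_freq'] = df.births / df.groupby(['year', 'gender']).births.transform('sum')
print(df)
```

Because transform preserves the original shape, there is no need to write a function that mutates each group and reassemble the pieces.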
Next we'll discuss Pivot Tables, which are an even more powerful way of (re)organizing your data.
Let's say that we want to plot the men and women separately. We could do this by using masking, as follows:
men = names[names.gender == 'M']
women = names[names.gender == 'F']
And then we could proceed as above, using groupby to group on the year. But we would end up with two different views of the data. A better way to do this is to use a pivot_table, which is essentially a groupby in multiple dimensions at once:
births = names.pivot_table('births',
index='year', columns='gender',
aggfunc=sum)
births.head()
gender | F | M |
---|---|---|
year | ||
1880 | 90993 | 110491 |
1881 | 91954 | 100746 |
1882 | 107850 | 113687 |
1883 | 112322 | 104630 |
1884 | 129022 | 114446 |
Note that this has grouped the index by the value of year, and grouped the columns by the value of gender.
Let's plot the results now:
births.plot(title='Total Births');
Some names have shifted from being girls names to being boys names. Let's take a look at some of these:
names_to_check = ['Allison', 'Alison']
# filter on just the names we're interested in
births = names[names.name.isin(names_to_check)]
# pivot table to get year vs. gender
births = births.pivot_table('births', index='year', columns='gender')
# fill all NaNs with zeros
births = births.fillna(0)
# normalize along columns
births = births.div(births.sum(1), axis=0)
births.plot(title='Fraction of babies named Allison');
We can see that prior to about 1905, all babies named Allison were male. Over the 20th century this reversed, until by the end of the century nearly all Allisons were female!
There's some noise in this data: we can smooth it out a bit by using a 5-year rolling mean:
births.rolling(5).mean().plot(title="Allisons: 5-year moving average");
This gives a smoother picture of the transition, and is an example of the bias/variance tradeoff that we'll often see in modeling: a smoother model has less variance (variation due to sampling or other noise) but at the expense of more bias (the model systematically misrepresents the data slightly).
We'll discuss this type of tradeoff more in coming sessions.
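To make the tradeoff concrete, here is a small synthetic sketch (not using the names data): the same noisy signal smoothed with a small and a large rolling window.

```python
import numpy as np
import pandas as pd

# noisy samples of a smooth underlying signal
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
noisy = pd.Series(np.sin(x) + 0.3 * rng.standard_normal(200))

# small window: follows the noise closely (high variance, low bias)
smooth_small = noisy.rolling(3, center=True).mean()

# large window: suppresses noise but flattens the signal (low variance, high bias)
smooth_large = noisy.rolling(51, center=True).mean()
```

Plotting all three series against x makes the effect visible: the large-window curve is much calmer, but its peaks are noticeably shallower than the true sine wave.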
We've just scratched the surface of what can be done with Pandas, but we'll get a chance to play with this more in the breakout session coming up.
For more information on using Pandas, check out the pandas documentation or the book Python for Data Analysis by Pandas creator Wes McKinney.