From the pandas homepage:
What problem does pandas solve?
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.
pandas is probably overkill for what we want to do, but it does make the process of handling and organizing data, and then turning into visualizations, less tedious. Its data structures are based off of structures fundamental and familiar to plain Python: lists and dictionaries.
Pandas was created by Wes McKinney, who also authored Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, which I recommend and also quote from in this guide.
Here's a quick 10-minute tour of pandas from McKinney:
The official pandas homepage is pandas.pydata.org. Its docs live here.
Sections most relevant to us:
Greg Reda has an excellent tutorial that covers roughly the same scope of concepts I'm cover here. Check them out for another perspective:
pandas
package¶For the rest of this lesson, assume that pandas has been imported into the environment like so:
import pandas as pd
In virtually every tutorial, including this one, you'll see the convention of pd
being used as shorthand for pandas
.
pandas.Series
data structure¶Pandas docs: pandas.Series
The pandas.Series
structure is pretty similar to a list, in that it holds an ordered list of values:
import pandas as pd
myseries = pd.Series(['alpha', 1, 2, 3, 'zeta'])
print("The first value is:", myseries[0])
# The first value is: alpha
However, you can also think of a Series
object as a dictionary, in that its elements can also be indexed by keys. Via Wes McKinney's Python for Data Analysis:
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict.
Series
object¶Since Series
is so similar to a dictionary, we can construct a new Series
object by passing in a dictionary to its constructor function:
mydict = {'gamma': 42, 'beta': 30, 'delta' : 101}
myseries = pd.Series(mydict)
print("This is gamma:", myseries['gamma'])
print("This is the 3rd element:", myseries[2])
This is gamma: 42 This is the 3rd element: 42
As indicated in the above example, when creating the Series
from a dictionary, pandas will sort the keys by default. Unlike a typical dictionary, this ordering will be enforced, as it would be in a list.
The concept of an ordered index is central to Python lists and for pandas Series as well. In lieu of constructing a Series using a dictionary, you can create a new Series by passing in a list of values and then specifying the index argument.
Note in the example below how the Series will not sort the index alphabetically and will keep it in the same order as we've specified in the index argument:
mylist = [42, 30, 10]
myseries = pd.Series(mylist, index = ['gamma', 'beta', 'delta'])
print("This is gamma:", myseries['gamma'])
print("This is the 3rd element:", myseries[2])
This is gamma: 42 This is the 3rd element: 10
As seen above, referring to the rows of a Series uses the same bracket []
notation as used for lists and dictionaries. However, there are a few more options in selecting multiple values when working with Series.
While using an individual index/key will return the single corresponding value, e.g. index[0]
gets you 42
, the following examples of multiple indexes/keys will create new Series objects:
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries[[0, 2]]
gamma 42 delta 10 dtype: int64
myseries[['alpha', 'beta']]
alpha 99 beta 30 dtype: int64
A note about the output above: The pandas
data structures come with metadata; the dtype: int64
refers to the fact that every value in the resulting Series objects from the above commands is an integer.
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries[2:]
delta 10 alpha 99 dtype: int64
However, we can also slice a Series by its index values:
myseries['gamma':'beta']
gamma 42 beta 30 dtype: int64
A couple of things to note:
beta
is also included in the resulting sub-Series.Assuming that everything in a given Series is of a single type, we can perform mass operations (better known as scalar and array operations) on that Series, creating a new Series object in the process.
Here's how to mass-assign the string 'fluffy'
to the same range of rows accessed in the previous example:
myseries[2:] = 'fluffy'
myseries
gamma 42 beta 30 delta fluffy alpha fluffy dtype: object
Note how the dtype
of myseries
changed to object
; the series no longer contains just integers. Although Series are just like lists and dictionaries in that they can contain a sequence of any type of objects, we normally strive to keep them all of one type in most data-wrangling/analysis scenarios.
Here's an example of scalar arithmetic, e.g. multiplying all of the elements in the Series with a single value:
myseries = pd.Series([42, 30, 10, 99], index = ['gamma', 'beta', 'delta', 'alpha'])
myseries * 10
gamma 420 beta 300 delta 100 alpha 990 dtype: int64
And here's an example of array arithmetic, e.g. adding elements of a Series with corresponding elements from another sequence (i.e. a list or Series) to produce a new Series:
myseries + [1000, 2000, 3000, -4000]
gamma 1042 beta 2030 delta 3010 alpha -3901 dtype: int64
Attempting to perform array operations against differently-sized sequences will result in an error:
myseries + [1000, 2000]
Boolean expressions are also possible. The following command creates a new Series object that contains the result of the boolean expression as it is applied to each row:
myseries > 20
gamma True beta True delta False alpha True dtype: bool
The pandas Series objects come with a variety of useful and familiar aggregation methods and attributes:
myseries = pd.Series({'a': 20, 'b': -5, 'c': 42})
myseries.size
3
myseries.sum()
57
myseries.mean()
19.0
myseries.min()
-5
One of the biggest advantages for us in using pandas data structures are their many methods for filtering data.
Consider how we filter a dictionary in plain Python:
mydict = {'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7}
newdict = {}
for k, v in mydict.items():
if v > 10:
newdict[k] = v
Or if you prefer using a comprehension:
{k: v for k, v in mydict.items() if v > 10}
{'apples': 42, 'carrots': 303}
Filtering a Series can be done using the same bracket []
notation used in indexing a Series:
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303, 'dates': 7})
myseries[myseries > 10]
apples 42 carrots 303 dtype: int64
It can't be emphasized enough that while a pandas Series is as accessible as a Python dictionary, it still maintains its specified order of keys, i.e. its index, and the values they map to.
When creating a Series from a Python list, its index is simply the numerical position of each element. And in creating a Series from a dictionary, its index are the keys of the dictionary:
series_a = pd.Series([4, 10, 3])
series_a.index
Int64Index([0, 1, 2], dtype='int64')
series_b = pd.Series({'x': 500, 'y': 600, 'z': 700})
series_b.index
Index(['x', 'y', 'z'], dtype='object')
The index=
method can be used to change the index, effectively relabeling the data, by substituting a list of new index values (the Series and the list of new index values have to be the same size):
myseries = pd.Series([4, 10, 3])
myseries.index = ['a', 'b', 'c']
myseries.index
Index(['a', 'b', 'c'], dtype='object')
The reindex()
method allows you to rearrange the values of a Series; note that it does not change the Series object but creates a new one:
myseries = pd.Series({'apples': 42, 'bagels': 9, 'carrots': 303})
myseries.reindex(['bagels', 'apples', 'carrots'])
bagels 9 apples 42 carrots 303 dtype: int64
Reindexing the series using values not found in the original index will return a Series that contains the new index values pointing to null/NaN
values:
myseries.reindex(['oranges', 'bagels', 'apples', 'carrots'])
oranges NaN bagels 9 apples 42 carrots 303 dtype: float64
pandas.Dataframes
structures¶The DataFrame structure can be thought of as series of Series objects, except that they share two indices: the index that represents the row order, and the index that represents the columns order. In other words, it's pretty much like a spreadsheet.
You can create a DataFrame by passing in a list-of-lists, i.e. a 2-dimensional array. The more common route is to use the read_csv() method, which can read from a local filename or, very conveniently, a URL:
# data from Yahoo Finance Historical Prices:
# http://finance.yahoo.com/q/hp?s=AAPL
csvurl = "http://real-chart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=1&c=2013&d=04&e=1&f=2015&g=d"
prices = pd.read_csv(csvurl, parse_dates = [0])
prices.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
0 | 2015-05-01 | 126.10000 | 130.13000 | 125.30000 | 128.95000 | 57195000 | 128.95000 |
1 | 2015-04-30 | 128.64000 | 128.64000 | 124.58000 | 125.15000 | 82475900 | 125.15000 |
2 | 2015-04-29 | 130.16000 | 131.59000 | 128.30000 | 128.64000 | 62410800 | 128.64000 |
3 | 2015-04-28 | 134.46001 | 134.53999 | 129.57001 | 130.56000 | 118580700 | 130.56000 |
4 | 2015-04-27 | 132.31000 | 133.13000 | 131.14999 | 132.64999 | 84783100 | 132.64999 |
In the read_csv()
call, I make use of the parse_dates
argument and specify that the first column contains a date that should be converted to a proper timestamp object. Again, something you could write a for-loop with the datetime.strptime()
function, but a very handy conveneince that pandas
brings to the table.
By default, loading data via a CSV file (or via any other format besides a dictionary) will create a DataFrame in which the index is a sequential list of integers. But, as a DataFrame is a group of series, we can actually specify that one of its columns be the index.
In the following snippet, I create a new copy of the prices
DataFrame, but using the Date
column as the index:
dprices = prices.set_index(prices['Date'])
dprices.head()
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
Date | |||||||
2015-05-01 | 2015-05-01 | 126.10000 | 130.13000 | 125.30000 | 128.95000 | 57195000 | 128.95000 |
2015-04-30 | 2015-04-30 | 128.64000 | 128.64000 | 124.58000 | 125.15000 | 82475900 | 125.15000 |
2015-04-29 | 2015-04-29 | 130.16000 | 131.59000 | 128.30000 | 128.64000 | 62410800 | 128.64000 |
2015-04-28 | 2015-04-28 | 134.46001 | 134.53999 | 129.57001 | 130.56000 | 118580700 | 130.56000 |
2015-04-27 | 2015-04-27 | 132.31000 | 133.13000 | 131.14999 | 132.64999 | 84783100 | 132.64999 |
The practical implication is that I can now use a date string as a key to get rows:
dprices['2015-04-27']
Date | Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|---|
Date | |||||||
2015-04-27 | 2015-04-27 | 132.31 | 133.13 | 131.14999 | 132.64999 | 84783100 | 132.64999 |
I've made a separate walkthrough using more Yahoo Finance data with data frames here.
For more thorough walkthroughs with DataFrames, check out Greg Reda's work: