Pandas¶

(If you're using the code files, please open pandas_lessons.py)

The importance of data preprocessing¶

Data preprocessing (also called data wrangling, cleaning, scrubbing, etc) is the most important thing you will do with your data because it sets the stage for the analysis part of your data analysis workflow. The preprocessing you do largely depends on what kind of data you have, what sort of analysis you'll be doing with your data, and what you intend to do with the results.

Preprocessing is also a process for getting to know your data, and can answer questions such as these (and more):

What kind of data are you working with?
Is it categorical, continuous, or a mix of both?
What's the distribution of features in your dataset?
What sort of wrangling do you have to do?
Do you have any missing data?
Do you need to remove missing data?
Do you need only a subset of your data?
Do you need more data?
Or less?

The questions you'll have to answer are, again, dependent upon the data that you're working with, and preprocessing can be a way to figure that out.

What is Pandas?¶

Pandas is by far my favorite preprocessing tool. It's a data wrangling/modeling/analysis tool that is similar to R and Excel; in fact, the DataFrame data structure in Pandas was named after the DataFrame in R. Pandas comes with several easy-to-use data structures, two of which (the Series and the DataFrame) I'll be covering here.

I'll also be covering a bunch of different wrangling tools, as well as a couple of analysis tools.

Why Pandas?¶

So, why would you want to use Python, as opposed to tools like R and Excel? I like to use it because I like to keep everything in Python, from start to finish. It just makes it easier if I don't have to switch back and forth between other tools. Also, if I have to build in preprocessing as part of a production system, which I've had to do at my job, it makes sense to just do it in Python from the beginning.

Pandas is great for preprocessing, as we'll see, and it can be easily combined with other modules from the scientific Python stack.

Pandas data structures¶

Pandas has several different data structures, but we're going to talk about the Series and the DataFrame.

The Series¶

The Series is a one-dimensional array that can hold a variety of data types, including a mix of those types. The row labels in a Series are collectively called the index. You can create a Series in a few different ways. Here's how you'd create a Series from a list.

In [ ]:

import pandas as pd

some_numbers = [2, 5, 7, 3, 8]

series_1 = pd.Series(some_numbers)
series_1

To specify an index, you can also pass in a list.

In [ ]:

ind = ['a', 'b', 'c', 'd', 'e']

series_2 = pd.Series(some_numbers, index=ind)
series_2

We can pull that index back out again, too, with the .index attribute.

In [ ]:

series_2.index

You can also create a Series with a dictionary. The keys of the dictionary will be used as the index, and the values will be used as the Series array.

In [ ]:

more_numbers = {'a': 9, 'b': 'eight', 'c': 7.5, 'd': 6}

series_3 = pd.Series(more_numbers)
series_3

Notice how, in that previous example, I created a Series with integers, a float, and a string.

The DataFrame¶

The DataFrame is Pandas' most used data structure. It's a two and greater dimensional structure that can also hold a variety of mixed data types. It's similar to a spreadsheet in Excel or a SQL table. You can create a DataFrame with a few different methods. First, let's look at how to create a DataFrame from multiple Series objects.

In [ ]:

combine_series = pd.DataFrame([series_2, series_3])
combine_series

Notice how in column b, we have two kinds of data. If a column in a DataFrame contains multiple types of data, the data type (or dtype) of the column will be chosen to accomodate all of the data. We can look at the data types of different columns with the .dtypes attribute. object is the most general, which is what has been chosen for column b.

In [ ]:

combine_series.dtypes

Another way to create a DataFrame is with a dictionary of lists. This is pretty straightforward:

In [ ]:

data = {'col1': ['i', 'love', 'pandas', 'so', 'much'],
        'col2': ['so', 'will', 'you', 'i', 'promise']}

df = pd.DataFrame(data)
df

File I/O¶

It's really easy to read data into Pandas from a file. Pandas will read your file directly into a DataFrame. There are multiple ways to read in files, but they all work in the same way. Here's how you read in a CSV file:

In [ ]:

wine = pd.read_csv('../data/wine.csv')
wine.head()

Reading in a text file is just as easy. Make sure to pass in '\t' to the delimiter parameter.

In [ ]:

auto_mpg = pd.read_csv('../data/auto_mpg.txt', delimiter='\t')
auto_mpg.head()

Exploring the data¶

Here are some different ways to explore the data we have. Let's first take a look at some of the basic characteristics of the auto_mpg dataset. You can easily find the number of rows and the number of columns a dataframe has using the .shape attribute.

In [ ]:

auto_mpg.shape

You've already seen the head() function, which returns the first five lines in the dataset. To grab the last 5 lines, you can use the tail() function:

In [ ]:

auto_mpg.tail()

Getting column names from a DataFrame is also easy and can be done using the .columns attribute.

In [ ]:

wine.columns

Another useful thing you can do is generate some summary statistics using the describe() function. The describe() function calculates descriptive statistics like the mean, standard deviation, and quartile values for continuous and integer data that exist in your dataset. Don't worry, Pandas won't try to calculate the standard deviation of your categorical values!

In [ ]:

wine.describe()

Another useful thing you can do to explore your data is to sort it. Let's say we wanted to sort our auto_mpg DataFrame by mpg. This is very easy as well:

In [ ]:

auto_mpg.sort(columns='mpg').tail()

Lesson: let's see what's going on in our data!¶

This dataset is data on credit approvals. The column names and data were changed to protect the confidentiality of the data.

In [ ]:

f = '../data/credit_approval.csv'

# How do you read in that file?

# Can you grab just the column names?

In [ ]:

# How many rows and columns does the dataframe have?

In [ ]:

# Now, look at the first 5 lines

In [ ]:

# Now, look at the last 5 lines

In [ ]:

# Can you describe() the data? (Notice how Pandas only "describes" the numerical data!)

In [ ]:

# Let's sort on column H

Working with dataframes¶

Pandas has a ton of functionality for manipulating and wrangling the data. Let's look at a bunch of different ways to select and subset our data.

Selecting columns and rows¶

There are multiple ways to select by both rows and columns. From index to slicing to label to position, there are a variety of methods to suit your data wrangling needs.

Let's select just the mpg column from the auto_mpg DataFrame. This works similar to how you would access values from a dictionary:

In [ ]:

auto_mpg['mpg']

You can do exactly the same thing by using mpg as an attribute:

In [ ]:

auto_mpg.mpg

To extract rows from a DataFrame, you can use the slice method, similar to how you would slice a list. Here's how we would grab rows 7-13 from the wine DataFrame:

In [ ]:

wine[7:14]

Pandas also has tools for purely label-based selection of rows and columns using the .loc indexer. The .loc indexer takes input as [row, column].

For example, let's say we wanted to select the abv value in the 8th instance in our wine DataFrame:

In [ ]:

wine.loc[8,'abv']

We can also use .loc to grab slices. It's important to note that .loc interprets the index as a label. This means that, if we select a range, it will grab the last item in the range, unlike slicing in a list. The index is the label for the rows. Let's grab the abv for rows 8 to 11 from the wine DataFrame.

In [ ]:

wine.loc[8:11, 'abv']

And, as you might expect, we can select multiple columns by passing in a list of column names. Let's also grab ash and color for rows 8 to 11.

In [ ]:

wine.loc[8:11, ['abv', 'ash', 'color']]

Finally, let's just grab all columns for rows 8 to 11.

In [ ]:

wine.loc[8:11, :]

So, .loc provides functionality for a very specific and precise selection method.

Pandas has tools for purely position-based selection of rows and columns using the .iloc indexer, which works exactly how slicing a list works. The .iloc indexer also takes input as [row, column], but takes only integer input. If we wanted to access the 60th row and the model value from auto_mpg, it would look like this (remember that integer indexing is 0-based):

In [ ]:

auto_mpg.iloc[60, 6]

To grab rows 60-63 and the last three columns from the auto_mpg DataFrame, we would need to do the following:

In [ ]:

auto_mpg.iloc[60:64, 6:9]

.iloc again works like slicing a list, based on position, so it does not grab the last item, like .loc does.

To grab all values and those last three columns from the auto_mpg DataFrame:

In [ ]:

auto_mpg.iloc[:, 6:9]

One of my favorite methods for selecting data is through boolean indexing. Boolean indexing is similar to the WHERE clause in SQL in that it allows you to filter out data based on certain criteria. Let's see how this works.

Let's select from the wine DataFrame where wine_type is type 1.

In [ ]:

wine[wine['wine_type'] == 1]

This works with any comparison operators, like >, < >=, !=, and so on. For example, we can select everything from the wine DataFrame where the value in the magnesium column is less than 100.

In [ ]:

wine[wine['magnesium'] < 100]

You can also say 'not' with the tilde: ~

Let's select from the wine DataFrame where magnesium is NOT less than 100, which is equivalent to saying greater than or equal to.

In [ ]:

wine[~(wine['magnesium'] < 100)]

It's also possible to combine these boolean indexers. Make sure you enclose them in parentheses. This is something I usually forget.

Let's select from wine where magnesium is less than 100 and the type of wine is type 1.

In [ ]:

wine[(wine['magnesium'] < 100) & (wine['wine_type'] == 1)]

If you wanted to, you could just keep on chaining the booleans together. Let's add on where the abv is greater than 14.

In [ ]:

wine[(wine['magnesium'] < 100) & (wine['wine_type'] == 1) & (wine['abv'] > 14)]

Another method of selecting data is using the isin() function. If you pass in a list to isin(), it will return a DataFrame of booleans. True means that the value at that index is in the list you passed into isin().

Let's take the first five rows of the auto_mpg DataFrame and check for certain values existing in the DataFrame.

In [ ]:

auto_mpg_5 = auto_mpg.head()

vals = [8, 150, 12.0, 'ford torino']
auto_mpg_5.isin(vals)

If it says True, it means that one of the values from the vals list occurs there.

Lesson: let's try some of these on some data!¶

In [ ]:

# Extract column C from the credit_approval dataframe we read in above

In [ ]:

# Slice rows 5-10 from the credit_approval dataframe

In [ ]:

# How would you look up the value for the 13th row in column C by label (loc)?

In [ ]:

# How would you look up the same thing by position (iloc)?

In [ ]:

# What if I wanted to select all data from credit_approval based on column C being greater than 5?

In [ ]:

# What if I wanted to select data based on column C being greater than 5 and column F being equal to 'w'?

In [ ]:

# What if I wanted to look at a boolean DataFrame of where values are in ['t', 's', 100, 0] in credit_approval?

Groupby¶

groupby() is just like SQL's 'group by' clause. What groupby does is a three-step process:

Split the data
Apply a function to the split groups
Recombine the data

In the apply step, you can do things like apply a statistical function, filter out data, or transform the data.

Let's groupby() the wine_type in our wine DataFrame! Let's start with just groupby(), and then build it from there. This will produce a DataFrame groupby object.

In [ ]:

wine.groupby('wine_type')

Not so interesting yet. This object has some attributes you can access. We can get lists of which rows are in which group by using the .groups attribute:

In [ ]:

wine.groupby('wine_type').groups

The dataset was in order by wine_type to begin with, so that makes sense. To get just the keys, add the .keys() function to the end of that line.

In [ ]:

wine.groupby('wine_type').groups.keys()

Let's group our auto_mpg dataset by cylinders, just for contrast.

In [ ]:

auto_mpg.groupby('cylinders').groups

You can see we have four observations with three cylinders, many more with four, and so on.

Going back to the wine example, let's apply an aggregate function. Let's generate the mean of all the other values and group them by wine_class.

In [ ]:

wine.groupby('wine_type').mean()

So, the mean abv for wine with type 1 is 13.74, type 2 is 12.27, type 3 is 13.15. The mean malic_acid for wine with type 1 is 2.01, and so on. So, with one line of code, we're able to apply a function to the entire dataset and see what's going on within different groups.

Selecting from a groupby DataFrame works the same way as selecting from any other DataFrame. Let's select the abv where wine_type is 2.

In [ ]:

wine_type_mean = wine.groupby('wine_type').mean()

wine_type_mean.loc[2, 'abv']

It's also possible to apply multiple functions to the entire DataFrame using the agg() function. Let's get not only the mean, but the count and the standard deviation as well for each value in the DataFrame, still grouping by wine_type.

In [ ]:

wine.groupby('wine_type').agg(['mean', 'count', 'std'])

It's also possible to run different functions on different columns. Let's get the mean for abv, the standard deviation for ash, and the sum of the values for hue. To do this, you'll need to create a dictionary with these functions, with the column names as the dictionary keys.

In [ ]:

multiple_funcs = {'abv': 'std', 'ash': 'mean', 'hue': sum}

wine.groupby('wine_type').agg(multiple_funcs)

Lesson: Groupby galore¶

Let's take this one step at a time.

In [ ]:

# Let's group credit_approval by column G.

In [ ]:

# Can you generate a list of all of the groups in the groupby object we just made?

In [ ]:

# Let's use mean() on credit_approval_group to get the mean of our numeric values.

In [ ]:

# Let's see both the standard deviation and the sum of everything in credit_approval_group

In [ ]:

# Let's see the count on column H, the sum on column C, and the mean on column O.

Merge/join; or, how Pandas can be like SQL¶

In Pandas, it's possible to combine DataFrames and Series much like you would in SQL. For the examples in this section, we'll work with smaller DataFrames rather than our datasets. It's easier to provide proof of concept this way, as well as explain what's going on

Let's start by appending a row to a DataFrame. We can do that by passing in a dictionary to the append function, and setting ignore_index equal to True.

In [ ]:

data = pd.DataFrame({'col1': ['i', 'love', 'pandas', 'so', 'much'],
        'col2': ['so', 'will', 'you', 'i', 'promise']})
data.append({'col1': 'dude', 'col2': 'dude'}, ignore_index=True)

Appending a column is also easy. You can do that by setting a new column name equal to a list or a Series.

In [ ]:

data['col3'] = ['how', 'do', 'you', 'like', 'oscon']
data

However, this will not work if your new column in a different length than the original DataFrame.

In [ ]:

data['col4'] = ['I', 'am', 'too', 'short']
data

Merge¶

You can merge() in different ways, just like joining in SQL. Let's look at an imaginary taco dataset:

In [ ]:

tacos = pd.read_csv('../data/tacos.csv')
tacos

Let's also look at an imaginary taco toppings dataset:

In [ ]:

taco_toppings = pd.read_csv('../data/taco_toppings.csv')
taco_toppings

Notice that we have a unique identifier in each dataset: the name column. We have the same five people. Let's merge these DataFrames together. You don't even need to pass the key to merge; merge() will automatically infer which key to use based on if it exists in both DataFrames.

In [ ]:

pd.merge(tacos, taco_toppings)

By default, merge() performs a left outer join, which means it takes the key from the "left" DataFrame - the DataFrame that is passed in as the first parameter - and matches the right to it.

Generally speaking, full outer joins will join everything as a union, meaning that everything will be joined even if there are missing values; inner joins will join everything as an intersection, meaning that if a value does not appear in a row in a DataFrame, that row will be left out.

Let's look at a couple of other ways of merging. First, let's append a row to our tacos DataFrame.

In [ ]:

tacos = tacos.append({'name': 'Dan', 'restaurant': 'Tres Carnes', 'number_of_tacos': 7, 'score': 3.8}, ignore_index=True)
tacos

Now, let's do a full outer merge.

In [ ]:

pd.merge(tacos, taco_toppings, how='outer')

You can see that the entire tacos DataFrame has been merged, even though 'Dan' does not exist in the taco_toppings DataFrame.

However, if we do the same thing and use a right outer join, we'll only use the keys from the taco_toppings DataFrame and Dan will be left out.

In [ ]:

pd.merge(tacos, taco_toppings, how='right')

Join¶

The join() function gives you a way way to combine DataFrames without needing a key. Taco_extra, which contains data about chips and spiciness level, has no name column.

In [ ]:

taco_extra = pd.read_csv('../data/taco_extra.csv')
taco_extra

It's easy to join this to our taco DataFrame.

In [ ]:

tacos.join(taco_extra)

You can also specify how to join. The default is outer, but we can change it to inner and Dan will be left out again.

In [ ]:

tacos.join(taco_extra, how='inner')

It's possible to join more than two DataFrames at a time. Let's slice off the name column from taco_toppings.

In [ ]:

taco_toppings_noname = taco_toppings.iloc[:, 1:]

taco_toppings_noname

Joining this frame with tacos and taco_extra is as easy as chaining two joins together. Again, it's all an outer join, so even though there's no toppings or extra data for Dan, he's still included in the DataFrame.

In [ ]:

tacos.join(taco_toppings_noname).join(taco_extra)

Lesson: Let's merge some dataframes!¶

In [ ]:

# Can you merge following DataFrames together?
pizza = pd.read_csv('../data/pizza.csv')
pizza_toppings = pd.read_csv('../data/pizza_toppings.csv')

# Merge them here

In [ ]:

# Let's inner merge those DataFrames

In [ ]:

# Let's join pizza to another dataset, pizza_extra
pizza_extra = pd.read_csv('../data/pizza_extra.csv')

In [ ]:

# Let's only join them together where all the data is present

In [ ]:

# Can you join all three dataframes together, first by merging pizza and pizza_toppings, then joining that to pizza_extra?

Pivoting¶

You can pivot in Pandas just like you would in Excel. pivot_table() takes in four requires parameters: the DataFrame, the column to use for the index, the column to use for the columns, and the column to use for the values. pivot_table() also has an aggfunc parameter that defaults to the mean of the values, but you can pass in other functions, just as we did in the agg() function before.

Let's look at the mean weight per model number and number of cylinders combination.

In [ ]:

pd.pivot_table(auto_mpg, values='weight', index='model', columns='cylinders')

If a cell contains NaN, it means that that combination doesn't exist within the DataFrame.

We can pass in multiple column names to the rows and cols parameters. This creates a multiindex.

If we add the origin column to our pivot table, we can look at the average weight of all of the model/origin combinations against the number of cylinders the cars have.

In [ ]:

pd.pivot_table(auto_mpg, values='weight', index=['model', 'origin'], columns='cylinders')

You can apply different aggregate functions to a pivot table. Let's look at the total weight per model/cylinder combination.

In [ ]:

pd.pivot_table(auto_mpg, values='weight', index='model', columns='cylinders', aggfunc='sum')

Lesson: let's pivot!¶

In [ ]:

# Create a pivot_table for credit_approval with column A as the index, column J as the columns, and column H as the values.

In [ ]:

# Now, change the aggfunc to the standard deviation.

In [ ]:

# Finally, can you come up with your own pivot_table?

For those using IPython Notebook/Wakari/NBViewer: Go to the data_analysis notebook!¶

For those using code files, go to data_analysis.py!¶