Objectives:
- Be able to load and save data using pandas
- Be able to access columns, rows, and elements in DataFrames and Series objects
- Be able to perform aggregate computations across different variables
- Be able to filter and subset data
pandas is a Python library that provides tools for processing and manipulating data.
Typically, you will see pandas imported as "pd", which is shorter and therefore easier to type than the full name pandas:
import pandas as pd
First and foremost, pandas gives us a really convenient way to read in data in CSV ("comma-separated values") format. In this lesson, we have two CSV files containing information about precipitation in California. The first file we'll take a look at is precip_monthly.csv, which contains monthly aggregate data:
!head precip_monthly.csv
To load it, we call the read_csv function, and pandas automatically figures out how to read the file for us:
monthly = pd.read_csv("precip_monthly.csv")
monthly
The type of object that is returned is called a "data frame", and is one of two fundamental data types that pandas uses:
type(monthly)
A data frame is essentially a table with labeled rows and columns. By default, when reading in a CSV file, pandas will create a numerical index for the rows, and will use the first row of the CSV as the column names. However, a numerical index isn't necessarily what we want. In this case, it might be more useful for us to have the rows labeled by the station where the data was recorded. To do this, we can use the set_index method of the DataFrame object:
monthly = monthly.set_index('station')
monthly
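To see what set_index does in isolation, here is a minimal sketch using a small hand-built DataFrame in place of the CSV (the station names and values are made up for illustration):

```python
import pandas as pd

# Toy data standing in for precip_monthly.csv (values are made up)
df = pd.DataFrame({
    "station": ["San Jose", "San Jose", "Davis"],
    "month": ["Jan", "Feb", "Jan"],
    "precip": [2.1, 1.8, 3.4],
})

# Move the 'station' column into the row index; note that set_index
# returns a new DataFrame rather than modifying df in place
indexed = df.set_index("station")
print(indexed.index.tolist())  # ['San Jose', 'San Jose', 'Davis']
```

After this call, 'station' is no longer one of the columns; it has become the row labels instead.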
To access a column in a DataFrame, we index into the DataFrame as if it were a dictionary. For example, to get just the precipitation for each station and date:
monthly['precip']
The type of object that is returned is a Series object, which is the 1D equivalent of a DataFrame. We can further index into this Series object, for example, to get the precipitation for one particular station:
monthly['precip']['San Jose']
If we want to access the data the other way around -- i.e., access the row(s) first, and then the 'precip' column -- we need to index slightly differently, using the .loc attribute:
monthly.loc['San Jose']
This returns another DataFrame, which we can then index as we saw earlier:
monthly.loc['San Jose']['precip']
To summarize:
# column indexing --> Series
monthly['precip']
# column, then row indexing --> Series or element
monthly['precip']['San Jose']
# row indexing --> DataFrame or Series
monthly.loc['San Jose']
# row, then column indexing --> Series or element
monthly.loc['San Jose']['precip']
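All four access patterns can be tried out on a small self-contained DataFrame (made-up stations and values, standing in for the monthly data):

```python
import pandas as pd

# Illustrative DataFrame indexed by station (values are made up);
# note 'San Jose' appears twice, while 'Davis' appears once
df = pd.DataFrame(
    {"month": ["Jan", "Feb", "Jan"], "precip": [2.1, 1.8, 3.4]},
    index=pd.Index(["San Jose", "San Jose", "Davis"], name="station"),
)

col = df["precip"]                        # column indexing --> Series
col_then_row = df["precip"]["Davis"]      # column, then row --> element here
rows = df.loc["San Jose"]                 # row indexing --> DataFrame (two rows)
row_then_col = df.loc["Davis"]["precip"]  # row, then column --> element here
```

Whether row indexing gives back a DataFrame or a Series depends on whether the label matches multiple rows ('San Jose' above) or just one ('Davis').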
Load the monthly data again from the CSV file and extract only the data from the month of July.
Hint: Take a look at monthly.head() and monthly.columns to see what data is available to you.
Hint: Use .set_index() to set a certain column as the DataFrame index.
Let's say we want to save out just the San Jose data. We can do this using the .to_csv() method of the DataFrame:
monthly.loc['San Jose'].to_csv('san_jose.csv')
!head san_jose.csv
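As a sketch of the full round trip (using an in-memory buffer in place of a file on disk, and made-up values), whatever to_csv writes, read_csv can load back:

```python
import io
import pandas as pd

# A small illustrative subset (the values here are made up)
df = pd.DataFrame(
    {"month": ["Jan", "Feb"], "precip": [2.1, 1.8]},
    index=pd.Index(["San Jose", "San Jose"], name="station"),
)

# An in-memory buffer stands in for a file like 'san_jose.csv' here
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Reading it back; set_index restores the original row labels
roundtrip = pd.read_csv(buf).set_index("station")
```

Note that the index is written out as an ordinary column, which is why we need set_index again after reading.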
One of the really powerful operations that pandas can do involves splitting the data up into particular groups, performing some operation on each group, and then recombining the results.
For example, how would we compute the total precipitation per year for each station? To do this, we want to:
- Split the data into groups, one for each combination of station and year
- Sum the precipitation within each group
- Combine the results back into a single table
In pandas, this is really easy! First, we'll want to "reset" the index of our DataFrame so that everything is in columns:
monthly = monthly.reset_index()
monthly
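To see reset_index on its own, here is a minimal sketch with a toy DataFrame (made-up stations and values):

```python
import pandas as pd

# A DataFrame whose row labels are station names (values are made up)
df = pd.DataFrame(
    {"precip": [2.1, 3.4]},
    index=pd.Index(["San Jose", "Davis"], name="station"),
)

# reset_index moves the row labels back into an ordinary column,
# leaving a fresh numerical index in their place
flat = df.reset_index()
print(flat.columns.tolist())  # ['station', 'precip']
```

It is the inverse of set_index: the labels become a column again, which is what groupby needs in the next step.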
Now, we use the groupby method to specify which columns should be used to form the groups:
gb = monthly.groupby(['station', 'year'])
gb
We can index into this "groupby" object just like a DataFrame, and select only the precipitation data:
gb['precip']
Finally, we can calculate summary statistics on these groups. For example, a sum of the precipitation each year for each station:
gb['precip'].sum()
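The whole split-apply-combine pipeline fits in a few lines; here is a sketch on toy data (one made-up station over two years):

```python
import pandas as pd

# Made-up monthly values for one station across two years
df = pd.DataFrame({
    "station": ["Davis", "Davis", "Davis", "Davis"],
    "year":    [2000, 2000, 2001, 2001],
    "precip":  [1.0, 2.0, 3.0, 4.0],
})

# Split by (station, year), select 'precip', and sum within each group
totals = df.groupby(["station", "year"])["precip"].sum()
print(totals[("Davis", 2000)])  # 3.0
```

The result is indexed by every (station, year) pair that appears in the data, so indexing with such a pair pulls out one group's total.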
The result is a Series object whose index labels the stations and years. The values of the Series are the total precipitation for the corresponding station and year.
Similar computations follow the same basic recipe. For example, to compute the average precipitation per month:
gb['precip'].mean()
The .apply() method of the groupby object is incredibly powerful, and allows us to perform even more complex computations. We can use .apply() to perform any computation we can write a function for! For example, if we wanted to compute the mean and standard deviation of the precipitation in one go:
def stats(data):
    return pd.Series(
        [data.mean(), data.std()],  # compute the mean and standard deviation of one particular group
        index=['mean', 'stddev'],   # label the computed statistics
        name=data.name              # give a name to the result, so pandas knows how to put
                                    # everything back together
    )
gb['precip'].apply(stats)
If we want to make these statistics (mean and stddev) correspond to columns, rather than an additional level in the index, we can use the .unstack() method:
gb['precip'].apply(stats).unstack()
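Here is the same apply-then-unstack recipe on toy data (one made-up station over two years), so you can see the shapes involved:

```python
import pandas as pd

# Made-up values: two observations per year for one station
df = pd.DataFrame({
    "station": ["Davis"] * 4,
    "year":    [2000, 2000, 2001, 2001],
    "precip":  [1.0, 3.0, 2.0, 4.0],
})

def stats(data):
    # Return the two statistics as a labeled Series for this one group
    return pd.Series([data.mean(), data.std()],
                     index=["mean", "stddev"], name=data.name)

# apply produces a Series indexed by (station, year, statistic);
# unstack moves the innermost level ('mean'/'stddev') into columns
result = df.groupby(["station", "year"])["precip"].apply(stats).unstack()
```

After unstack, each row is a (station, year) pair and the columns are 'mean' and 'stddev'.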
Use the .groupby() method to compute the average yearly precipitation for each region, and modify the resulting DataFrame so that the rows correspond to regions and the columns correspond to years. Store the result in a variable called region_yearly_precip.
We've looked at how to group our data into subsets and how to get subsets using indexing, but there are also other ways to create subsets of data.
Above, we used .set_index to move the 'station' column into the index, then grabbed the data for the 'San Jose' station. Here's another way to do that, using "boolean indexing":
monthly.loc[monthly.station == 'San Jose']
Another way to filter is with the .isin method, which tests whether values are in another set of values.
Say we want all the data for the winter months Dec, Jan, Feb:
monthly.loc[monthly.month.isin(['Dec', 'Jan', 'Feb'])]
And the .query method is useful for more complex queries:
monthly.query('region == "SAN FRANCISCO BAY" and month in ["Dec", "Jan", "Feb"]')
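The three filtering styles can all express the same selection; here is a side-by-side sketch on toy data (made-up stations, months, and values):

```python
import pandas as pd

# Made-up rows to compare the three filtering styles
df = pd.DataFrame({
    "station": ["Davis", "San Jose", "Davis"],
    "month":   ["Dec", "Jul", "Feb"],
    "precip":  [3.0, 0.1, 2.0],
})

by_bool  = df.loc[df.station == "Davis"]                    # boolean indexing
by_isin  = df.loc[df.month.isin(["Dec", "Jan", "Feb"])]     # membership test
by_query = df.query('station == "Davis" and month in ["Dec", "Feb"]')

# All three happen to select the same two winter Davis rows here
```

Boolean indexing composes well with other conditions (via & and |), .isin is the most direct way to match against a list, and .query keeps long multi-condition filters readable.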
Use any of the above filtering methods to filter the monthly data to only those rows from the 'COLORADO RIVER' subregion.
(Bonus: Can you further filter the data so it's only for the years 1999 - 2003?)