Complex data selections with multiple valid values using Pandas¶

¶

Main concepts covered:¶

How to use the cross-section (.xs) method of a dataframe to select data where N seperate levels of the index are equal to N seperate values
How to use the .select method of a dataframe to select data where N levels of the index meet arbitrary criteria

First, let's load some fake data, and set an index¶

Read the data using the top-level read_csv function.
Define the index like we would a primary key in a database.
Also pass in a string representation of the values that indicated missing/NA data

In [1]:

import StringIO
import pandas
from IPython.display import HTML

data_string = StringIO.StringIO("""location,storm,station,parameter,result,qual
Airport,1,Inflow,Lead,4.1,J
Airport,1,Inflow,TSS,2.2,=
Airport,1,Outflow,Lead,1.3,=
Airport,1,Outflow,TSS,9.4,=
Airport,2,Inflow,Lead,2.5,=
Airport,2,Inflow,TSS,6.6,=
Airport,2,Outflow,Lead,1.7,=
Airport,2,Outflow,TSS,2.8,=
Airport,3,Inflow,Lead,1.9,J
Airport,3,Inflow,TSS,8.0,=
Airport,3,Outflow,Lead,1.1,=
Airport,3,Outflow,TSS,8.2,=
Airport,4,Inflow,Lead,-999,-999
Airport,4,Inflow,TSS,10.4,=
Airport,4,Outflow,Lead,1.5,=
Airport,4,Outflow,TSS,6.6,=
Airport,5,Inflow,Lead,-999,-999
Airport,5,Inflow,TSS,7.8,J
Airport,5,Outflow,Lead,1.9,J
Airport,5,Outflow,TSS,2.0,=
School,1,Inflow,Lead,1.1,=
School,1,Inflow,TSS,10.2,=
School,1,Outflow,Lead,1.3,=
School,1,Outflow,TSS,10.4,=
School,2,Inflow,Lead,1.5,J
School,2,Inflow,TSS,2.6,=
School,2,Outflow,Lead,2.7,=
School,2,Outflow,TSS,9.8,=
School,3,Inflow,Lead,1.9,=
School,3,Inflow,TSS,-999,=
School,3,Outflow,Lead,2.1,=
School,3,Outflow,TSS,4.2,=
Business,1,Inflow,Lead,-999,-999
Business,1,Inflow,TSS,7.4,=
Business,1,Outflow,Lead,1.5,=
Business,1,Outflow,TSS,9.6,=
Business,2,Inflow,Lead,4.7,J
Business,2,Inflow,TSS,8.8,=
Business,2,Outflow,Lead,3.9,J
Business,2,Outflow,TSS,3.0,=
Business,3,Inflow,Lead,1.1,J
Business,3,Inflow,TSS,5.2,=
Business,3,Outflow,Lead,2.3,J
Business,3,Outflow,TSS,5.4,=
""")
raw_data = pandas.read_csv(data_string, index_col=['location', 'storm', 'station', 'parameter'], na_values='-999')
HTML(raw_data.to_html())

Out[1]:

				result	qual
location	storm	station	parameter
Airport	1	Inflow	Lead	4.1	J
		Inflow	TSS	2.2	=
		Outflow	Lead	1.3	=
		Outflow	TSS	9.4	=
	2	Inflow	Lead	2.5	=
		Inflow	TSS	6.6	=
		Outflow	Lead	1.7	=
		Outflow	TSS	2.8	=
	3	Inflow	Lead	1.9	J
		Inflow	TSS	8.0	=
		Outflow	Lead	1.1	=
		Outflow	TSS	8.2	=
	4	Inflow	Lead	NaN	NaN
		Inflow	TSS	10.4	=
		Outflow	Lead	1.5	=
		Outflow	TSS	6.6	=
	5	Inflow	Lead	NaN	NaN
		Inflow	TSS	7.8	J
		Outflow	Lead	1.9	J
		Outflow	TSS	2.0	=
School	1	Inflow	Lead	1.1	=
		Inflow	TSS	10.2	=
		Outflow	Lead	1.3	=
		Outflow	TSS	10.4	=
	2	Inflow	Lead	1.5	J
		Inflow	TSS	2.6	=
		Outflow	Lead	2.7	=
		Outflow	TSS	9.8	=
	3	Inflow	Lead	1.9	=
		Inflow	TSS	NaN	=
		Outflow	Lead	2.1	=
		Outflow	TSS	4.2	=
Business	1	Inflow	Lead	NaN	NaN
		Inflow	TSS	7.4	=
		Outflow	Lead	1.5	=
		Outflow	TSS	9.6	=
	2	Inflow	Lead	4.7	J
		Inflow	TSS	8.8	=
		Outflow	Lead	3.9	J
		Outflow	TSS	3.0	=
	3	Inflow	Lead	1.1	J
		Inflow	TSS	5.2	=
		Outflow	Lead	2.3	J
		Outflow	TSS	5.4	=

Simple Demos¶

Cross-sections (`.xs`)¶

The cross section method can apply single-equality criteria to any number of levels of an index at the same time (and criteria, but not or).

These are very efficient and optimized in pandas and act as vectorized operation on the index

It takes the form:

subset = dataframe.xs([val1, val2, ... , valN], level=[col1, col2, ..., colN]) where col1 through colN are valid index levels

One thing to notice about taking cross sections is that the index levels against which you apply the criteria are dropped (since they are now a single, known value). This is not the case with selections using .select() (more on that later).

So for this, let's select all of the TSS data at the Business¶

If we were speaking SQL, we'd say:

select * 
from raw_data
where location = 'Business'
  and parameter = 'TSS'

Here it is in pandas:

In [2]:

business_tss = raw_data.xs(['Business', 'TSS'], level=['location', 'parameter'])
business_tss

Out[2]:

		result	qual
storm	station
1	Inflow	7.4	=
1	Outflow	9.6	=
2	Inflow	8.8	=
2	Outflow	3.0	=
3	Inflow	5.2	=
3	Outflow	5.4	=

Selection (`.select`)¶

Selections allow us to more arbitrarily apply criteria to the index. This is less effiecient because it has to loop through each row of the index and apply a user-defined seletion function to each row.

To use a selection you have to define a function that will accept a row from the index, inspect it, and decide to either return True if you want the data or False if you don't.

So let's pull out effluent lead data from either buisness or the airport when the storm was number 3 or higher and influent TSS data from the school. Again, in SQL we'd write:

select *
from raw_data
where (
    location in ('aiport', 'business')
and storm >= 3
and station = 'Outflow'
and parameter = 'Lead'
) or (
    location = 'school'
and station = 'Outflow'
and parameter = 'TSS'
)

And in pandas...

In [3]:

def criteria(row):
    if (row[0] in ['Airport', 'Business']) & (row[1] >= 3) & (row[2] == 'Outflow') & (row[3] == 'Lead'):
        return True
    elif (row[0] == 'School') & (row[2] == 'Inflow') & (row[3] == 'TSS'):
        return True
    else:
        return False
    
subset = raw_data.select(criteria)
subset 

Out[3]:

				result	qual
location	storm	station	parameter
Airport	3	Outflow	Lead	1.1	=
	4	Outflow	Lead	1.5	=
	5	Outflow	Lead	1.9	J
School	1	Inflow	TSS	10.2	=
	2	Inflow	TSS	2.6	=
	3	Inflow	TSS	NaN	=
Business	3	Outflow	Lead	2.3	J

Advanced Demo¶

Premise:¶

We have:¶

Influent and effluent TSS and lead concentrations measured at 3 different location over the course of multiple storms.

We want:¶

Median influent lead concentration for each location with more than 3 valid datapoints

Approach¶

We cross-section the rows using the .xs method of the dataframe for the parameter and monitoring locations
We'll also drop the missing values b/c we don't want them with .dropna
Aggregate the data by the location column to get the number of results
Subset the data further by selecting the rows where the data aggregated on location has more than three results

In [4]:

# select out just the influent lead data
lead = raw_data.xs(['Lead', 'Inflow'], level=['parameter', 'station'], axis=0)

# remove rows missing data
lead = lead.dropna()

lead

Out[4]:

		result	qual
location	storm
Airport	1	4.1	J
	2	2.5	=
	3	1.9	J
School	1	1.1	=
	2	1.5	J
	3	1.9	=
Business	2	4.7	J
Business	3	1.1	J

Now we need to count how many valid results we have at each `location`¶

Just like in a database, we use a `groupby` statement¶

In [5]:

all_lead_studies = lead.groupby(level='location').size()

all_lead_studies

Out[5]:

location
Airport     3
Business    2
School      3
dtype: int64

Then we can select all the BMPs with more than 3 results¶

In [6]:

good_lead_studies = all_lead_studies[all_lead_studies >= 3]

good_lead_studies

Out[6]:

location
Airport     3
School      3
dtype: int64

Here is where the really fancy stuff happens.¶

With the .xs method, we can really efficiently (computer time) select out single values from an index. But what if we need more? That's where the .select method comes in. .select takes in a function that looks at each row of the index, and based on the values at that row, returns either True or False. Where that function returns True, we get data back. Where that function function returns False, data are omitted.

In [7]:

def is_bmp_valid(index_row):
    if index_row[0] in good_lead_studies:
        return True
    else:
        return False

lead = lead.select(is_bmp_valid)

## if you know what lambdas are, you can also do:
# lead = lead.select(lambda x: x[0] in good_lead_studies)
lead

Out[7]:

		result	qual
location	storm
Airport	1	4.1	J
	2	2.5	=
	3	1.9	J
School	1	1.1	=
	2	1.5	J
	3	1.9	=

Now we can compute the median of the whole dataset en masse¶

In [8]:

lead.median()

Out[8]:

result    1.9
dtype: float64

Or look at each `location` individually¶

In [9]:

lead.groupby(level='location').median()

Out[9]:

	result
location
Airport	2.5
School	1.5