.xs
) method of a dataframe to select data where N seperate levels of the index are equal to N seperate values.select
method of a dataframe to select data where N levels of the index meet arbitrary criteriaread_csv
function.import StringIO
import pandas
from IPython.display import HTML
data_string = StringIO.StringIO("""location,storm,station,parameter,result,qual
Airport,1,Inflow,Lead,4.1,J
Airport,1,Inflow,TSS,2.2,=
Airport,1,Outflow,Lead,1.3,=
Airport,1,Outflow,TSS,9.4,=
Airport,2,Inflow,Lead,2.5,=
Airport,2,Inflow,TSS,6.6,=
Airport,2,Outflow,Lead,1.7,=
Airport,2,Outflow,TSS,2.8,=
Airport,3,Inflow,Lead,1.9,J
Airport,3,Inflow,TSS,8.0,=
Airport,3,Outflow,Lead,1.1,=
Airport,3,Outflow,TSS,8.2,=
Airport,4,Inflow,Lead,-999,-999
Airport,4,Inflow,TSS,10.4,=
Airport,4,Outflow,Lead,1.5,=
Airport,4,Outflow,TSS,6.6,=
Airport,5,Inflow,Lead,-999,-999
Airport,5,Inflow,TSS,7.8,J
Airport,5,Outflow,Lead,1.9,J
Airport,5,Outflow,TSS,2.0,=
School,1,Inflow,Lead,1.1,=
School,1,Inflow,TSS,10.2,=
School,1,Outflow,Lead,1.3,=
School,1,Outflow,TSS,10.4,=
School,2,Inflow,Lead,1.5,J
School,2,Inflow,TSS,2.6,=
School,2,Outflow,Lead,2.7,=
School,2,Outflow,TSS,9.8,=
School,3,Inflow,Lead,1.9,=
School,3,Inflow,TSS,-999,=
School,3,Outflow,Lead,2.1,=
School,3,Outflow,TSS,4.2,=
Business,1,Inflow,Lead,-999,-999
Business,1,Inflow,TSS,7.4,=
Business,1,Outflow,Lead,1.5,=
Business,1,Outflow,TSS,9.6,=
Business,2,Inflow,Lead,4.7,J
Business,2,Inflow,TSS,8.8,=
Business,2,Outflow,Lead,3.9,J
Business,2,Outflow,TSS,3.0,=
Business,3,Inflow,Lead,1.1,J
Business,3,Inflow,TSS,5.2,=
Business,3,Outflow,Lead,2.3,J
Business,3,Outflow,TSS,5.4,=
""")
raw_data = pandas.read_csv(data_string, index_col=['location', 'storm', 'station', 'parameter'], na_values='-999')
HTML(raw_data.to_html())
result | qual | ||||
---|---|---|---|---|---|
location | storm | station | parameter | ||
Airport | 1 | Inflow | Lead | 4.1 | J |
TSS | 2.2 | = | |||
Outflow | Lead | 1.3 | = | ||
TSS | 9.4 | = | |||
2 | Inflow | Lead | 2.5 | = | |
TSS | 6.6 | = | |||
Outflow | Lead | 1.7 | = | ||
TSS | 2.8 | = | |||
3 | Inflow | Lead | 1.9 | J | |
TSS | 8.0 | = | |||
Outflow | Lead | 1.1 | = | ||
TSS | 8.2 | = | |||
4 | Inflow | Lead | NaN | NaN | |
TSS | 10.4 | = | |||
Outflow | Lead | 1.5 | = | ||
TSS | 6.6 | = | |||
5 | Inflow | Lead | NaN | NaN | |
TSS | 7.8 | J | |||
Outflow | Lead | 1.9 | J | ||
TSS | 2.0 | = | |||
School | 1 | Inflow | Lead | 1.1 | = |
TSS | 10.2 | = | |||
Outflow | Lead | 1.3 | = | ||
TSS | 10.4 | = | |||
2 | Inflow | Lead | 1.5 | J | |
TSS | 2.6 | = | |||
Outflow | Lead | 2.7 | = | ||
TSS | 9.8 | = | |||
3 | Inflow | Lead | 1.9 | = | |
TSS | NaN | = | |||
Outflow | Lead | 2.1 | = | ||
TSS | 4.2 | = | |||
Business | 1 | Inflow | Lead | NaN | NaN |
TSS | 7.4 | = | |||
Outflow | Lead | 1.5 | = | ||
TSS | 9.6 | = | |||
2 | Inflow | Lead | 4.7 | J | |
TSS | 8.8 | = | |||
Outflow | Lead | 3.9 | J | ||
TSS | 3.0 | = | |||
3 | Inflow | Lead | 1.1 | J | |
TSS | 5.2 | = | |||
Outflow | Lead | 2.3 | J | ||
TSS | 5.4 | = |
.xs
)¶The cross section method can apply single-equality criteria to any number of levels of an index at the same time (and criteria, but not or).
These are very efficient and optimized in pandas and act as vectorized operation on the index
It takes the form:
subset = dataframe.xs([val1, val2, ... , valN], level=[col1, col2, ..., colN])
where col1
through colN
are valid index levels
One thing to notice about taking cross sections is that the index levels against which you apply the criteria are dropped (since they are now a single, known value). This is not the case with selections using .select()
(more on that later).
If we were speaking SQL, we'd say:
select *
from raw_data
where location = 'Business'
and parameter = 'TSS'
Here it is in pandas:
business_tss = raw_data.xs(['Business', 'TSS'], level=['location', 'parameter'])
business_tss
result | qual | ||
---|---|---|---|
storm | station | ||
1 | Inflow | 7.4 | = |
Outflow | 9.6 | = | |
2 | Inflow | 8.8 | = |
Outflow | 3.0 | = | |
3 | Inflow | 5.2 | = |
Outflow | 5.4 | = |
.select
)¶Selections allow us to more arbitrarily apply criteria to the index. This is less effiecient because it has to loop through each row of the index and apply a user-defined seletion function to each row.
To use a selection you have to define a function that will accept a row from the index, inspect it, and decide to either return True
if you want the data or False
if you don't.
So let's pull out effluent lead data from either buisness or the airport when the storm was number 3 or higher and influent TSS data from the school. Again, in SQL we'd write:
select *
from raw_data
where (
location in ('aiport', 'business')
and storm >= 3
and station = 'Outflow'
and parameter = 'Lead'
) or (
location = 'school'
and station = 'Outflow'
and parameter = 'TSS'
)
And in pandas...
def criteria(row):
if (row[0] in ['Airport', 'Business']) & (row[1] >= 3) & (row[2] == 'Outflow') & (row[3] == 'Lead'):
return True
elif (row[0] == 'School') & (row[2] == 'Inflow') & (row[3] == 'TSS'):
return True
else:
return False
subset = raw_data.select(criteria)
subset
result | qual | ||||
---|---|---|---|---|---|
location | storm | station | parameter | ||
Airport | 3 | Outflow | Lead | 1.1 | = |
4 | Outflow | Lead | 1.5 | = | |
5 | Outflow | Lead | 1.9 | J | |
School | 1 | Inflow | TSS | 10.2 | = |
2 | Inflow | TSS | 2.6 | = | |
3 | Inflow | TSS | NaN | = | |
Business | 3 | Outflow | Lead | 2.3 | J |
Influent and effluent TSS and lead concentrations measured at 3 different location over the course of multiple storms.
Median influent lead concentration for each location with more than 3 valid datapoints
.xs
method of the dataframe for the parameter and monitoring locations.dropna
location
column to get the number of resultslocation
has more than three results# select out just the influent lead data
lead = raw_data.xs(['Lead', 'Inflow'], level=['parameter', 'station'], axis=0)
# remove rows missing data
lead = lead.dropna()
lead
result | qual | ||
---|---|---|---|
location | storm | ||
Airport | 1 | 4.1 | J |
2 | 2.5 | = | |
3 | 1.9 | J | |
School | 1 | 1.1 | = |
2 | 1.5 | J | |
3 | 1.9 | = | |
Business | 2 | 4.7 | J |
3 | 1.1 | J |
all_lead_studies = lead.groupby(level='location').size()
all_lead_studies
location Airport 3 Business 2 School 3 dtype: int64
good_lead_studies = all_lead_studies[all_lead_studies >= 3]
good_lead_studies
location Airport 3 School 3 dtype: int64
With the .xs
method, we can really efficiently (computer time) select out single values from an index. But what if we need more?
That's where the .select
method comes in. .select
takes in a function that looks at each row of the index, and based on the values at that row, returns either True
or False
. Where that function returns True
, we get data back. Where that function function returns False
, data are omitted.
def is_bmp_valid(index_row):
if index_row[0] in good_lead_studies:
return True
else:
return False
lead = lead.select(is_bmp_valid)
## if you know what lambdas are, you can also do:
# lead = lead.select(lambda x: x[0] in good_lead_studies)
lead
result | qual | ||
---|---|---|---|
location | storm | ||
Airport | 1 | 4.1 | J |
2 | 2.5 | = | |
3 | 1.9 | J | |
School | 1 | 1.1 | = |
2 | 1.5 | J | |
3 | 1.9 | = |
lead.median()
result 1.9 dtype: float64
location
individually¶lead.groupby(level='location').median()
result | |
---|---|
location | |
Airport | 2.5 |
School | 1.5 |