In this notebook, you will have an opportunity to explore several different ways of merging data using pandas and pandasql.
As you will see, the pandas DataFrame() is a very flexible data structure that supports many operations in memory that have traditionally been performed by database management systems.
The data we will use for this activity comes from the Department for Communities and Local Government Open Data Communities website. Two sorts of data have been downloaded - information about the average weekly social rent of new PRP (Private Registered Providers) general needs lettings for 2012/13 and data relating to housebuilding, in particular the permanent dwellings started from 2009/10 to 2012/13.
!ls data/housingdata
The house building data files all have a similar form:
!head data/housingdata/house-building-starts-tenure-2009-2010.csv
!head data/housingdata/house-building-starts-tenure-2010-2011.csv
The lettings data uses a scheme for identifying the different local authorities:
!head data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv
I have also pulled down a file from the Ordnance Survey that contains a list of geographical areas within the Yorkshire and the Humber region, some of which are local councils and some of which aren't... Note that the data that identifies each authority appears to resemble that used in the DCLG data files but does not match exactly.
!head -n 5 data/housingdata/yorksAndHumberside.csv
We can load the data in from the CSV files using the pandas read_csv() function. For the housing data, we need to skip the first five lines (I counted!) of the file before accepting the header.
import pandas as pd
#Read in some of the data
bldg_2009_10=pd.read_csv('data/housingdata/house-building-starts-tenure-2009-2010.csv',skiprows=5)
bldg_2010_11=pd.read_csv('data/housingdata/house-building-starts-tenure-2010-2011.csv',skiprows=5)
#Preview the data we have loaded
bldg_2009_10[:5]
#YOUR TURN
#Import the remaining house building files into separate dataframes
#As an additional exercise, you may want to try to develop a script to automate the loading of the data from the separate files
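For the automation exercise, here is one possible sketch. The filename pattern is taken from the `ls` listing above; the helper function name and the dict-of-dataframes layout are my own choices:

```python
import os
import pandas as pd

def building_filenames(first=2009, last=2012, dirname='data/housingdata'):
    #Build the run of annual filenames following the pattern seen in the ls listing
    template = 'house-building-starts-tenure-{}-{}.csv'
    return [os.path.join(dirname, template.format(year, year + 1))
            for year in range(first, last + 1)]

#Load whichever of the files actually exist into a dict keyed by period
frames = {}
for fname in building_filenames():
    if os.path.exists(fname):
        period = fname.split('tenure-')[-1].replace('.csv', '')
        frames[period] = pd.read_csv(fname, skiprows=5)
```

Each dataframe can then be looked up by period, for example frames['2009-2010'].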
Suppose we want to work with a single dataframe that contains annualised house building starts data over the period 2009-2013. The pandas concat() function will concatenate rows from a list of dataframes where each dataframe shares the same column headings.
Let's create a couple of samples from the tables just to try this function out.
#Just use a sample of the data rows for now as we develop the code
sample1=bldg_2009_10[:3]
sample2=bldg_2010_11[:3]
sample2
We can visualise the shape of the sample dataframe using a wrapper around ipythonblocks.
import pBlocks_demo as pb
pb.pBlockGrid3(sample1)
This shows us that there are 6 columns and 3 rows in the dataset. The columns are coloured according to the datatype of each column.
#Try out the .concat() function - pass in a list of dataframes to be concatenated
pd.concat( [sample1, sample2] )
That should have worked okay... We can visualise what has happened by colouring the rows that have been joined from each dataframe.
#The pBlockGrid is still all experimental - we should pass a list here for example...
df_concat,b=pb.pBlockGrid5(sample1, sample2)
b
This shows us that the concatenated table has 6 columns and 6 rows, with 3 rows from the first table and 3 rows from the second.
What happens when we try to concatenate the two complete dataframes?
bldg_2009_11=pd.concat([bldg_2009_10,bldg_2010_11])
#Check to see if the dataframes appear to have been concatenated together by inspecting row counts
print(len(bldg_2009_10),len(bldg_2010_11),len(bldg_2009_11))
What happens if the dataframes have the same column names, but they appear in a different order?
#Create a sample dataframe containing the same columns as the original but in a different order
sample3=bldg_2009_10[['Reference area','All','Housing-Associations','http://opendatacommunities.org/def/ontology/geography/refArea','Local-Authority','Private-Enterprise']][:3]
sample3
#Let's look at the shape and datatype structure of sample1
pb.pBlockGrid3(sample1)
#And sample3? We see the columns are differently ordered in terms of datatype
pb.pBlockGrid3(sample3)
#Concatenate some sample dataframes with the same columns, but differently ordered
concat_difforder = pd.concat([sample1,sample3])
concat_difforder
#What is the datatype structure of the resulting concatenated datatable?
pb.pBlockGrid3(concat_difforder)
pandas is capable of automatically aligning the columns from such data frames.
What happens if we try to concatenate dataframes in which the dataframes only partially share columns?
#Create a sample dataframe that contains only a subset of the columns from an original dataframe
sample4=bldg_2009_10[['Reference area','All','Housing-Associations']][:3]
sample4
#Concatenate two data frames with non-identical columns
concat_diffcolumns = pd.concat([sample1,sample4])
concat_diffcolumns
We can colour any cells that have missing values to show the 'holes' in the resulting dataframe:
#Here I am going to "colour" the missing cells white.
#The colour field accepts (red, green, blue) values in the range 0..255 each
pb.pBlockGrid4(concat_diffcolumns,color_NA=(255,255,255))
The concat() function aligns columns where it can. By default, the columns in the combined dataframe are the superset of distinctly named columns across the concatenated dataframes. Missing values are given a NaN value.
This form of concatenation is an outer join in the sense that we are producing a set of columns in the output that represent the union of columns contained in the concatenated datasets.
The concat() function uses the outer style join by default, but we can also force it to adopt an inner join behaviour, in which the columns in the output dataframe correspond to the intersection, rather than the union, of the columns from the input dataframes.
#Explicitly use an INNER join ("inner") on the concatenation; "outer" is the default value
concat_inner = pd.concat([sample1,sample4],join='inner')
concat_inner
#pBlockGrid TO DO
pb.pBlockGrid5(sample1,sample4,join='inner')[1]
Exercise: What problems, if any, can you see in interpreting the data in any of the concatenated data sets produced above, and how might they be resolved?
Answer: although the data items represent reports from different years, we have lost that information. The year each report refers to is not encoded in the actual rows of data. If we add an additional column to each dataset as it is loaded in that contains the year the report relates to, we can carry that information into the concatenated data set.
So how might we add in an additional data column that identifies the period the data relates to before we concatenate the separate data frames?
#YOUR ATTEMPT HERE
Here's how I did it:
#Now add a Period column to each annual building dataframe that contains an appropriate value
#Create a single dataframe containing all the housebuilding data distinguishable by period
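A minimal sketch of the approach, using toy stand-ins for the annual dataframes so the cell is self-contained (in the notebook you would use the real bldg_2009_10, bldg_2010_11 and so on):

```python
import pandas as pd

#Toy stand-ins for the annual building dataframes loaded earlier
bldg_2009_10 = pd.DataFrame({'Reference area': ['A', 'B'], 'All': [10, 20]})
bldg_2010_11 = pd.DataFrame({'Reference area': ['A', 'B'], 'All': [12, 22]})

#Tag each annual dataframe with the period it covers...
annual = {'2009/10': bldg_2009_10, '2010/11': bldg_2010_11}
for period, df in annual.items():
    df['Period'] = period

#...then concatenate the annual tables into one long dataframe
bldg_all = pd.concat(annual.values(), ignore_index=True)
```

The Period column now lets us tell apart rows that came from different annual reports after concatenation.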
By inspection of the building start data and the lettings data, we see that data elements are keyed on common columns - geographical reference area codes, and names.
bldg_2012_13 = pd.read_csv('data/housingdata/house-building-starts-tenure-2012-2013.csv',skiprows=5)
bldgSample=bldg_2012_13[:3]
bldgSample
lettings_2012_13=pd.read_csv('data/housingdata/households-social-lettings-general-needs-rents-prp-number-bedrooms-2012-2013.csv',skiprows=5)
lettingsSample=lettings_2012_13[:3]
lettingsSample
It is straightforward to merge the tables horizontally using the pandas .merge() function. The first two arguments specify the datatables to be merged. Where the columns that act as the focus for merging share the same name, we can specify them in a list assigned to the on parameter.
simplemerge=pd.merge(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'])
simplemerge
We can visualise the effects of the merge by colouring columns according to whether they are the merge key columns (green), left table (red) or right table (blue).
pb.pBlockGrid6(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'])[1]
Here we see that the resulting dataframe has 11 columns, with 2 key columns in green, 4 uniquely contributed columns from the left table in red and 5 uniquely contributed columns from the right table in blue.
Note that we could also have merged the dataframes on a single column. In this case, duplicate columns are brought into the merged result separately, and pandas automatically appends a suffix to each one so it remains uniquely labelled in the resulting data frame (so, for example, we get Reference area_x and Reference area_y in the result).
pd.merge(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea'])
pb.pBlockGrid6(bldgSample,lettingsSample,on=['http://opendatacommunities.org/def/ontology/geography/refArea'])[1]
This time we have only a single key, with 5 uniquely named columns from the left table and 6 from the right.
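If the default _x/_y labels are unhelpful, .merge() also accepts a suffixes parameter to control them. A sketch with hypothetical toy tables:

```python
import pandas as pd

#Toy tables sharing a key column and a duplicated 'Reference area' column
left = pd.DataFrame({'refArea': ['a1', 'a2'],
                     'Reference area': ['Leeds', 'York'],
                     'All': [10, 20]})
right = pd.DataFrame({'refArea': ['a1', 'a2'],
                      'Reference area': ['Leeds', 'York'],
                      'Total rent': [80.0, 95.0]})

#Merging on the single key column duplicates 'Reference area';
#suffixes= sets the disambiguating labels instead of the default _x/_y
merged = pd.merge(left, right, on=['refArea'], suffixes=('_bldg', '_lettings'))
```

The duplicated column appears once per input table, labelled Reference area_bldg and Reference area_lettings.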
If the column names are differently labelled, we can specify them explicitly for each datatable.
#Rename one of the merge columns in one table
lettingsSample.columns=['Ref Area Code']+lettingsSample.columns[1:].tolist()
lettingsSample
We can explicitly declare the columns we want to merge on from each table using the left_on and right_on parameters (I find these confusing, and would have expected on_left and on_right). For the merge to work, these parameters need to identify the same number of columns in the same order.
pd.merge( bldgSample, lettingsSample,
left_on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
right_on=['Ref Area Code','Reference area'])
#pBlockGrid - TO DO - cope with left/right keys; colour them separately?
The default behaviour of merge is an inner join (how='inner'), where the results table is formed from the intersection of the joined key column values. Consider the example where one table has additional rows:
bldgSample_long=bldg_2012_13[:4] #4 rows compared to 3 in the lettings sample
bldgSample_long.columns=['Ref Area Code']+bldgSample_long.columns[1:].tolist()
pd.merge(bldgSample_long,lettingsSample,on=['Ref Area Code','Reference area'])
pb.pBlockGrid6(bldgSample_long,lettingsSample,on=['Ref Area Code','Reference area'])[1]
We can use pandasql to achieve an inner join using the following form of SELECT statement:
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
#At the moment, we need to hack our way round dataframe column names that don't work as sqlite3 column names.
bldgSample_sqldf=bldgSample.copy()
bldgSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)
lettingsSample_sqldf=lettingsSample.copy()
lettingsSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)
q='''
SELECT * FROM bldgSample_sqldf INNER JOIN lettingsSample_sqldf
ON bldgSample_sqldf.Reference_area = lettingsSample_sqldf.Reference_area ;
'''
pysqldf(q)
Where the column names being used to match row items are the same across dataframes, we can use the following alternative syntax:
q='''
SELECT * FROM bldgSample_sqldf INNER JOIN lettingsSample_sqldf
USING(Reference_area) ;
'''
pysqldf(q)
In a left join we keep all the rows from the left table, together with the matching rows from the right.
Let's generate a long sample from the lettings data but include some different reference areas compared to the building start data, by taking data from the top and the bottom of the original data frame.
lettingsSample_long=pd.concat( [ lettings_2012_13[:2], lettings_2012_13[-2:] ])
lettingsSample_long
#What happens if you try to inner join bldgSample and lettingsSample_long ?
Now try a left join, by setting how='left'. What happens to the columns from the right-hand table for the unmatched rows from the left table?
pd.merge( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='left')
pb.pBlockGrid6( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='left')[1]
Here we see the two key columns, 4 unique columns from the left table and 5 unique columns from the right. The final row shows only missing values in the unique right-hand columns because there was no match on the corresponding key values.
bldgSample_sqldf=bldgSample.copy()
bldgSample_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)
lettingsSample_long_sqldf=lettingsSample_long.copy()
lettingsSample_long_sqldf.rename(columns=lambda x: x.replace(' ','_'), inplace=True)
q='''
SELECT * FROM bldgSample_sqldf LEFT JOIN lettingsSample_long_sqldf
USING(Reference_area);
'''
pysqldf(q)
Unsurprisingly, a right join is achieved by setting how='right'. What happens to the columns from the left-hand table for the unmatched rows from the right table?
pd.merge( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='right')
pb.pBlockGrid6( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='right')[1]
Although standard SQL does define a RIGHT JOIN, pandasql does not implement it. However, we can simulate it by swapping the order of the dataframes/tables passed to the join and using a LEFT JOIN instead.
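The same swap trick can be sanity-checked in pandas itself: a right join of two tables keeps the same rows as a left join with the tables swapped (toy tables and column names of my own devising):

```python
import pandas as pd

#Toy tables: 'B' appears in both, 'A' only on the left, 'C' only on the right
left = pd.DataFrame({'area': ['A', 'B'], 'starts': [10, 20]})
right = pd.DataFrame({'area': ['B', 'C'], 'rent': [80.0, 90.0]})

#A right join of (left, right)...
right_join = pd.merge(left, right, on='area', how='right')

#...keeps the same rows as a left join with the table order swapped
swapped_left = pd.merge(right, left, on='area', how='left')
```

The two results differ only in row and column ordering.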
A full outer join can be achieved by setting how='outer'. What happens to the unmatched rows from each table?
pd.merge( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='outer')
pb.pBlockGrid6( bldgSample, lettingsSample_long,
on=['http://opendatacommunities.org/def/ontology/geography/refArea','Reference area'],
how='outer')[1]
Although standard SQL does define a FULL OUTER JOIN, pandasql does not implement it.
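A common workaround in SQLite (the engine pandasql uses behind the scenes) is to UNION two LEFT JOINs, one in each table order. A sketch using the sqlite3 standard library directly, with toy tables of my own devising:

```python
import sqlite3
import pandas as pd

#Toy tables: 'A' appears only on the left, 'C' only on the right
left = pd.DataFrame({'area': ['A', 'B'], 'starts': [10, 20]})
right = pd.DataFrame({'area': ['B', 'C'], 'rent': [80.0, 90.0]})

conn = sqlite3.connect(':memory:')
left.to_sql('left_t', conn, index=False)
right.to_sql('right_t', conn, index=False)

#Emulate a FULL OUTER JOIN: LEFT JOIN in both orders, then UNION
#(UNION also removes the duplicated matching rows)
q = '''
SELECT area, starts, rent FROM left_t LEFT JOIN right_t USING(area)
UNION
SELECT area, starts, rent FROM right_t LEFT JOIN left_t USING(area);
'''
full_outer = pd.read_sql_query(q, conn)
```

The result contains one row per distinct key value, with NULLs where a key appeared in only one table.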
Let's generate a sample data frame that has several rows containing the same reference area so we can explore what happens when we try to merge a dataframe with one unique reference area per row with a dataframe where there may be multiple rows matching the key values.
bldg_sample_mixed = pd.concat([ bldg_2009_10[:2], bldg_2012_13[:2] ])
bldg_sample_mixed
What happens for the various joins (inner, left, right, outer) when applied to bldg_sample_mixed and lettingsSample_long?
#Your experiments here... inner join on bldg_sample_mixed and lettingsSample_long
##I'm not sure the pBlocks stuff handles this sensibly yet? What would we want the blocks to display?
#Your experiments here... left join on bldg_sample_mixed and lettingsSample_long
#Your experiments here... right join on bldg_sample_mixed and lettingsSample_long
#Your experiments here... outer join on bldg_sample_mixed and lettingsSample_long
Exercise: reflect on the challenges, issues and possible problems that arise when merging dataframes in these different ways.
Where a common identifier scheme is used to identify the same element or entity that is represented in several datasets, it is easy enough to merge the datasets using the column that contains the common identifiers. In the above examples, we were able to merge data about housing build starts and letting prices across UK administrative areas using the reference area names and/or codes.
In some cases, however, the identifiers used in one dataset may only partially match the identifiers in another. Sometimes, it is possible for us to recreate the identifiers used in one scheme from the identifiers used in another. For example, given a reference area code such as E06000001 we can generate the identifier http://statistics.data.gov.uk/id/statistical-geography/E06000001. The identifier follows the pattern http://statistics.data.gov.uk/id/statistical-geography/AREACODE, so given an AREACODE we can recreate the identifier.
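This sort of reconstruction is straightforward with pandas' vectorised string operations; a sketch on a toy column of area codes:

```python
import pandas as pd

#Toy frame of DCLG-style area codes
df = pd.DataFrame({'areaCode': ['E06000001', 'E06000002']})

#Rebuild the statistics.data.gov.uk identifier from the area code...
prefix = 'http://statistics.data.gov.uk/id/statistical-geography/'
df['refArea'] = prefix + df['areaCode']

#...or recover the area code from an identifier by taking the last path segment
df['recovered'] = df['refArea'].str.split('/').str[-1]
```

Once both datasets carry a column in the same scheme, they can be merged as before.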
At other times, the partial match may be more problematic. For example, is "Open Uni" the same as "Open University"? More involved processes are required to cope with such considerations, which we will ignore for now.
The data file housingdata/yorksAndHumberside.csv contains a list of administrative areas in the Yorkshire and Humberside administrative region. There are three columns in the dataset, taking the form http://data.ordnancesurvey.co.uk/id/7000000000022028, NorthYorkshire, E10000023.
pd.read_csv('data/housingdata/yorksAndHumberside.csv')[15:20]
These contrast with the identifiers used in the building start and lettings price datasets, which take the form http://statistics.data.gov.uk/id/statistical-geography/E06000002, E06000002 Middlesbrough.
Exercise:
If the Yorkshire and Humber area dataset used the same identifier scheme as the housing start data, how could you generate a data set that just listed the housing starts for the 2012-13 period in that area?
In what ways can you generate a common administrative area identifier scheme for the housing start data and the Yorkshire and Humber area data?
Exercise
Generate a data table that contains the 2012-13 housing data for just those areas whose area codes are in the Yorkshire and Humber region.
Which area in that region has the largest total number of housing starts? Which area has the largest number of Housing Association starts?
Data tables can be joined vertically in pandas using the .concat() function and horizontally using .merge(). Inner, outer, left and right joins are all supported by .merge(), and inner and outer joins by .concat().
In pandasql, we can use the JOIN operator, suitably modified, to provide a similar range of merge operations.