from IPython.core.display import HTML
css_file = './custom.css'
HTML(open(css_file, "r").read())
Data downloaded from the Land Registry.
I've had some interest in house prices recently, and there is so much data available. I've also only recently discovered the wonders of IPython notebooks, and of handling data with the Python library pandas. So it seems like a good opportunity to learn how to use these tools, and how to deal with geographical data (I have a sneaking suspicion this will be trickier than it looks).
Doing this I want to learn:
Tools to use:
# Import libraries etc
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
import time
pd.options.display.mpl_style = 'default'
from matplotlib import rcParams
rcParams['figure.figsize'] = (12, 6)
rcParams['figure.dpi'] = 150
rcParams['font.size'] = 16
rcParams['font.family'] = 'sans-serif'
rcParams['axes.facecolor'] = '#ffffff'
rcParams['lines.linewidth'] = 2.0
There is a whole ton of UK government data available from http://data.gov.uk/, including loads relating to house prices. There is a handy page from the Land Registry which lets you filter their price paid data and download a subset. For this plotting exercise I've downloaded all the sales data for Cambridge in 2014, and saved it as a csv file in the res directory. I've also manually added some column headings to the csv so that pandas knows which column is which when the file is loaded.
First off, I'll load the data using pandas read_csv function, and tell it to parse the dates (which are in day first format) and use them as the index to the data.
# Load in the land registry data table (I added a header row)
fullData = pd.read_csv('./res/ppd_data.csv', index_col=2, parse_dates=True, dayfirst=True)
fullData.head()
| Date | ID | Price | Postcode | Type | New | Duration | PAON | Number | Street | NA | City | City2 | County | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-02-19 | 4DDCD83E-05D0-4688-9E46-46B4ADB91100 | 250000 | CB1 1BB | F | N | L | 18 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-08-27 | 84E91E3F-3AA1-42B6-8B01-E7E0532DE415 | 660000 | CB1 1BB | F | N | L | 25 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-01-17 | 8E078C8C-D3AD-48EC-862A-1945791B4E13 | 260000 | CB1 1BB | F | N | L | 32 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-02-20 | CEF302E5-7996-47CD-841A-72E22EC1D7BE | 310000 | CB1 1BB | F | N | L | 55 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-09-01 | D77AB759-F02C-45CD-8B5B-A35D78F7D52D | 345000 | CB1 1BB | F | N | L | 56 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |

5 rows × 14 columns
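As an aside, adding the header row to the csv by hand isn't strictly necessary: read_csv can supply column headings at load time via its names parameter. A minimal sketch, using made-up data and column names:

```python
import io
import pandas as pd

# A headerless csv, standing in for the raw Land Registry download
raw = io.StringIO('A1,100000,CB1 1BB\nB2,200000,CB2 3CD\n')

# header=None says the file has no heading row; names= provides one
df = pd.read_csv(raw, header=None, names=['ID', 'Price', 'Postcode'])
print(list(df.columns))  # ['ID', 'Price', 'Postcode']
```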
The full data table contains some columns I'm not interested in here, e.g. I specified the search to only include Cambridge, so the city column is redundant. Let's trim down the table a little bit and drop some columns using pandas.
Also, I know there are some NaN values in the Postcode column which will cause issues later, so let's also drop any rows with NaNs in that column.
columnsToDrop = ['ID', 'PAON', 'NA', 'City', 'City2', 'County', 'URL']
# I also know there are some NaNs in the postcode field, so let's drop those rows
cropData = fullData.drop(columnsToDrop, axis=1).dropna(subset=['Postcode'])
Let's also make sure the data looks sensible, and practice processing it a bit with pandas. We can make a few summary stats and plots:
print('Average price paid in 2014 = £ %d' % np.mean(cropData['Price']))
print('Median price paid in 2014 = £ %d' % np.median(cropData['Price']))
plt.hist(cropData['Price'],70)
plt.xlim([0, 1e6])
plt.xlabel('Purchase price (Pounds)')
plt.ylabel('Frequency')
plt.show()
Average price paid in 2014 = £ 353372
Median price paid in 2014 = £ 295250
monthMeans = cropData['Price'].groupby(cropData.index.month).aggregate(np.mean)
plt.plot(monthMeans)
plt.xlabel('Month')
plt.ylabel('Mean purchase price (Pounds)')
plt.show()
cropData['month'] = cropData.index.month
cropData.boxplot(column='Price',by='month')
plt.ylim((0, 1000000))
plt.xlabel('Month')
plt.ylabel('Purchase price (Pounds)')
plt.show()
saleCount = cropData['Price'].groupby(cropData.index.month).aggregate(len)
plt.plot(saleCount)
plt.ylim([0, 500])
plt.xlabel('Month')
plt.ylabel('Total sales')
plt.show()
Currently I have a list of postcodes, and need to turn these into geographical data. For this there's a handy API at Postcodes.io, which I'll call with the Python Requests library.
First off, I needed to learn how to use the API (and Requests). But this being Python, it turns out to be pleasingly simple!
# Import some more libraries
import requests
import json
## Practice on a couple of postcodes
# Form the request
url = 'http://api.postcodes.io/postcodes'
payload = {'postcodes' : ['OX49 5NU', 'M32 0JG'] } # Example postcodes
headers = {'content-type': 'application/json'}
# Make the request
r = requests.post(url, data=json.dumps(payload), headers=headers)
returnData = r.json()
# Print out the result, with some nicer formatting
print(json.dumps(returnData, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "result": [
        {
            "query": "OX49 5NU",
            "result": {
                "admin_county": "Oxfordshire",
                "admin_district": "South Oxfordshire",
                "admin_ward": "Benson",
                "ccg": "NHS Oxfordshire",
                "country": "England",
                "eastings": 464435,
                "european_electoral_region": "South East",
                "incode": "5NU",
                "latitude": 51.6562791294687,
                "longitude": -1.06993881320412,
                "lsoa": "South Oxfordshire 011B",
                "msoa": "South Oxfordshire 011",
                "nhs_ha": "South Central",
                "northings": 195686,
                "nuts": null,
                "outcode": "OX49",
                "parish": "South Oxfordshire",
                "parliamentary_constituency": "Henley",
                "postcode": "OX49 5NU",
                "primary_care_trust": "Oxfordshire",
                "quality": 1,
                "region": "South East"
            }
        },
        {
            "query": "M32 0JG",
            "result": {
                "admin_county": null,
                "admin_district": "Trafford",
                "admin_ward": "Gorse Hill",
                "ccg": "NHS Trafford",
                "country": "England",
                "eastings": 379988,
                "european_electoral_region": "North West",
                "incode": "0JG",
                "latitude": 53.4556572899372,
                "longitude": -2.30283674284007,
                "lsoa": "Trafford 003C",
                "msoa": "Trafford 003",
                "nhs_ha": "North West",
                "northings": 395476,
                "nuts": null,
                "outcode": "M32",
                "parish": null,
                "parliamentary_constituency": "Stretford and Urmston",
                "postcode": "M32 0JG",
                "primary_care_trust": "Trafford",
                "quality": 1,
                "region": "North West"
            }
        }
    ],
    "status": 200
}
Cool! What I'm really interested in here are the latitude and longitude corresponding to the postcode centre. It's easy to pull these out with a list comprehension (also making sure to handle any postcodes that couldn't be looked up).
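For example, running such a comprehension on a small stand-in for the response structure shown above (with one made-up postcode that fails to look up):

```python
import numpy as np

# A minimal stand-in for the API response structure shown above:
# each entry has a 'result' that is either a dict of fields or None
returnData = {
    "status": 200,
    "result": [
        {"query": "OX49 5NU", "result": {"latitude": 51.656, "longitude": -1.070}},
        {"query": "NOT REAL", "result": None},  # a postcode that couldn't be found
    ],
}

# Pull out lat/lon, substituting NaN where the lookup failed
result = returnData["result"]
lats = [r["result"]["latitude"] if r["result"] is not None else np.nan for r in result]
longs = [r["result"]["longitude"] if r["result"] is not None else np.nan for r in result]
print(lats)   # [51.656, nan]
print(longs)  # [-1.07, nan]
```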
Now that I know how to get the lat/lon data from postcodes, I'll do the same for the entire dataset.
# Get the postcodes and assemble a list
postcodes = cropData['Postcode']
queryCodes = postcodes.tolist()
# We can only request 100 postcodes at a time
maxPostcodes = 100
nQueries = np.ceil(float(len(queryCodes))/maxPostcodes)
print('%d postcodes, requiring %d requests' % (len(queryCodes), nQueries))
# Assemble the parts of the request which are constant
url = 'http://api.postcodes.io/postcodes'
headers = {'content-type': 'application/json'}
lats = []
longs = []
for iQuery in range(int(nQueries)):
    # Assemble the postcode query for this batch
    minQuery = iQuery*maxPostcodes
    maxQuery = min((iQuery+1)*maxPostcodes, len(queryCodes))
    payload = {'postcodes': queryCodes[minQuery:maxQuery]}
    # Make the request
    r = requests.post(url, data=json.dumps(payload), headers=headers)
    # Let's give the server a rest as it seems to get stressed
    time.sleep(0.1)
    if r.status_code == 200:
        sys.stdout.write('.')
        result = r.json()['result']
        # List comprehensions pull out the lat and lon data
        # The if/else catches cases where a postcode can't be found
        lats += [res['result']['latitude'] if res['result'] is not None else np.nan for res in result]
        longs += [res['result']['longitude'] if res['result'] is not None else np.nan for res in result]
    else:
        print('Query failed, status: %d' % r.status_code)
print("Finished queries")
4124 postcodes, requiring 42 requests ..........................................Finished queries
Now I have some geographical price data itching to be plotted on a map. This turned out to be trickier than I expected.
First we need a map to plot the data on. Stamen Mapstack lets you export really nicely stylised maps (based on OSM data): simple enough to be legible, but containing all the street detail needed to make sense of the house locations.
im=plt.imread('./res/map.png')
mapFac = 1.5
plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
plt.imshow(im)
plt.axis('off')
plt.title('A pretty map of Cambridge:\n')
plt.show()
Now we have a map in pixels, and some data with latitude and longitude co-ordinates. If I want to plot one on top of the other I need to make these match up nicely. I'm sure there are some tools that let you plot lat/lon data directly on a map, but I want to use the pretty map (as an image) from Stamen. I'm going to use Basemap to plot the points using a Mercator projection that should match the map, but I still need to convert from lat/lon to the pixel units of the map image.
I made the map image by manually panning and zooming to what looked like an appropriate location. The output of Stamen Mapstack gives the centre latitude and longitude, and the zoom level. This useful answer on Stack Overflow (after conversion to Python) lets me convert from these parameters to the bounding lat/lon box of the map.
def getTileNumber(lat, lon, zoom):
    xtile = (lon + 180)/360 * 2**zoom
    ytile = (1 - np.log(np.tan(np.radians(lat)) + 1/np.cos(np.radians(lat)))/np.pi)/2 * 2**zoom
    return (xtile, ytile)

def getLonLat(xtile, ytile, zoom):
    n = 2**zoom
    lon_deg = xtile / n * 360.0 - 180.0
    lat_deg = np.degrees(np.arctan(np.sinh(np.pi * (1 - 2 * ytile / n))))
    return (lon_deg, lat_deg)

def latLon2Box(lat, lon, zoom, width, height):
    tile_size = 256.0
    (xtile, ytile) = getTileNumber(lat, lon, zoom)
    xtile_s = (xtile * tile_size - width/2) / tile_size
    ytile_s = (ytile * tile_size - height/2) / tile_size
    xtile_e = (xtile * tile_size + width/2) / tile_size
    ytile_e = (ytile * tile_size + height/2) / tile_size
    (lon_s, lat_s) = getLonLat(xtile_s, ytile_s, zoom)
    (lon_e, lat_e) = getLonLat(xtile_e, ytile_e, zoom)
    return (lon_s, lat_s, lon_e, lat_e)
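As a quick sanity check on the tile maths, the forward and inverse conversions should round-trip. A self-contained sketch restating the same formulas as getTileNumber and getLonLat, tested at the map centre used below:

```python
import numpy as np

def tile_from_latlon(lat, lon, zoom):
    # Forward conversion: lat/lon -> slippy-map tile coordinates
    xtile = (lon + 180) / 360 * 2**zoom
    ytile = (1 - np.log(np.tan(np.radians(lat)) + 1/np.cos(np.radians(lat))) / np.pi) / 2 * 2**zoom
    return xtile, ytile

def latlon_from_tile(xtile, ytile, zoom):
    # Inverse conversion: tile coordinates -> lon/lat
    n = 2**zoom
    lon = xtile / n * 360.0 - 180.0
    lat = np.degrees(np.arctan(np.sinh(np.pi * (1 - 2 * ytile / n))))
    return lon, lat

# Round trip at the map centre used below (Cambridge, zoom level 13)
lat0, lon0, zoom = 52.206344321410604, 0.13166427612304688, 13
x, y = tile_from_latlon(lat0, lon0, zoom)
lon1, lat1 = latlon_from_tile(x, y, zoom)
print(abs(lat1 - lat0) < 1e-9, abs(lon1 - lon0) < 1e-9)  # True True
```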
(lon_s, lat_s, lon_e, lat_e) = latLon2Box(52.206344321410604, 0.13166427612304688, 13, 998, 670)
from mpl_toolkits.basemap import Basemap
m = Basemap(llcrnrlon=lon_s,llcrnrlat=lat_s,urcrnrlon=lon_e,urcrnrlat=lat_e,
resolution='h',projection='merc')
x1,y1 = m(longs,lats)
xs, ys = m(lon_s, lat_s)
xe, ye = m(lon_e, lat_e)
Now I can convert to pixel units, and try plotting the postcode data!
# Normalise to pixel co-ords
width = 998
height = 670
xPix = width*(np.array(x1) - xs)/(xe - xs)
yPix = height*(np.array(y1) - ys)/(ye - ys)
im=plt.imread('./res/map.png')
plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
plt.imshow(np.flipud(im))
plt.plot(xPix,height-yPix,'o',markersize=5, markeredgecolor='none')
plt.xlim([0, width])
plt.ylim([0, height])
plt.axis('off')
plt.show()
So plotting location data, with price encoded by colour, seems like a reasonable thing to do. I'll also add a bit of dithering to separate the overlapping points; I suppose it also reflects the uncertainty in the postcode-based locations.
maxPrice = 600000.0
import matplotlib as mpl
price = np.divide(cropData['Price'].tolist(), maxPrice)
price[price > 1] = 1  # cap the colour scale at maxPrice
ditherMag = 10
fig = plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
cmap = mpl.cm.cool
plt.imshow(np.flipud(im))
plt.scatter(xPix + ditherMag*(np.random.rand(xPix.size)-0.5),
height - (yPix +ditherMag*(np.random.rand(yPix.size)-0.5)),
c=price, cmap=cmap, s=60)
plt.xlim([0, width])
plt.ylim([0, height])
plt.axis('off')
plt.title('House sales, Cambridge 2014 (Pink = expensive, Blue = cheap)\n')
plt.savefig('res/price_map.jpg')
It worked, and was pretty easy thanks to the many useful tools that cleverer people have written and documented.
Now to try and make something informative...