from IPython.core.display import HTML
css_file = './custom.css'
HTML(open(css_file, "r").read())
Data downloaded from the Land Registry.
I've had some interest in house prices recently, and there is so much data available. I've also only recently discovered the wonders of IPython notebooks, and of handling data with the Python library pandas. So it seems like a good opportunity to learn how to use these tools, and how to deal with geographical data (I have a sneaking suspicion this will be trickier than it looks).
Doing this I want to learn:
Tools to use:
# Import libraries etc
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
import time
pd.options.display.mpl_style = 'default'
from matplotlib import rcParams
rcParams['figure.figsize'] = (12, 6)
rcParams['figure.dpi'] = 150
rcParams['font.size'] = 16
rcParams['font.family'] = 'sans-serif'
rcParams['axes.facecolor'] = '#ffffff'
rcParams['lines.linewidth'] = 2.0
There is a whole ton of UK government data available from http://data.gov.uk/, including loads relating to house prices. There is a handy page from the Land Registry which lets you filter their price paid data and download a subset. For this plotting exercise I've downloaded all the sales data for Cambridge in 2014, and saved it as a csv file in the res directory. I've also manually added some column headings to the csv so that pandas knows which column is which when the file is loaded.
First off, I'll load the data using pandas read_csv function, and tell it to parse the dates (which are in day first format) and use them as the index to the data.
# Load in the land registry data table (I added a header row)
fullData = pd.read_csv('./res/ppd_data.csv', index_col=2, parse_dates=True, dayfirst=True)
fullData.head()
| Date | ID | Price | Postcode | Type | New | Duration | PAON | Number | Street | NA | City | City2 | County | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2014-02-19 | 4DDCD83E-05D0-4688-9E46-46B4ADB91100 | 250000 | CB1 1BB | F | N | L | 18 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-08-27 | 84E91E3F-3AA1-42B6-8B01-E7E0532DE415 | 660000 | CB1 1BB | F | N | L | 25 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-01-17 | 8E078C8C-D3AD-48EC-862A-1945791B4E13 | 260000 | CB1 1BB | F | N | L | 32 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-02-20 | CEF302E5-7996-47CD-841A-72E22EC1D7BE | 310000 | CB1 1BB | F | N | L | 55 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |
| 2014-09-01 | D77AB759-F02C-45CD-8B5B-A35D78F7D52D | 345000 | CB1 1BB | F | N | L | 56 | PETERSFIELD MANSIONS | PETERSFIELD | NaN | CAMBRIDGE | CAMBRIDGE | CAMBRIDGESHIRE | http://landregistry.data.gov.uk/data/ppi/trans... |

5 rows × 14 columns
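As an aside, adding the header row to the csv by hand isn't strictly necessary: read_csv can supply column headings at load time via its names parameter. A minimal sketch, using made-up data and column names:

```python
import io
import pandas as pd

# A headerless csv, standing in for the raw Land Registry download
raw = io.StringIO('A1,100000,CB1 1BB\nB2,200000,CB2 3CD\n')

# header=None says the file has no heading row; names= provides one
df = pd.read_csv(raw, header=None, names=['ID', 'Price', 'Postcode'])
print(list(df.columns))  # ['ID', 'Price', 'Postcode']
```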
The full data table contains some columns I'm not interested in here, e.g. I specified the search to only include Cambridge, so the city column is redundant. Let's trim down the table a little bit and drop some columns using pandas.
Also, I know there are some NaN values in the Postcode column which will cause issues later, so let's also drop any rows with NaNs in that column.
columnsToDrop = ['ID', 'PAON', 'NA', 'City', 'City2', 'County', 'URL']
# I also know there are some NaNs in the postcode field, so let's drop those rows
cropData = fullData.drop(columnsToDrop, axis=1).dropna(subset=['Postcode'])
Let's also make sure the data looks sensible, and practice processing it a bit with pandas. We can make a few summary stats and plots:
print('Average price paid in 2014 = £ %d' % np.mean(cropData['Price']))
print('Median price paid in 2014 = £ %d' % np.median(cropData['Price']))
plt.hist(cropData['Price'],70)
plt.xlim([0, 1e6])
plt.xlabel('Purchase price (Pounds)')
plt.ylabel('Frequency')
plt.show()
Average price paid in 2014 = £ 353372
Median price paid in 2014 = £ 295250
monthMeans = cropData['Price'].groupby(cropData.index.month).aggregate(np.mean)
plt.plot(monthMeans)
plt.xlabel('Month')
plt.ylabel('Mean purchase price (Pounds)')
plt.show()
cropData['month'] = cropData.index.month
cropData.boxplot(column='Price',by='month')
plt.ylim((0, 1000000))
plt.xlabel('Month')
plt.ylabel('Purchase price (Pounds)')
plt.show()
saleCount = cropData['Price'].groupby(cropData.index.month).aggregate(len)
plt.plot(saleCount)
plt.ylim([0, 500])
plt.xlabel('Month')
plt.ylabel('Total sales')
plt.show()
Currently I have a list of postcodes, and need to turn these into geographical data. For this there's a handy API at Postcodes.io, which I'll call with the Python Requests library.
First off, I needed to learn how to use the API (and Requests). But this being Python, it turns out to be pleasingly simple!
# Import some more libraries
import requests
import json
## Practice on a couple of postcodes
# Form the request
url = 'http://api.postcodes.io/postcodes'
payload = {'postcodes' : ['OX49 5NU', 'M32 0JG'] } # Example postcodes
headers = {'content-type': 'application/json'}
# Make the request
r = requests.post(url, data=json.dumps(payload), headers=headers)
returnData = r.json()
# Print out the result, with some nicer formatting
print(json.dumps(returnData, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "result": [
        {
            "query": "OX49 5NU",
            "result": {
                "admin_county": "Oxfordshire",
                "admin_district": "South Oxfordshire",
                "admin_ward": "Benson",
                "ccg": "NHS Oxfordshire",
                "country": "England",
                "eastings": 464435,
                "european_electoral_region": "South East",
                "incode": "5NU",
                "latitude": 51.6562791294687,
                "longitude": -1.06993881320412,
                "lsoa": "South Oxfordshire 011B",
                "msoa": "South Oxfordshire 011",
                "nhs_ha": "South Central",
                "northings": 195686,
                "nuts": null,
                "outcode": "OX49",
                "parish": "South Oxfordshire",
                "parliamentary_constituency": "Henley",
                "postcode": "OX49 5NU",
                "primary_care_trust": "Oxfordshire",
                "quality": 1,
                "region": "South East"
            }
        },
        {
            "query": "M32 0JG",
            "result": {
                "admin_county": null,
                "admin_district": "Trafford",
                "admin_ward": "Gorse Hill",
                "ccg": "NHS Trafford",
                "country": "England",
                "eastings": 379988,
                "european_electoral_region": "North West",
                "incode": "0JG",
                "latitude": 53.4556572899372,
                "longitude": -2.30283674284007,
                "lsoa": "Trafford 003C",
                "msoa": "Trafford 003",
                "nhs_ha": "North West",
                "northings": 395476,
                "nuts": null,
                "outcode": "M32",
                "parish": null,
                "parliamentary_constituency": "Stretford and Urmston",
                "postcode": "M32 0JG",
                "primary_care_trust": "Trafford",
                "quality": 1,
                "region": "North West"
            }
        }
    ],
    "status": 200
}
Cool! What I'm really interested in here are the latitude and longitude corresponding to the postcode centre. It's easy to pull these out with a list comprehension (also making sure to handle any postcodes that couldn't be looked up).
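For example, running such a comprehension on a small stand-in for the response structure shown above (with one made-up postcode that fails to look up):

```python
import numpy as np

# A minimal stand-in for the API response structure shown above:
# each entry has a 'result' that is either a dict of fields or None
returnData = {
    "status": 200,
    "result": [
        {"query": "OX49 5NU", "result": {"latitude": 51.656, "longitude": -1.070}},
        {"query": "NOT REAL", "result": None},  # a postcode that couldn't be found
    ],
}

# Pull out lat/lon, substituting NaN where the lookup failed
result = returnData["result"]
lats = [r["result"]["latitude"] if r["result"] is not None else np.nan for r in result]
longs = [r["result"]["longitude"] if r["result"] is not None else np.nan for r in result]
print(lats)   # [51.656, nan]
print(longs)  # [-1.07, nan]
```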
Now that I know how to get the lat/lon data from postcodes, I'll do the same for the entire dataset.
# Get the postcodes and assemble a list
postcodes = cropData['Postcode']
queryCodes = postcodes.tolist()
# We can only request 100 postcodes at a time
maxPostcodes = 100
nQueries = np.ceil(float(len(queryCodes))/maxPostcodes)
print('%d postcodes, requiring %d requests' % (len(queryCodes), nQueries))
# Assemble the parts of the request which are constant
url = 'http://api.postcodes.io/postcodes'
headers = {'content-type': 'application/json'}
lats = []
longs = []
for iQuery in range(int(nQueries)):
    # Assemble the postcode query for this batch
    minQuery = iQuery*maxPostcodes
    maxQuery = min((iQuery+1)*maxPostcodes, len(queryCodes))
    payload = {'postcodes': queryCodes[minQuery:maxQuery]}
    # Make the request
    r = requests.post(url, data=json.dumps(payload), headers=headers)
    # Let's give the server a rest as it seems to get stressed
    time.sleep(0.1)
    if r.status_code == 200:
        sys.stdout.write('.')
        result = r.json()['result']
        # List comprehensions pull out the lat and lon data
        # The if/else catches cases where a postcode can't be found
        lats += [res['result']['latitude'] if res['result'] is not None else np.nan for res in result]
        longs += [res['result']['longitude'] if res['result'] is not None else np.nan for res in result]
    else:
        print('Query failed, status: %d' % r.status_code)
print("Finished queries")
4124 postcodes, requiring 42 requests ..........................................Finished queries
Now I have some geographical price data itching to be plotted on a map. This turned out to be trickier than I expected.
First we need a map to plot the data on. Stamen Mapstack lets you export really nicely stylised maps (based on OSM data): simple enough to be legible, but containing all the street detail needed to make sense of the house locations.
im=plt.imread('./res/map.png')
mapFac = 1.5
plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
plt.imshow(im)
plt.axis('off')
plt.title('A pretty map of Cambridge:\n')
plt.show()
Now we have a map in pixels, and some data with latitude and longitude co-ordinates. If I want to plot one on top of the other I need to make these match up nicely. I'm sure there are some tools that let you plot lat/lon data directly on a map, but I want to use the pretty map (as an image) from Stamen. I'm going to use Basemap to plot the points using a Mercator projection that should match the map, but I still need to convert from lat/lon to the pixel units of the map image.
I made the map image by manually panning and zooming to what looked like an appropriate location. The output of Stamen Mapstack gives the centre latitude and longitude, and the zoom level. This useful answer on Stack Overflow (after conversion to Python) lets me convert from these parameters to the bounding lat/lon box of the map.
def getTileNumber(lat, lon, zoom):
    xtile = (lon + 180)/360 * 2**zoom
    ytile = (1 - np.log(np.tan(np.radians(lat)) + 1/np.cos(np.radians(lat)))/np.pi)/2 * 2**zoom
    return (xtile, ytile)

def getLonLat(xtile, ytile, zoom):
    n = 2**zoom
    lon_deg = xtile / n * 360.0 - 180.0
    lat_deg = np.degrees(np.arctan(np.sinh(np.pi * (1 - 2 * ytile / n))))
    return (lon_deg, lat_deg)

def latLon2Box(lat, lon, zoom, width, height):
    tile_size = 256.0
    (xtile, ytile) = getTileNumber(lat, lon, zoom)
    xtile_s = (xtile * tile_size - width/2) / tile_size
    ytile_s = (ytile * tile_size - height/2) / tile_size
    xtile_e = (xtile * tile_size + width/2) / tile_size
    ytile_e = (ytile * tile_size + height/2) / tile_size
    (lon_s, lat_s) = getLonLat(xtile_s, ytile_s, zoom)
    (lon_e, lat_e) = getLonLat(xtile_e, ytile_e, zoom)
    return (lon_s, lat_s, lon_e, lat_e)
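As a quick sanity check on the tile maths, the forward and inverse conversions should round-trip. A self-contained sketch restating the same formulas as getTileNumber and getLonLat, tested at the map centre used below:

```python
import numpy as np

def tile_from_latlon(lat, lon, zoom):
    # Forward conversion: lat/lon -> slippy-map tile coordinates
    xtile = (lon + 180) / 360 * 2**zoom
    ytile = (1 - np.log(np.tan(np.radians(lat)) + 1/np.cos(np.radians(lat))) / np.pi) / 2 * 2**zoom
    return xtile, ytile

def latlon_from_tile(xtile, ytile, zoom):
    # Inverse conversion: tile coordinates -> lon/lat
    n = 2**zoom
    lon = xtile / n * 360.0 - 180.0
    lat = np.degrees(np.arctan(np.sinh(np.pi * (1 - 2 * ytile / n))))
    return lon, lat

# Round trip at the map centre used below (Cambridge, zoom level 13)
lat0, lon0, zoom = 52.206344321410604, 0.13166427612304688, 13
x, y = tile_from_latlon(lat0, lon0, zoom)
lon1, lat1 = latlon_from_tile(x, y, zoom)
print(abs(lat1 - lat0) < 1e-9, abs(lon1 - lon0) < 1e-9)  # True True
```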
(lon_s, lat_s, lon_e, lat_e) = latLon2Box(52.206344321410604, 0.13166427612304688, 13, 998, 670)
from mpl_toolkits.basemap import Basemap
m = Basemap(llcrnrlon=lon_s,llcrnrlat=lat_s,urcrnrlon=lon_e,urcrnrlat=lat_e,
resolution='h',projection='merc')
x1,y1 = m(longs,lats)
xs, ys = m(lon_s, lat_s)
xe, ye = m(lon_e, lat_e)
Now I can convert to pixel units, and try plotting the postcode data!
# Normalise to pixel co-ords
width = 998
height = 670
xPix = width*(np.array(x1) - xs)/(xe - xs)
yPix = height*(np.array(y1) - ys)/(ye - ys)
im=plt.imread('./res/map.png')
plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
plt.imshow(np.flipud(im))
plt.plot(xPix,height-yPix,'o',markersize=5, markeredgecolor='none')
plt.xlim([0, width])
plt.ylim([0, height])
plt.axis('off')
plt.show()
So plotting location data, with price encoded by colour, seems like a reasonable thing to do. I'll also add a bit of dithering to separate the overlapping points; I suppose it also reflects the uncertainty in the postcode-based locations.
maxPrice = 600000.0
import matplotlib as mpl
price = np.divide(cropData['Price'].tolist(), maxPrice)
price[price > 1] = 1  # cap the colour scale at maxPrice
ditherMag = 10
fig = plt.figure(figsize = (mapFac*9.98, mapFac*6.7))
cmap = mpl.cm.cool
plt.imshow(np.flipud(im))
plt.scatter(xPix + ditherMag*(np.random.rand(xPix.size)-0.5),
height - (yPix +ditherMag*(np.random.rand(yPix.size)-0.5)),
c=price, cmap=cmap, s=60)
plt.xlim([0, width])
plt.ylim([0, height])
plt.axis('off')
plt.title('House sales, Cambridge 2014 (Pink = expensive, Blue = cheap)\n')
plt.savefig('res/price_map.jpg')
It worked, and was pretty easy thanks to the many useful tools that cleverer people have written and documented.
Now to try and make something informative...