An IPython Notebook for generating postcode data on the locations of libraries participating in the Access to Research program. The code scrapes the Access to Research site for the URLs of participating libraries and then tries to find a postcode on each library's website. The map made from this data is available at Google Maps Engine. More details on the motivation and process are available in this post.
#imports
import requests
import string
from bs4 import BeautifulSoup
import re
import csv
Grab the Access to Research page that lists the participating libraries.
page = requests.get('http://www.accesstoresearch.org.uk/libraries')
soup = BeautifulSoup(page.text)
librarylist = soup.find('div', class_='col-lft')
libraries = librarylist.find_all('ul')
We can find the number of libraries and look at the structure of the information on each one. Once we know the structure of each library element we can easily pull out the name and URL for each one.
print len(libraries)
print libraries[0]
235
<ul class="list letters"><li class="letter-a"><a href="http://www.oxfordshire.gov.uk/cms/content/abingdon-library" target="_blank">Abingdon</a></li></ul>
libs = []
for library in libraries:
    name = library.find('a').text
    url = library.find('a')['href']
    libs.append([name, url])
Just check that we've got that working properly by printing out the first one.
print libs[0]
[u'Abingdon', u'http://www.oxfordshire.gov.uk/cms/content/abingdon-library']
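If you want to check the extraction logic without BeautifulSoup installed, the same name and URL can be pulled out of the sample element with the standard library's `html.parser`. This is only a minimal sketch, not what the notebook actually runs: the `LinkCollector` class is a hypothetical helper I've made up for illustration, and it assumes Python 3 (in Python 2 the module is called `HTMLParser`).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect [text, href] pairs for every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append([''.join(self._text).strip(), self._href])
            self._href = None


# The first library element, copied from the output above
sample = ('<ul class="list letters"><li class="letter-a">'
          '<a href="http://www.oxfordshire.gov.uk/cms/content/abingdon-library" '
          'target="_blank">Abingdon</a></li></ul>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
links = collector.links
print(links)
```

The result has the same `[name, url]` shape as the entries in `libs`, so it can be compared directly against the first entry printed above.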
Some county council websites show more than one postcode on a library page, for instance both the library's postcode and the council's own. For those pages we need to look inside more specific structures in the page to find the correct postcode.
To deal with this we need both a way of recognising a valid UK postcode (via regex) and some specific processing for the irritating cases.
# The UK Postcode REGEX came from
# http://www.regxlib.com/REDetails.aspx?regexp_id=260
ukpc = re.compile('([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)')
def process(raw_postcode, page):
    """
    Given a postcode from the 'doesn't work' list, process page correctly
    """
    soup = BeautifulSoup(page)
    # Each council site keeps the library postcode in a different element
    if raw_postcode == 'OX1 1ND':
        elem = soup.find(['span','div'], class_='postal-code')
    elif raw_postcode == 'ME14 1LQ':
        elem = soup.find('span', id='ctl00__mainContent_uxPostcodeLabel')
    elif raw_postcode == 'HX1 1UJ':
        elem = soup.find_all('ul', class_='contactitem')[1]
    else:
        # Not one of the known special cases: keep the postcode as found
        return raw_postcode
    return ukpc.search(elem.get_text()).group(0)
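To see why the special cases are needed, it helps to run the pattern over a page that mentions two postcodes. The snippet below is a small sketch: the pattern is repeated so it runs on its own, and `sample_text` is text I've invented for illustration (the second postcode is a placeholder, not a real library address). `search` only returns the first match, so if the council's postcode appears first, that is the one we get.

```python
import re

# Same pattern as above, repeated so this example is self-contained
ukpc = re.compile('([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)')

# Invented page text: the council postcode comes before the library one
sample_text = 'Council offices: County Hall, OX1 1ND. Library address: OX14 3LY.'

first_match = ukpc.search(sample_text).group(0)   # first postcode only
all_matches = ukpc.findall(sample_text)           # every postcode, in page order

print(first_match)
print(all_matches)
```

Note that the pattern is case-sensitive and expects one or two spaces between the two halves of the postcode, so a page that writes `ox1 1nd` or `OX11ND` would not be matched at all.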
With all of that in hand we can now process our list, grabbing each webpage, finding the postcode, and if necessary checking that we've got the correct one.
tocorrect = ['OX1 1ND', 'ME14 1LQ', 'HX1 1UJ']
for library in libs:
    page = requests.get(library[1])
    m = ukpc.search(page.text)
    if m:
        postcode = m.group(0)
        # Known council postcodes need the page-specific handling in process()
        if postcode in tocorrect:
            postcode = process(postcode, page.text)
        library.append(postcode)
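A cruder alternative to the page-specific handling, which I did not use here, would be to take the first postcode on the page that is not one of the known council postcodes. The sketch below shows the idea; `first_postcode_not_in` is a hypothetical helper, the sample text is invented, and the approach assumes the library's own postcode appears somewhere else on the same page, which may not hold for every site.

```python
import re

# Same pattern as above, repeated so this example is self-contained
ukpc = re.compile('([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)')


def first_postcode_not_in(text, excluded):
    """Return the first postcode in text that is not in excluded, or None."""
    for candidate in ukpc.findall(text):
        if candidate not in excluded:
            return candidate
    return None


tocorrect = ['OX1 1ND', 'ME14 1LQ', 'HX1 1UJ']

# Invented page text with a council postcode first and a placeholder second
page_text = 'Contact the council at ME14 1LQ or visit the library at ME15 6XH.'

picked = first_postcode_not_in(page_text, tocorrect)
missing = first_postcode_not_in('No postcode on this page', tocorrect)

print(picked)
print(missing)
```

Returning `None` when nothing usable is found makes it easy to spot the pages that still need checking by hand.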
And finally we can write out the files. Because Google Maps Engine Lite (the free version) only allows 100 items in a given layer, I've divided the list into three files as well as dumping the full set.
filename = 'libraries.csv'
with open(filename, 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Library', 'url', 'postcode'])
    writer.writerows(libs)
# Slices exclude their end index, so each segment starts where the last one ended
for segment in [(0,100), (100,200), (200,235)]:
    filename = 'libraries%s.csv' % str(segment[0])
    with open(filename, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(['Library', 'url', 'postcode'])
        writer.writerows(libs[segment[0]:segment[1]])
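Python slices include the start index and exclude the end index, so consecutive segments need to share a boundary if no row is to be dropped. Rather than typing the boundaries by hand, they can be computed from the list length and the layer limit. This is a minimal sketch: `chunk_bounds` is a hypothetical helper, and `rows` is a stand-in for the real list of libraries.

```python
def chunk_bounds(total, size):
    """Return half-open (start, end) pairs covering range(total) in steps of size."""
    return [(start, min(start + size, total)) for start in range(0, total, size)]


bounds = chunk_bounds(235, 100)   # 235 libraries, 100 items per layer
rows = list(range(235))           # stand-in for the list of libraries
chunks = [rows[start:end] for start, end in bounds]

print(bounds)
print([len(chunk) for chunk in chunks])
```

Every row ends up in exactly one chunk, and the same helper keeps working if the number of participating libraries changes.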
You can find the output data in the GitHub repository, which also hosts a version of this IPython Notebook. The map itself is live at Google Maps Engine, where I used the free Lite version. The post discussing this, and the irony that I've probably violated a whole range of text mining rules, not least those imposed on users of the Access to Research service, is available on my blog.
To the extent possible under law, Cameron Neylon has waived all copyright and related or neighboring rights to Access to Research Map - IPython Notebook.