Last updated: Thu Nov 21 22:57:12 GMT 2013.
Source code for this notebook is at: https://gist.github.com/MHenderson/6279740
The notebook itself can be viewed at: http://nbviewer.ipython.org/6279740
Using pyproj (http://code.google.com/p/pyproj/) we implement four functions for convenience.

distance(lat1, lng1, lat2, lng2) computes the distance in metres between points with latitudes lat1, lat2 and longitudes lng1, lng2 with respect to the WGS84 ellipsoid (http://en.wikipedia.org/wiki/World_Geodetic_System).

distance_between(p1, p2) computes the same distance when the arguments p1, p2 are (latitude, longitude) pairs.

nearest_mile(distance_in_metres) converts a distance in metres to whole miles. Because it uses int(), the result is rounded down rather than to the nearest mile.

total_distance(points) calculates the total distance along a list of points by summing the distances between consecutive points.

import pyproj
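def distance(lat1, lng1, lat2, lng2, ellps='WGS84'):
    g = pyproj.Geod(ellps=ellps)
    # Geod.inv takes longitude before latitude and returns
    # (forward azimuth, back azimuth, distance in metres).
    return g.inv(lng1, lat1, lng2, lat2)[2]

def distance_between(p1, p2):
    # p1 and p2 are (latitude, longitude) pairs.
    return distance(p1[0], p1[1], p2[0], p2[1])

def nearest_mile(distance_in_metres):
    # int() truncates towards zero.
    return int(0.621371*distance_in_metres/1000)

def total_distance(points):
    # Pair each point with its successor and sum the leg distances.
    return sum(map(distance_between, points[:-1], points[1:]))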
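The pairing used by total_distance can be seen in isolation with a toy metric. The sketch below is written for Python 3 and runs without pyproj. The names abs_difference and total_length are hypothetical stand-ins for distance_between and total_distance, with points on a number line in place of GPS coordinates.

```python
def abs_difference(a, b):
    # Hypothetical stand-in for distance_between: distance on a number line.
    return abs(a - b)

def total_length(points, metric):
    # Pair each point with its successor, then sum the leg lengths.
    return sum(map(metric, points[:-1], points[1:]))

path = [0, 3, 7, 6]
legs = list(zip(path[:-1], path[1:]))   # consecutive pairs along the path
total = total_length(path, abs_difference)
print(legs, total)
```

A list with a single point has no legs, so the sum is zero.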
So now, for example, if we know that Nottingham, England has latitude and longitude (52.9548, -1.1581) and Louisville, Kentucky has latitude and longitude (38.253284, -85.758786), then we can compute the shortest-path (geodesic) distance between those two points on the WGS84 ellipsoid by using the distance_between function.
p1 = (52.9548, -1.1581) # Nottingham, England
p2 = (38.253284, -85.758786) # Louisville, KY
print "Distance (to the nearest mile): " + str(nearest_mile(distance_between(p1, p2)))
Distance (to the nearest mile): 3976
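As a rough cross-check on the figure above, the haversine formula gives the great-circle distance on a sphere. This is a simpler technique than the WGS84 computation done by pyproj, so the two results should agree only approximately (typically to within about half a percent). The sketch is written for Python 3. The function name haversine_distance and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math

# ASSUMPTION: mean Earth radius in metres; a sphere only approximates WGS84.
EARTH_RADIUS_M = 6371000.0

def haversine_distance(lat1, lng1, lat2, lng2, radius=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

nottingham = (52.9548, -1.1581)
louisville = (38.253284, -85.758786)
metres = haversine_distance(nottingham[0], nottingham[1],
                            louisville[0], louisville[1])
miles = 0.621371 * metres / 1000
print("Spherical estimate (miles): %.0f" % miles)
```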
The data we are given is in CSV format (http://en.wikipedia.org/wiki/Comma-separated_values). Each row gives GPS data (in the columns headed latitude and longitude) for a specific van (van_id) at a specific time (timestamp). We also have access to other information such as the address, speed, heading and so forth. To open a CSV file for inspection with Python we use the standard library module csv, whose DictReader object provides a dictionary interface to the CSV data. To instantiate a DictReader we need to provide an open file object for the CSV file and a list of column headings.
data_dir_path = '/home/matthew/workspace/resources/G/Geographical Information Science/'
van_activity_csv_filename = 'van_activity.csv'
van_activity_csv_filename = 'gps-activity.csv'
van_activity_csv_path = data_dir_path + van_activity_csv_filename
labels = ['id','van_id','timestamp','latitude','longitude','type','address','speed','heading','created']
With this information we can create our DictReader object:
import csv
csv_file = open(van_activity_csv_path, 'rb')
van_activity_reader = csv.DictReader(csv_file, labels, delimiter=',', quotechar='\"')
The keyword arguments delimiter and quotechar can be customised, for example to allow for tab-separated values.
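For instance, a tab-separated file needs only a different delimiter. The sketch below is written for Python 3 and reads from an in-memory string, so it runs without any data file. The sample rows and the shortened list of labels are hypothetical.

```python
import csv
import io

# Hypothetical tab-separated sample with no header row.
sample = "7\t52.9548\t-1.1581\n7\t52.9601\t-1.1502\n"
labels = ['van_id', 'latitude', 'longitude']

# Same call as for comma-separated data, with a tab as the delimiter.
reader = csv.DictReader(io.StringIO(sample), labels,
                        delimiter='\t', quotechar='"')
rows = list(reader)
print(rows[0]['latitude'])
```

Every value comes back as a string, so numeric columns still need converting before any arithmetic.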
We immediately advance van_activity_reader past the first row. Because we passed the column headings explicitly, DictReader treats the heading row of the file as ordinary data, and we don't want to do any calculation with it. After that we build a list of points by iterating over the remaining rows of the data.
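van_activity_reader.next()  # skip the heading row
points = []
for van_activity in van_activity_reader:
    # DictReader yields strings, so convert the coordinates to floats.
    points.append((float(van_activity['latitude']), float(van_activity['longitude'])))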
len(points)
143974
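The reader's next() method exists only in Python 2. The built-in next() function does the same job on Python 2.6+ and Python 3. A minimal sketch of the same heading-skip and point-building steps, written for Python 3 against a hypothetical in-memory sample rather than the real file:

```python
import csv
import io

# Hypothetical sample whose first row holds the column headings.
sample = ("van_id,latitude,longitude\n"
          "7,52.9548,-1.1581\n"
          "7,52.9601,-1.1502\n")
labels = ['van_id', 'latitude', 'longitude']

reader = csv.DictReader(io.StringIO(sample), labels)
header = next(reader)  # with explicit labels, the heading row arrives as data
points = [(float(row['latitude']), float(row['longitude'])) for row in reader]
print(len(points))
```

When reading a real file on Python 3, the csv documentation recommends opening it in text mode with newline='' instead of 'rb'.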
The ultimate task is to inspect the data for anomalies. The vans should follow the same routes day after day and should therefore return more or less the same data every day. We want to look for features in the data that will allow us to recognise automatically whether a van's activity is anomalous. To start with, we look at the total distance travelled by a given van on a given day. As a first pass, the calculation below sums the distances over every row in the file, without yet separating the rows by van or by day.
print "Distance (to the nearest mile): " + str(nearest_mile(total_distance(points)))
Distance (to the nearest mile): 937658
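The total above is taken over every row in the file. To get the distance for a given van on a given day, the rows first need to be grouped. The sketch below is written for Python 3 and rests on two assumptions: timestamps begin with a YYYY-MM-DD date (the actual format is not shown here), and rows are already in time order for each van. The function points_by_van_and_day and the sample rows are hypothetical.

```python
from collections import defaultdict

def points_by_van_and_day(rows):
    """Group (latitude, longitude) pairs by (van_id, day), keeping row order."""
    groups = defaultdict(list)
    for row in rows:
        # ASSUMPTION: the timestamp starts with a YYYY-MM-DD date.
        day = row['timestamp'][:10]
        point = (float(row['latitude']), float(row['longitude']))
        groups[(row['van_id'], day)].append(point)
    return dict(groups)

# Hypothetical rows in the shape that DictReader would yield.
rows = [
    {'van_id': '1', 'timestamp': '2013-08-01 09:00:00',
     'latitude': '52.95', 'longitude': '-1.15'},
    {'van_id': '2', 'timestamp': '2013-08-01 09:00:05',
     'latitude': '52.90', 'longitude': '-1.10'},
    {'van_id': '1', 'timestamp': '2013-08-01 09:01:00',
     'latitude': '52.96', 'longitude': '-1.16'},
    {'van_id': '1', 'timestamp': '2013-08-02 09:00:00',
     'latitude': '52.95', 'longitude': '-1.15'},
]
groups = points_by_van_and_day(rows)
for key in sorted(groups):
    print(key, len(groups[key]))
```

Each list in groups could then be passed to total_distance to give one figure per van per day.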