How many routes have how many trips?
The first thing I wanted to do was see which routes have the most trips, which gives us a rough estimate of which routes are busiest. This has a few caveats (a route with more trips may simply be shorter in length), but it's a good start. We don't technically need Stops yet, but it's nice to have all of our class definitions in one place.
Let's get our imports out of the way and define what routes, trips, and stops are.
%matplotlib inline
import csv, math, pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from classes import Route, Trip, Stop # just classes to store the data
from pandas.stats.api import ols
import pymongo
from pymongo import MongoClient, ASCENDING, DESCENDING
client = MongoClient()
db = client.datasummative # my name for the database
Download the data from here, extract, and place the files in a folder called "schedules". In it, you should see the files:
$ ls schedules/
agency.txt routes.txt stops.txt trips.txt
calendar.txt shapes.txt stop_times.txt
We don't use all of these files, but it's good to have them nonetheless.
Let's start by opening and parsing the routes.txt file. This should give us a list of all the routes in the TTC, which we can use for later analysis.
routes_list = {}
with open('schedules/routes.txt') as routes_file:
    routes_csv = csv.reader(routes_file)
    next(routes_csv) # skip header
    for route in routes_csv:
        route_id = int(route[0])
        route_short_name = route[2]
        route_long_name = route[3]
        route_type = int(route[5])
        routes_list[route_id] = Route(route_id, route_short_name,
                                      route_long_name, route_type)
Next, the trips.txt file. This'll tell us what trips each route has. For each trip we find, we'll attach it to a list associated with the route it belongs to.
A trip is one vehicle departing from its home station at one specific point in time.
with open('schedules/trips.txt') as trips_file:
    trips_csv = csv.reader(trips_file)
    next(trips_csv) # skip header
    for trip in trips_csv:
        route_id = int(trip[0])
        service_id = int(trip[1])
        trip_id = int(trip[2])
        trip_headsign = trip[3]
        route = routes_list[route_id]
        route.trips[trip_id] = Trip(trip_id, service_id, trip_headsign)
Now let's go ahead and analyze the data we have. We'll start by putting together tuples of routes and the number of trips they have, then plotting them to see their frequencies. In other words, we'll count the number of trips each route has.
## Distribution of trips vs routes
trips_freq = []
for route in routes_list.values():
    trips_freq.append((route.route_short_name, len(route.trips)))
with open('trips_routes.pkl', 'wb') as trips_routes_file:
    pickle.dump(trips_freq, trips_routes_file)
trips_series = pd.Series([x[1] for x in trips_freq])
print(trips_series.describe())
count     182.000000
mean      689.664835
std       622.333137
min        13.000000
25%       275.750000
50%       554.000000
75%       945.000000
max      4696.000000
dtype: float64
From this data, we see that there are 182 routes in total; over the six-week period, the mean route has ~690 trips, the median route has 554 trips, the route with the fewest has 13, and the route with the most has 4696.
That's pretty interesting, but still doesn't give us very much. Let's define Pearson's index, the range, and the interquartile range, and find out whether we have any outliers, according to the formulae we learned in class.
def pearsons_index(series):
    return 3 * (series.mean() - series.median()) / series.std()

def data_range(series):
    return series.quantile(1) - series.quantile(0)

def interquartile_range(series):
    return series.quantile(.75) - series.quantile(.25)

def outlier_range(series):
    iqr = interquartile_range(series)
    return (series.quantile(.25) - 1.5 * iqr, series.quantile(.75) + 1.5 * iqr)
print("Pearson's: %s" % pearsons_index(trips_series))
print("Range: %s" % data_range(trips_series))
print("IQR: %s" % interquartile_range(trips_series))
print("Outliers lie outside: (%s,%s)" % outlier_range(trips_series))
Pearson's: 0.653981736266
Range: 4683
IQR: 669.25
Outliers lie outside: (-728.125,1948.875)
According to the Pearson's Index, it would appear that the distribution of trips over routes is right-skewed, but not very much. Let's see the plot to make sure.
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist([x[1] for x in trips_freq], bins=25,
                           align='mid', range=(0, 5000))
plt.xlabel('Number of Trips')
plt.ylabel('Number of Routes')
plt.title('Distribution of trips over routes')
plt.xticks(rotation=70)
ax.set_xticks(bins[:-1])
plt.xticks(bins[:-1],['%.1f' % (x-0.5) for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
There appear to be a few outliers, and the distribution looks clearly skewed. Let's remove these outliers.
outlier_r = outlier_range(trips_series)
outliers = filter(lambda x: x[1] < outlier_r[0] or x[1] > outlier_r[1], trips_freq)
print(list(outliers))
[('29', 2224), ('32', 2011), ('501', 2705), ('504', 2025), ('509', 2059), ('510', 4696), ('52', 2043), ('BLR', 2181), ('YUS', 2267)]
Of the outlier routes, two are subways, four are streetcars and three are buses. Let's see what effect removing these outliers from our data has.
trips_series = pd.Series([x[1] for x in trips_freq if x[1] >= outlier_r[0] and x[1] <= outlier_r[1]])
print(trips_series.describe())
print("Pearson's: %s" % pearsons_index(trips_series))
print("IQR: %s" % interquartile_range(trips_series))
count     173.000000
mean      597.156069
std       445.952944
min        13.000000
25%       261.000000
50%       525.000000
75%       881.000000
max      1814.000000
dtype: float64
Pearson's: 0.485405940789
IQR: 620.0
As expected, removing the outliers decreased both Pearson's index and the IQR. Let's check the plot:
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(trips_series, bins=25,
                           align='mid', range=(0, 2000))
plt.xlabel('Number of Trips')
plt.ylabel('Number of Routes')
plt.title('Distribution of trips over routes')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.1f' % (x-0.5) for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
Despite the large mode on the left of the histogram, Pearson's index indicates that the data is not as skewed as it seems. Moreover, since only 9 of the 182 routes are outliers, the distribution can reasonably be treated as non-skewed. For us, this means that while routes vary by up to 1801 trips over six weeks, most routes are fairly close to the median of 554 trips over the six-week period.
How is the daily cost of a route related to its ridership?
ridership_collection = db.ridership
rts = []
riders = []
costs = []
for ridership_row in ridership_collection.find():
    rts.append(ridership_row['route'])
    riders.append(ridership_row['ridership'])
    costs.append(ridership_row['cost'])
riders_costs_df = pd.DataFrame({'route': rts, 'ridership': riders, 'cost': costs},
                               index=rts)
del rts
del costs
del riders
print(ols(x=riders_costs_df['ridership'], y=riders_costs_df['cost']))
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 138
Number of Degrees of Freedom: 2
R-squared: 0.9276
Adj R-squared: 0.9270
Rmse: 7045.0030
F-stat (1, 136): 1741.2502, p-value: 0.0000
Degrees of Freedom: model 1, resid 136
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     2.0056     0.0481      41.73     0.0000     1.9114     2.0998
     intercept  2478.4971   815.2188       3.04     0.0028   880.6683  4076.3258
---------------------------------End of Summary---------------------------------
# define a convenient way to graph some stuff, find its regression, and graph that too
def plot_with_reg(x, y):
    plt.hold(True)
    plt.plot(x, y, 'b.')
    res = ols(y=y, x=x)
    m = res.beta['x']
    b = res.beta['intercept']
    x_fit = np.linspace(min(x), max(x))
    y_fit = np.poly1d((m, b))
    plt.plot(x_fit, y_fit(x_fit), '-k')
    plt.hold(False)
plot_with_reg(x=riders_costs_df['ridership'], y=riders_costs_df['cost'])
t = plt.title("Cost of a Route vs. Number of Riders")
t = plt.xlabel("Number of Customers per Weekday")
t = plt.ylabel("Cost per Weekday ($)")
The regression indicates, as expected, that the cost of a route is strongly positively correlated with the number of daily riders it serves. We can interpret the coefficients as saying that every additional rider adds about $2 of cost to a route, on top of a fixed cost of roughly $2500 per route.
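To make that interpretation concrete, here's a small sketch (the function is mine, not part of the analysis) that plugs a ridership figure into the fitted coefficients from the regression summary:

```python
def predict_daily_cost(riders, slope=2.0056, intercept=2478.4971):
    """Estimate a route's weekday cost ($) from its weekday ridership,
    using the slope and intercept fitted by the regression above."""
    return slope * riders + intercept

# e.g. a hypothetical route serving 10,000 riders per weekday:
print(predict_daily_cost(10000))  # about $22,534 per weekday
```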
How are routes, trips, and types of vehicles related?
The data also tells us what type of vehicles each route has. For the TTC, this means either streetcar, subway or bus.
# from https://developers.google.com/transit/gtfs/reference#routes_fields
TYPE_SUBWAY = 1
TYPE_STREETCAR = 0
TYPE_BUS = 3
buses = 0
subways = 0
streetcars = 0
for route in routes_list.values():
    if route.route_type == TYPE_SUBWAY:
        subways += 1
    elif route.route_type == TYPE_STREETCAR:
        streetcars += 1
    elif route.route_type == TYPE_BUS:
        buses += 1
print("Bus routes: %s" % buses)
print("Subway routes: %s" % subways)
print("Streetcar routes: %s" % streetcars)
Bus routes: 166
Subway routes: 4
Streetcar routes: 12
This would be most useful as a pie chart.
plt.figure(figsize = (6,6))
p = plt.pie((buses, streetcars, subways),
            labels=("Bus Routes", "Streetcar Routes", "Subway Routes"),
            startangle=90, autopct="%1.1f%%")
t = plt.title("Vehicles in the TTC by Number of Routes")
It would also be useful to look at the vehicles by trips.
subways_trips = 0
buses_trips = 0
streetcars_trips = 0
trips_map = dict(trips_freq)
for route in routes_list:
    route_type = routes_list[route].route_type
    route_name = routes_list[route].route_short_name
    if route_type == TYPE_SUBWAY:
        subways_trips += trips_map[route_name]
    elif route_type == TYPE_BUS:
        buses_trips += trips_map[route_name]
    elif route_type == TYPE_STREETCAR:
        streetcars_trips += trips_map[route_name]
print("Bus trips: %s" % buses_trips)
print("Subway trips: %s" % subways_trips)
print("Streetcar trips: %s" % streetcars_trips)
plt.figure(figsize=(6,6))
p = plt.pie((buses_trips, streetcars_trips, subways_trips),
            labels=("Bus Trips", "Streetcar Trips", "Subway Trips"),
            startangle=90, autopct="%1.1f%%")
t = plt.title("Vehicles in the TTC by number of trips")
Bus trips: 101427
Subway trips: 7586
Streetcar trips: 16506
Evidently, the ordering remains the same, but buses hold a smaller majority, and both subways and streetcars take a larger share of the pie.
Next, we can see which vehicles average more trips per route. This is best done in a bar chart:
streetcar_trip_route = streetcars_trips / streetcars
buses_trip_route = buses_trips / buses
subway_trip_route = subways_trips / subways
print("Streetcars: %s trips per route" % streetcar_trip_route)
print("Buses: %s trips per route" % buses_trip_route)
print("Subways: %s trips per route" % subway_trip_route)
index = np.arange(3)
bar_width = .35
rect_buses = plt.bar(index, (buses_trip_route, streetcar_trip_route, subway_trip_route), bar_width)
plt.xlabel("Vehicle Type")
plt.ylabel("Average Trips per Route")
plt.title("Average Trips per Route by Vehicle Type")
ticks = plt.xticks(index + bar_width / 2, ("Buses", "Streetcars", "Subways"))
Streetcars: 1375.5 trips per route
Buses: 611.0060240963855 trips per route
Subways: 1896.5 trips per route
Clearly, subways run the most trips per route, followed by streetcars, followed by buses. This is a loose indication of how busy each type of vehicle is in the TTC. This makes sense, since subways are worked pretty much continuously over the day, and buses are 'resting' the most over a day.
The TTC has over 10,000 stop locations. Where are they?
stops_list = {}
with open('schedules/stops.txt') as stops_file:
    stops_csv = csv.reader(stops_file)
    next(stops_csv) # skip header
    for stop in stops_csv:
        stop_id = int(stop[0])
        stop_name = stop[2]
        stop_lat = float(stop[4])
        stop_lon = float(stop[5])
        stops_list[stop_id] = Stop(stop_id, stop_name, stop_lat, stop_lon)
lats = [x.stop_lat for x in stops_list.values()]
longs = [x.stop_lon for x in stops_list.values()]
plt.figure(figsize=(8,6))
plt.scatter(longs, lats, marker='.', c='b')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
t = plt.title('Geographic Distribution of All Stops')
It's interesting how obvious the slant of Toronto's streets relative to the GPS grid is. When we say we're going north on a street, we're likely also going slightly west. There's also an obvious gap in stops along the Don River, which we can follow as it travels to (from?) Lake Ontario.
Let's take the mean and median of these points to approximate the geographic centre of Toronto.
longs_pd = pd.Series(longs)
lats_pd = pd.Series(lats)
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(longs_pd, align='mid', bins=25)
plt.xlabel('Longitude')
plt.ylabel('Number of stops')
plt.title('Longitude of Stops')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.3f' % x for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
print("Longitudes: ", longs_pd.describe())
Longitudes:
count    10738.000000
mean       -79.399878
std          0.110536
min        -79.650347
25%        -79.489651
50%        -79.400623
75%        -79.313075
max        -79.123046
dtype: float64
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(lats_pd, align='mid', bins=25)
plt.xlabel('Latitude')
plt.ylabel('Number of stops')
plt.title('Latitude of Stops')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.3f' % x for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
print("Latitudes: ", lats_pd.describe())
Latitudes:
count    10738.000000
mean        43.725382
std          0.060305
min         43.591810
25%         43.677031
50%         43.721847
75%         43.771064
max         43.914362
dtype: float64
Interestingly, the standard deviation of the longitudes is significantly larger than that of the latitudes, indicating that Toronto is wider along its east-west axis than its north-south axis. Both distributions are roughly mound-shaped and symmetric, however.
For us, this means that as long as you don't live on the fringes of Toronto, such as very far north or very far east/west, there is an adequate supply of bus stops.
What does real-time vehicle location reveal?
Thanks to NVAS, we can keep track of each bus or streetcar (but not subway) by simply querying a server. The service accepts only one route per request, so collecting data for the entire TTC takes 178 requests (one "meta-request"). The data was saved into Python pickle files.
Each meta-request took approximately 44 seconds, and one was run every two minutes for three weeks. Currently, there are a total of 10,126 pickle files, each containing the complete results of one meta-request.
The script to request the data and save it to a file is found in busLocations.py. For the sake of the filesystem, I've written a script to import this data into MongoDB, which can be found in busLocationsMongo.py.
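I haven't reproduced busLocations.py here, but the polling logic amounts to something like the sketch below. The BASE_URL and query parameters are placeholders, not the real feed's API, and `fetch`/`save` are injected so the structure is clear without doing any network I/O:

```python
import time

# Placeholder endpoint -- NOT the real feed URL.
BASE_URL = "http://example.com/vehicleLocations"

def build_request_url(route_tag, last_time=0):
    # one route per request; last_time asks only for updates since then
    return "%s?a=ttc&r=%s&t=%d" % (BASE_URL, route_tag, last_time)

def meta_request(route_tags, fetch):
    """One 'meta-request': one request per route, results keyed by route."""
    return {tag: fetch(build_request_url(tag)) for tag in route_tags}

def poll_forever(route_tags, fetch, save, interval=120):
    """Run a meta-request every `interval` seconds and save each result."""
    while True:
        started = time.time()
        save(meta_request(route_tags, fetch))
        time.sleep(max(0, interval - (time.time() - started)))
```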
Each data point in the database consists of a route tag, a direction tag, a heading, a vehicle ID, a latitude and longitude, and a timestamp (plus MongoDB's internal _id).
Let's count how many we have, and look at one specific data point.
collection = db.vehicles # the vehicles collection in the database
print("Number of points: %s" % collection.count())
print("An arbitrarily selected point: %s" % collection.find_one())
Number of points: 9420806
An arbitrarily selected point: {'routeTag': '191', '_id': ObjectId('536bedf4920c0e21c26c2e79'), 'dirTag': '191_1_191', 'heading': 71, 'vId': 7945, 'lat': 43.735985, 'timestamp': 1398694091674, 'long': -79.595016}
Wow! Over 9.42 million data points! There's a tremendous amount we could do with this data. Let's start by aggregating the headings of the vehicles. Heading is arguably a continuous variable, but we'll treat it as discrete since the feed reports only 360 integer values. Fortunately, MongoDB's aggregate function makes this easy.
headings = collection.aggregate([{"$group": {'_id': '$heading',
                                             'count': {'$sum': 1}}},
                                 {"$match": {'_id': {'$gte': 0}}}])
headings = headings['result']
Let's plot this on a polar bar chart, which is the obvious way to visualize this.
theta = []
r = []
for heading in sorted(headings, key=lambda x: x['_id']):
    theta.append(heading['_id'] * np.pi / 180)
    r.append(heading['count'])
ax = plt.subplot(111, polar=True)
ax.plot(theta, r)
ax.grid(True)
ax.set_theta_direction(-1)
ax.set_theta_zero_location('N')
plt.title("Headings of TTC vehicles.")
print("Maximum: %s" % max(r))
print("Values on two sides of maximum:", r[r.index(max(r)) - 1], r[r.index(max(r)) + 1])
print("Angle of maximum (radians): ", theta[r.index(max(r))])
print("Total count: ", sum(r))
Maximum: 830814
Values on two sides of maximum: 3654 6356
Angle of maximum (radians): 3.7699111843077517
Total count: 8635317
The strange spike at 216 degrees can probably be attributed to instrumentation error, since it is over two orders of magnitude larger than the values at 215 and 217 degrees. Let's remove it from our dataset and try again.
spike_idx = r.index(max(r))
del r[spike_idx]
del theta[spike_idx]
ax = plt.subplot(111, polar=True)
ax.plot(theta, r)
ax.grid(True)
ax.set_theta_direction(-1)
ax.set_theta_zero_location('N')
ti = plt.title("Headings of TTC vehicles.")
As expected, there are spikes heading north, east, south and west, but very little in between. The slight offset is likely due to the inclination of Toronto's streets, as seen in the geographic scatter plot of the stops. Another interesting, but also expected, observation is that the graph is mostly symmetrical -- vehicles spend as much time heading east as they do west, and as much heading north as south. This also means that, on average, vehicles travel at the same speed eastbound as westbound, and northbound as southbound.
One item of note is that the east and west spikes are much larger than the north and south spikes. This indicates, like we saw earlier with the larger standard deviation in longitude, that Toronto's east-west axis is longer than Toronto's north-south axis.
Let's load it into Pandas for easier analysis.
headings_dataframe = pd.DataFrame(data={"heading": [x * 180 / np.pi for x in theta],
                                        "count": r})
print(headings_dataframe[:10])
   count  heading
0   8722        0
1   6229        1
2   5606        2
3   7725        3
4  10014        4
5   5448        5
6   8428        6
7   7050        7
8   4323        8
9   4621        9
Obviously, there are more rows in this data but I truncated it for brevity. To find the mean, median and mode, we will have to implement those methods ourselves.
total_headings = headings_dataframe['count'].sum()
# median
median_idx = total_headings // 2
current_idx = 0
for heading in range(len(headings_dataframe)):
    current_idx += headings_dataframe.iloc[heading]['count']
    if current_idx >= median_idx:
        median = headings_dataframe.iloc[heading]['heading']
        break
# mean
s = 0
for heading in range(len(headings_dataframe)):
    s += headings_dataframe.iloc[heading]['count'] * headings_dataframe.iloc[heading]['heading']
mean = s / total_headings
# standard deviation
s = 0
for heading in headings_dataframe.iterrows():
    n = heading[1]['count']
    h = heading[1]['heading']
    s += n * (h - mean) ** 2
stdev = math.sqrt((1/total_headings) * s)
# mode
max_heading = max(headings_dataframe['count'])
modes = [x[1]['heading'] for x in headings_dataframe.iterrows() if x[1]['count'] == max_heading]
print("Mean: %.3f" % mean)
print("Standard deviation: %.3f" % stdev)
print("Median: %.0f" % median)
print("Modes: %.0f" % modes[0])
Mean: 197.525
Standard deviation: 102.488
Median: 179
Modes: 253
As expected, the mean and median land near 180 degrees, since the headings span 0 to 360 degrees and are spread roughly symmetrically around the circle. The mode is 253 degrees, which is to be expected since it is the west spike; the east spike is likely the second-most-popular heading. The standard deviation here is not particularly useful.
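One caveat worth noting (not part of the original analysis): for directional data, the ordinary linear mean is misleading, since 350 degrees and 20 degrees average to 185 rather than 5. A more appropriate statistic is the circular mean, which averages headings as unit vectors:

```python
import numpy as np

def circular_mean(angles_deg, weights=None):
    """Weighted circular mean of headings in degrees: sum the headings
    as unit vectors, then take the angle of the resultant vector."""
    a = np.radians(np.asarray(angles_deg, dtype=float))
    w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
    return np.degrees(np.arctan2((w * np.sin(a)).sum(),
                                 (w * np.cos(a)).sum())) % 360

# headings just either side of north average to north, not south:
print(round(circular_mean([350, 20]), 6))  # 5.0, where the linear mean gives 185
```

With the per-heading counts passed as `weights`, this would give a more honest "average heading" for the fleet than the 197.5 degrees above.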
What happens to the number of active vehicles over time?
import pickle
from matplotlib.dates import HourLocator, DateFormatter, DayLocator
from datetime import datetime
from pytz import timezone
import pytz
tz = timezone('America/Toronto')
with open('time_dist.pickle', 'rb') as time_dist_pickle:
    time_dist = pickle.load(time_dist_pickle)[1:]
times, dist = zip(*time_dist)
# note that the timestamp from the pickle file already has timezone data
date_times = [datetime.fromtimestamp(x) for x in times]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
days = DayLocator()
hours = HourLocator(interval=8)
formatter = DateFormatter('%b %d')
axes.plot_date(date_times,dist, 'b-')
axes.xaxis.set_major_locator(days)
axes.xaxis.set_minor_locator(hours)
axes.xaxis.set_major_formatter(formatter)
axes.set_xlabel('Date and Time')
axes.set_ylabel('Number of vehicles on duty')
axes.set_title('Number of TTC vehicles on duty from April 27th to May 4th')
plt.show()
Note that the labels indicate the start of the day, so the data here starts the morning of Monday, April 28th and ends shortly after Monday, May 5th. Also note that the small gaps on April 30th can be attributed to a server error.
Even without looking at the exact times, there are clearly many more buses and streetcars on the road during weekday rush hours than on weekends, and far more than at night. Weekdays follow a bimodal distribution, whereas weekends follow a unimodal one. Also, there are consistently more vehicles active during the morning rush hour than the evening rush hour on weekdays.
At the peak of Saturday service, there are only about as many vehicles as during the least busy daytime hours of weekdays. There are even fewer active vehicles during peak hours on Sundays.
What velocities do TTC vehicles travel at, and what are some factors that affect it?
#velocity_collection = db.velocity
#velocities = []
#for idx, velocity_point in enumerate(velocity_collection.find(sort=[('timestamp_orig', ASCENDING)])):
#    if idx % 1000 == 0:
#        print(idx, end='\r')
#    timestamp = velocity_point['timestamp_orig']
#    v = math.sqrt(velocity_point['vx'] ** 2 + velocity_point['vy'] ** 2)
#    velocities.append((timestamp, v))
with open('velocities.pkl', 'rb') as velocities_file:
    velocities = pickle.load(velocities_file)
date_times, dist = zip(*velocities)
del velocities
n, bins, patchs = plt.hist(dist, bins=25, range=(0,20))
t = plt.title('Aggregated TTC Vehicles Velocities')
plt.xlabel('Velocity (m/s)')
plt.ylabel("Number of Vehicle-Time Points")
velocities_pd = pd.DataFrame({'date_times': date_times,
                              'velocity': dist})
print(velocities_pd['velocity'].describe())
print("IQR:", interquartile_range(velocities_pd['velocity']))
print("Pearson's Index: ", pearsons_index(velocities_pd['velocity']))
print("Range: ", data_range(velocities_pd['velocity']))
count    8735310.000000
mean           5.363947
std           19.412805
min            0.000000
25%            2.230395
50%            4.529046
75%            6.747218
max        11713.119951
dtype: float64
IQR: 4.51682350164
Pearson's Index: 0.129023176137
Range: 11713.1199507
Clearly, a vehicle with a velocity of 11,713 m/s makes no sense (unless the TTC has a space program; for comparison, the ISS orbits at roughly 7,700 m/s). Like before, let's take out the outliers.
outlier_b, outlier_t = outlier_range(velocities_pd['velocity'])
velocities_series = pd.Series(
    [x for x in velocities_pd['velocity'] if x >= outlier_b and x <= outlier_t])
n, bins, patchs = plt.hist(velocities_series, bins=25, range=(0,20))
t = plt.title('Aggregated TTC Vehicles Velocities')
plt.xlabel('Velocity (m/s)')
plt.ylabel("Number of Vehicle-Time Points")
print(velocities_series.describe())
print("IQR:", interquartile_range(velocities_series))
print("Pearson's Index: ", pearsons_index(velocities_series))
print("Range: ", data_range(velocities_series))
count    8617369.000000
mean           4.550967
std            3.079863
min            0.000000
25%            2.187578
50%            4.475319
75%            6.643428
max           13.522408
dtype: float64
IQR: 4.45584987475
Pearson's Index: 0.0736866892983
Range: 13.5224079069
As Pearson's index indicates, there is a slight right skew, but not very much. The range of the data (after removing outliers) is surprisingly small: from 0 m/s (0 km/h) to 13.5 m/s (48.6 km/h). Overall, the data is mound-shaped and roughly normal, apart from a strong mode at 0 (likely due to resting vehicles and drivers) and the slight right skew.
How does velocity of TTC vehicles change over time?
velocity_moving_average = ([], [])
i = 0
while i < len(velocities_pd):
    start_point = velocities_pd.iloc[i]
    t_start = start_point['date_times']
    t_end = t_start + 1000 * 60 * 15 # fifteen-minute moving averages
    s = 0
    n = 0
    while i < len(velocities_pd) and velocities_pd['date_times'][i] < t_end:
        this_point = velocities_pd.iloc[i]
        v = this_point['velocity']
        if outlier_b <= v <= outlier_t:
            s += v
            n += 1
        i += 100 # sample every 100th point for speed
    if n != 0:
        avg_vel = s / n
        velocity_moving_average[0].append(t_start)
        velocity_moving_average[1].append(avg_vel)
    i += 1
date_times_gen = [datetime.fromtimestamp(x // 1000) for x in velocity_moving_average[0]]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
days = DayLocator(interval=2)
formatter = DateFormatter('%b %d')
axes.plot_date(date_times_gen,velocity_moving_average[1],'b-')
axes.xaxis.set_major_locator(days)
axes.xaxis.set_major_formatter(formatter)
axes.set_xlabel('Date')
axes.set_ylabel('Average Velocity (m/s)')
axes.set_title('Velocity of TTC Vehicles Measured Over Time')
plt.show()
This clearly shows that the moving-average velocity of vehicles is periodic, but it's hard to read the period off this plot. Let's look specifically at the time of day and ignore the date.
times_gen = [x.time() for x in date_times_gen]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
axes.plot(times_gen,velocity_moving_average[1], 'b.')
axes.set_xlabel('Time')
axes.set_ylabel('Average Velocity (m/s)')
axes.set_title('Velocity of TTC Vehicles Measured Over a Day')
plt.show()
Interestingly, the TTC's vehicles are at their fastest very late at night. This is likely because there is little traffic and far fewer other TTC vehicles on the road, so they do not have to maintain as much spacing. There are also far fewer riders at these hours, so vehicles make less frequent stops. However, there is also much higher variance, likely due to vehicles running ahead of schedule and making long stops to re-synchronize.
Overall through the day, TTC vehicles appear to be mostly uniform in speed, with an overall median of 4.5 m/s, or approximately 16 km/h. This is remarkably slow, given that most streets have a speed limit between 40 km/h and 80 km/h.
Is there a correlation between velocity and busyness? (Measured by number of active vehicles.)
active_vehicles = db.active_vehicles
bus_vel = ([], [])
for i in range(len(velocity_moving_average[0])):
    ts = velocity_moving_average[0][i] // 1000
    vel = velocity_moving_average[1][i]
    number_active = active_vehicles.find_one({'timestamp': {'$gte': ts}},
                                             sort=[('timestamp', ASCENDING)])
    n = number_active['count']
    bus_vel[0].append(n)
    bus_vel[1].append(vel)
p = plt.plot(bus_vel[0], bus_vel[1], 'b.')
plt.xlabel("Number of Active Vehicles")
plt.ylabel("Average Velocity of Active Vehicles (m/s)")
t = plt.title("Average Velocity of Active Vehicles vs Number of Active Vehicles")
From the scatter plot, it looks like there is a weak negative correlation between the velocity of active vehicles and the number of active vehicles at a given time. Let's verify this by performing regression.
velocity_active_df = pd.DataFrame({"active": bus_vel[0], "vel": bus_vel[1]})
res = ols(y=velocity_active_df['vel'], x = velocity_active_df['active'])
print(res)
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 1273
Number of Degrees of Freedom: 2
R-squared: 0.2501
Adj R-squared: 0.2495
Rmse: 0.6348
F-stat (1, 1271): 423.9162, p-value: 0.0000
Degrees of Freedom: model 1, resid 1271
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0007     0.0000     -20.59     0.0000    -0.0008    -0.0007
     intercept     5.4518     0.0381     143.14     0.0000     5.3771     5.5264
---------------------------------End of Summary---------------------------------
The coefficient of determination (R-squared) is 0.2501, which indicates a very weak fit, and the negative coefficient on x indicates a weak negative correlation. Equivalently, about 25% of the variance in average velocity is explained by the number of active vehicles in the system.
This is actually an example of a confounding variable: the correlation is likely due to there being more vehicles in the middle of the day, when traffic is also heavier. Time of day drives both quantities.
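This kind of confounding can be illustrated with synthetic data (this check is not part of the original analysis): if time of day drives both the number of active vehicles and the traffic that slows them down, the overall correlation is strong even though it vanishes once the hour is held fixed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
hour = rng.integers(0, 24, size=2000)
daytime = ((hour >= 7) & (hour <= 19)).astype(float)  # the confounder
# both variables depend on daytime, but not directly on each other
active = 800 + 600 * daytime + rng.normal(0, 50, size=2000)
vel = 6.0 - 1.5 * daytime + rng.normal(0, 0.3, size=2000)
df = pd.DataFrame({'hour': hour, 'active': active, 'vel': vel})

# overall, active and vel look strongly negatively correlated...
print(df['active'].corr(df['vel']))
# ...but with the hour held fixed, the correlation (mostly noise) disappears
noon = df[df['hour'] == 12]
print(noon['active'].corr(noon['vel']))
```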
plt.hold(True)
p = plt.plot(bus_vel[0], bus_vel[1], 'b.')
plt.xlabel("Number of Active Vehicles")
plt.ylabel("Average Velocity of Active Vehicles (m/s)")
t = plt.title("Average Velocity of Active Vehicles vs Number of Active Vehicles")
m, b = -0.0007, 5.4518
x = np.arange(0, 2000, 100)
p = plt.plot(x, np.poly1d((m, b))(x), '-k')
plt.hold(False)