How many routes have how many trips?
The first thing I wanted to do was see which routes have the most trips, which gives us a rough estimate of which routes are busiest. This has a few caveats (a route with more trips may simply be shorter in length), but it's a good start. We don't technically need Stops yet, but it's nice to have all of our class definitions in one place.
Let's get our imports out of the way and define what routes, trips, and stops are.
%matplotlib inline
import csv, math, pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from classes import Route, Trip, Stop # just classes to store the data
from pandas.stats.api import ols
import pymongo
from pymongo import MongoClient, ASCENDING, DESCENDING
client = MongoClient()
db = client.datasummative # my name for the database
Download the data from here, extract, and place the files in a folder called "schedules". In it, you should see the files:
$ ls schedules/
agency.txt routes.txt stops.txt trips.txt
calendar.txt shapes.txt stop_times.txt
We don't use all of these files, but it's good to have them nonetheless.
Let's start by opening and parsing the routes.txt file. This should give us a list of all the routes in the TTC, which we can use for later analysis.
routes_list = {}
with open('schedules/routes.txt') as routes_file:
    routes_csv = csv.reader(routes_file)
    next(routes_csv) # skip header
    for route in routes_csv:
        route_id = int(route[0])
        route_short_name = route[2]
        route_long_name = route[3]
        route_type = int(route[5])
        routes_list[route_id] = Route(route_id, route_short_name,
                                      route_long_name, route_type)
Next, the trips.txt file. This'll tell us what trips each route has. For each trip we find, we'll attach it to a list associated with the route it belongs to.
A trip is one vehicle departing from its home station at one specific point in time.
with open('schedules/trips.txt') as trips_file:
    trips_csv = csv.reader(trips_file)
    next(trips_csv) # skip header
    for trip in trips_csv:
        route_id = int(trip[0])
        service_id = int(trip[1])
        trip_id = int(trip[2])
        trip_headsign = trip[3]
        route = routes_list[route_id]
        route.trips[trip_id] = Trip(trip_id, service_id, trip_headsign)
Now let's go ahead and analyze the data we have. We'll start by putting together tuples of routes and the number of trips they have, then plotting them to see their frequencies. In other words, we'll count the number of trips each route has.
## Distribution of trips vs routes
trips_freq = []
for route in routes_list.values():
    trips_freq.append((route.route_short_name, len(route.trips)))
with open('trips_routes.pkl', 'wb') as trips_routes_file:
    pickle.dump(trips_freq, trips_routes_file)
trips_series = pd.Series([x[1] for x in trips_freq])
print(trips_series.describe())
count     182.000000
mean      689.664835
std       622.333137
min        13.000000
25%       275.750000
50%       554.000000
75%       945.000000
max      4696.000000
dtype: float64
From this data, we see that there are 182 routes in total; over the six-week period, the mean route has ~690 trips, the median route has 554 trips, the route with the fewest has 13, and the route with the most has 4696.
That's pretty interesting, but still doesn't give us very much. Let's define Pearson's index, the range, and the interquartile range, and find out whether we have any outliers, according to the formulae we learned in class.
def pearsons_index(series):
    return 3 * (series.mean() - series.median()) / series.std()

def data_range(series):
    return series.quantile(1) - series.quantile(0)

def interquartile_range(series):
    return series.quantile(.75) - series.quantile(.25)

def outlier_range(series):
    iqr = interquartile_range(series)
    return (series.quantile(.25) - 1.5 * iqr, series.quantile(.75) + 1.5 * iqr)
print("Pearson's: %s" % pearsons_index(trips_series))
print("Range: %s" % data_range(trips_series))
print("IQR: %s" % interquartile_range(trips_series))
print("Outliers lie outside: (%s,%s)" % outlier_range(trips_series))
Pearson's: 0.653981736266
Range: 4683
IQR: 669.25
Outliers lie outside: (-728.125,1948.875)
According to the Pearson's Index, it would appear that the distribution of trips over routes is right-skewed, but not very much. Let's see the plot to make sure.
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist([x[1] for x in trips_freq], bins=25,
                           align='mid', range=(0, 5000))
plt.xlabel('Number of Trips')
plt.ylabel('Number of Routes')
plt.title('Distribution of trips over routes')
plt.xticks(rotation=70)
ax.set_xticks(bins[:-1])
plt.xticks(bins[:-1],['%.1f' % (x-0.5) for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
There appear to be a few outliers, and the distribution looks clearly skewed. Let's remove these outliers.
outlier_r = outlier_range(trips_series)
outliers = filter(lambda x: x[1] < outlier_r[0] or x[1] > outlier_r[1], trips_freq)
print(list(outliers))
[('29', 2224), ('32', 2011), ('501', 2705), ('504', 2025), ('509', 2059), ('510', 4696), ('52', 2043), ('BLR', 2181), ('YUS', 2267)]
Of the outlier routes, two are subways, four are streetcars and three are buses. Let's see what effect removing these outliers from our data has.
trips_series = pd.Series([x[1] for x in trips_freq if x[1] >= outlier_r[0] and x[1] <= outlier_r[1]])
print(trips_series.describe())
print("Pearson's: %s" % pearsons_index(trips_series))
print("IQR: %s" % interquartile_range(trips_series))
count     173.000000
mean      597.156069
std       445.952944
min        13.000000
25%       261.000000
50%       525.000000
75%       881.000000
max      1814.000000
dtype: float64
Pearson's: 0.485405940789
IQR: 620.0
As expected, removing the outliers decreased both Pearson's index and the IQR. Let's check the plot:
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(trips_series, bins=25,
                           align='mid', range=(0, 2000))
plt.xlabel('Number of Trips')
plt.ylabel('Number of Routes')
plt.title('Distribution of trips over routes')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.1f' % (x-0.5) for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
Despite the large mode on the left of the histogram, Pearson's index indicates that the data is not as skewed as it seems. Moreover, since only 9 of the 182 routes are outliers, the distribution can reasonably be treated as non-skewed. For us, this means that while routes vary by up to 1801 trips over six weeks, most routes are fairly close to the median of 554 trips over the six-week period.
How is the daily cost of a route related to its ridership?
ridership_collection = db.ridership
rts = []
riders = []
costs = []
for ridership_row in ridership_collection.find():
    rts.append(ridership_row['route'])
    riders.append(ridership_row['ridership'])
    costs.append(ridership_row['cost'])
riders_costs_df = pd.DataFrame({'route': rts, 'ridership': riders, 'cost': costs},
                               index=rts)
del rts
del costs
del riders
print(ols(x=riders_costs_df['ridership'], y=riders_costs_df['cost']))
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 138
Number of Degrees of Freedom: 2
R-squared: 0.9276
Adj R-squared: 0.9270
Rmse: 7045.0030
F-stat (1, 136): 1741.2502, p-value: 0.0000
Degrees of Freedom: model 1, resid 136
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     2.0056     0.0481      41.73     0.0000     1.9114     2.0998
     intercept  2478.4971   815.2188       3.04     0.0028   880.6683  4076.3258
---------------------------------End of Summary---------------------------------
# define a convenient way to graph some stuff, find its regression, and graph that too
def plot_with_reg(x, y):
    plt.hold(True)
    plt.plot(x, y, 'b.')
    res = ols(y=y, x=x)
    m = res.beta['x']
    b = res.beta['intercept']
    x_fit = np.linspace(min(x), max(x))
    y_fit = np.poly1d((m, b))
    plt.plot(x_fit, y_fit(x_fit), '-k')
    plt.hold(False)
plot_with_reg(x=riders_costs_df['ridership'], y=riders_costs_df['cost'])
t = plt.title("Cost of a Route vs. Number of Riders")
t = plt.xlabel("Number of Customers per Weekday")
t = plt.ylabel("Cost per Weekday ($)")
The regression indicates, as expected, that the cost of a route is strongly positively correlated with the number of daily riders it serves. We can interpret the coefficients as saying that every additional rider adds about $2 of cost to a route, on top of a fixed cost of roughly $2500 per route.
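To make that interpretation concrete, here's a small sketch (the function is mine, not part of the analysis) that plugs a ridership figure into the fitted coefficients from the regression summary:

```python
def predict_daily_cost(riders, slope=2.0056, intercept=2478.4971):
    """Estimate a route's weekday cost ($) from its weekday ridership,
    using the slope and intercept fitted by the regression above."""
    return slope * riders + intercept

# e.g. a hypothetical route serving 10,000 riders per weekday:
print(predict_daily_cost(10000))  # about $22,534 per weekday
```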
How are routes, trips, and types of vehicles related?
The data also tells us what type of vehicles each route has. For the TTC, this means either streetcar, subway or bus.
# from https://developers.google.com/transit/gtfs/reference#routes_fields
TYPE_SUBWAY = 1
TYPE_STREETCAR = 0
TYPE_BUS = 3
buses = 0
subways = 0
streetcars = 0
for route in routes_list.values():
    if route.route_type == TYPE_SUBWAY:
        subways += 1
    elif route.route_type == TYPE_STREETCAR:
        streetcars += 1
    elif route.route_type == TYPE_BUS:
        buses += 1
print("Bus routes: %s" % buses)
print("Subway routes: %s" % subways)
print("Streetcar routes: %s" % streetcars)
Bus routes: 166
Subway routes: 4
Streetcar routes: 12
This would be most useful as a pie chart.
plt.figure(figsize = (6,6))
p = plt.pie((buses, streetcars, subways),
            labels=("Bus Routes", "Streetcar Routes", "Subway Routes"),
            startangle=90, autopct="%1.1f%%")
t = plt.title("Vehicles in the TTC by Number of Routes")
It would also be useful to look at the vehicles by trips.
subways_trips = 0
buses_trips = 0
streetcars_trips = 0
trips_map = dict(trips_freq)
for route in routes_list:
    route_type = routes_list[route].route_type
    route_name = routes_list[route].route_short_name
    if route_type == TYPE_SUBWAY:
        subways_trips += trips_map[route_name]
    elif route_type == TYPE_BUS:
        buses_trips += trips_map[route_name]
    elif route_type == TYPE_STREETCAR:
        streetcars_trips += trips_map[route_name]
print("Bus trips: %s" % buses_trips)
print("Subway trips: %s" % subways_trips)
print("Streetcar trips: %s" % streetcars_trips)
plt.figure(figsize=(6,6))
p = plt.pie((buses_trips, streetcars_trips, subways_trips),
            labels=("Bus Trips", "Streetcar Trips", "Subway Trips"),
            startangle=90, autopct="%1.1f%%")
t = plt.title("Vehicles in the TTC by number of trips")
Bus trips: 101427
Subway trips: 7586
Streetcar trips: 16506
Evidently, the ordering remains the same, but buses hold a smaller majority, and both subways and streetcars take a larger share of the pie.
Next, we can see which vehicles average more trips per route. This is best done in a bar chart:
streetcar_trip_route = streetcars_trips / streetcars
buses_trip_route = buses_trips / buses
subway_trip_route = subways_trips / subways
print("Streetcars: %s trips per route" % streetcar_trip_route)
print("Buses: %s trips per route" % buses_trip_route)
print("Subways: %s trips per route" % subway_trip_route)
index = np.arange(3)
bar_width = .35
rect_buses = plt.bar(index, (buses_trip_route, streetcar_trip_route, subway_trip_route), bar_width)
plt.xlabel("Vehicle Type")
plt.ylabel("Average Trips per Route")
plt.title("Average Trips per Route by Vehicle Type")
ticks = plt.xticks(index + bar_width / 2, ("Buses", "Streetcars", "Subways"))
Streetcars: 1375.5 trips per route
Buses: 611.0060240963855 trips per route
Subways: 1896.5 trips per route
Clearly, subways run the most trips per route, followed by streetcars, followed by buses. This is a loose indication of how busy each type of vehicle is in the TTC. This makes sense, since subways are worked pretty much continuously over the day, and buses are 'resting' the most over a day.
The TTC has over 10,000 stop locations. Where are they?
stops_list = {}
with open('schedules/stops.txt') as stops_file:
    stops_csv = csv.reader(stops_file)
    next(stops_csv) # skip header
    for stop in stops_csv:
        stop_id = int(stop[0])
        stop_name = stop[2]
        stop_lat = float(stop[4])
        stop_lon = float(stop[5])
        stops_list[stop_id] = Stop(stop_id, stop_name, stop_lat, stop_lon)
lats = [x.stop_lat for x in stops_list.values()]
longs = [x.stop_lon for x in stops_list.values()]
plt.figure(figsize=(8,6))
plt.scatter(longs, lats, marker='.', c='b')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
t = plt.title('Geographic Distribution of All Stops')
It's interesting how obvious the slant of Toronto's streets relative to the GPS grid is. When we say we're going north on a street, we're likely also going slightly west. There's also an obvious gap in stops along the Don River, which we can follow as it travels to (from?) Lake Ontario.
Let's take the mean and median of these points to approximate the geographic centre of Toronto.
longs_pd = pd.Series(longs)
lats_pd = pd.Series(lats)
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(longs_pd, align='mid', bins=25)
plt.xlabel('Longitude')
plt.ylabel('Number of stops')
plt.title('Longitude of Stops')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.3f' % x for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
print("Longitudes: ", longs_pd.describe())
Longitudes:
count    10738.000000
mean       -79.399878
std          0.110536
min        -79.650347
25%        -79.489651
50%        -79.400623
75%        -79.313075
max        -79.123046
dtype: float64
fig, ax = plt.subplots(1,1)
n, bins, patches = ax.hist(lats_pd, align='mid', bins=25)
plt.xlabel('Latitude')
plt.ylabel('Number of stops')
plt.title('Latitude of Stops')
plt.xticks(rotation=70)
plt.xticks(bins[:-1],['%.3f' % x for x in bins[:-1]])
plt.subplots_adjust(bottom=0.2)
print("Latitudes: ", lats_pd.describe())
Latitudes:
count    10738.000000
mean        43.725382
std          0.060305
min         43.591810
25%         43.677031
50%         43.721847
75%         43.771064
max         43.914362
dtype: float64
Interestingly, the standard deviation of the longitudes is significantly larger than that of the latitudes, indicating that Toronto is wider along its east-west axis than its north-south axis. Both distributions are roughly mound-shaped and symmetric, however.
For us, this means that as long as you don't live on the fringes of Toronto, such as very far north or very far east/west, there is an adequate supply of bus stops.
What does real-time vehicle location reveal?
Thanks to NVAS, we can keep track of each bus or streetcar (but not subway) by simply querying a server. The service accepts only one route per request, so collecting data for the entire TTC takes 178 requests (one "meta-request"). The data was saved into Python pickle files.
Each meta-request took approximately 44 seconds, and one was run every two minutes for three weeks. Currently, there are a total of 10,126 pickle files, each containing the complete results of one meta-request.
The script to request the data and save it to a file is found in busLocations.py. For the sake of the filesystem, I've written a script to import this data into MongoDB, which can be found in busLocationsMongo.py.
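I haven't reproduced busLocations.py here, but the polling logic amounts to something like the sketch below. The BASE_URL and query parameters are placeholders, not the real feed's API, and `fetch`/`save` are injected so the structure is clear without doing any network I/O:

```python
import time

# Placeholder endpoint -- NOT the real feed URL.
BASE_URL = "http://example.com/vehicleLocations"

def build_request_url(route_tag, last_time=0):
    # one route per request; last_time asks only for updates since then
    return "%s?a=ttc&r=%s&t=%d" % (BASE_URL, route_tag, last_time)

def meta_request(route_tags, fetch):
    """One 'meta-request': one request per route, results keyed by route."""
    return {tag: fetch(build_request_url(tag)) for tag in route_tags}

def poll_forever(route_tags, fetch, save, interval=120):
    """Run a meta-request every `interval` seconds and save each result."""
    while True:
        started = time.time()
        save(meta_request(route_tags, fetch))
        time.sleep(max(0, interval - (time.time() - started)))
```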
Each data point in the database consists of a route tag, a direction tag, a heading, a vehicle ID, a latitude and longitude, and a timestamp (plus MongoDB's internal _id).
Let's count how many we have, and look at one specific data point.
collection = db.vehicles # the vehicles collection in the database
print("Number of points: %s" % collection.count())
print("An arbitrarily selected point: %s" % collection.find_one())
Number of points: 9420806
An arbitrarily selected point: {'routeTag': '191', '_id': ObjectId('536bedf4920c0e21c26c2e79'), 'dirTag': '191_1_191', 'heading': 71, 'vId': 7945, 'lat': 43.735985, 'timestamp': 1398694091674, 'long': -79.595016}
Wow! Over 9.42 million data points! There's a tremendous amount we could do with this data. Let's start by aggregating the headings of the vehicles. Heading is arguably a continuous variable, but we'll treat it as discrete since the feed reports only 360 integer values. Fortunately, MongoDB's aggregate function makes this easy.
headings = collection.aggregate([{"$group": {'_id': '$heading',
                                             'count': {'$sum': 1}}},
                                 {"$match": {'_id': {'$gte': 0}}}])
headings = headings['result']
Let's plot this on a polar bar chart, which is the obvious way to visualize this.
theta = []
r = []
for heading in sorted(headings, key=lambda x: x['_id']):
    theta.append(heading['_id'] * np.pi / 180)
    r.append(heading['count'])
ax = plt.subplot(111, polar=True)
ax.plot(theta, r)
ax.grid(True)
ax.set_theta_direction(-1)
ax.set_theta_zero_location('N')
plt.title("Headings of TTC vehicles.")
print("Maximum: %s" % max(r))
print("Values on two sides of maximum:", r[r.index(max(r)) - 1], r[r.index(max(r)) + 1])
print("Angle of maximum (radians): ", theta[r.index(max(r))])
print("Total count: ", sum(r))
Maximum: 830814
Values on two sides of maximum: 3654 6356
Angle of maximum (radians): 3.7699111843077517
Total count: 8635317
The strange spike at 216 degrees can probably be attributed to instrumentation error, since it is over two orders of magnitude larger than the values at 215 and 217 degrees. Let's remove it from our dataset and try again.
spike_idx = r.index(max(r))
del r[spike_idx]
del theta[spike_idx]
ax = plt.subplot(111, polar=True)
ax.plot(theta, r)
ax.grid(True)
ax.set_theta_direction(-1)
ax.set_theta_zero_location('N')
ti = plt.title("Headings of TTC vehicles.")
As expected, there are spikes heading north, east, south and west, but very little in between. The slight offset is likely due to the inclination of Toronto's streets, as seen in the geographic scatter plot of the stops. Another interesting, but also expected, observation is that the graph is mostly symmetrical -- vehicles spend as much time heading east as they do west, and as much heading north as south. This also means that, on average, vehicles travel at the same speed eastbound as westbound, and northbound as southbound.
One item of note is that the east and west spikes are much larger than the north and south spikes. This indicates, like we saw earlier with the larger standard deviation in longitude, that Toronto's east-west axis is longer than Toronto's north-south axis.
Let's load it into Pandas for easier analysis.
headings_dataframe = pd.DataFrame(data={"heading": [x * 180 / np.pi for x in theta],
                                        "count": r})
print(headings_dataframe[:10])
   count  heading
0   8722        0
1   6229        1
2   5606        2
3   7725        3
4  10014        4
5   5448        5
6   8428        6
7   7050        7
8   4323        8
9   4621        9
Obviously, there are more rows in this data but I truncated it for brevity. To find the mean, median and mode, we will have to implement those methods ourselves.
total_headings = headings_dataframe['count'].sum()
# median
median_idx = total_headings // 2
current_idx = 0
for heading in range(len(headings_dataframe)):
    current_idx += headings_dataframe.iloc[heading]['count']
    if current_idx >= median_idx:
        median = headings_dataframe.iloc[heading]['heading']
        break
# mean
s = 0
for heading in range(len(headings_dataframe)):
    s += headings_dataframe.iloc[heading]['count'] * headings_dataframe.iloc[heading]['heading']
mean = s / total_headings
# standard deviation
s = 0
for heading in headings_dataframe.iterrows():
    n = heading[1]['count']
    h = heading[1]['heading']
    s += n * (h - mean) ** 2
stdev = math.sqrt((1/total_headings) * s)
# mode
max_heading = max(headings_dataframe['count'])
modes = [x[1]['heading'] for x in headings_dataframe.iterrows() if x[1]['count'] == max_heading]
print("Mean: %.3f" % mean)
print("Standard deviation: %.3f" % stdev)
print("Median: %.0f" % median)
print("Modes: %.0f" % modes[0])
Mean: 197.525
Standard deviation: 102.488
Median: 179
Modes: 253
As expected, the mean and median land near 180 degrees, since the headings span 0 to 360 degrees and are spread roughly symmetrically around the circle. The mode is 253 degrees, which is to be expected since it is the west spike; the east spike is likely the second-most-popular heading. The standard deviation here is not particularly useful.
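One caveat worth noting (not part of the original analysis): for directional data, the ordinary linear mean is misleading, since 350 degrees and 20 degrees average to 185 rather than 5. A more appropriate statistic is the circular mean, which averages headings as unit vectors:

```python
import numpy as np

def circular_mean(angles_deg, weights=None):
    """Weighted circular mean of headings in degrees: sum the headings
    as unit vectors, then take the angle of the resultant vector."""
    a = np.radians(np.asarray(angles_deg, dtype=float))
    w = np.ones_like(a) if weights is None else np.asarray(weights, dtype=float)
    return np.degrees(np.arctan2((w * np.sin(a)).sum(),
                                 (w * np.cos(a)).sum())) % 360

# headings just either side of north average to north, not south:
print(round(circular_mean([350, 20]), 6))  # 5.0, where the linear mean gives 185
```

With the per-heading counts passed as `weights`, this would give a more honest "average heading" for the fleet than the 197.5 degrees above.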
What happens to the number of active vehicles over time?
import pickle
from matplotlib.dates import HourLocator, DateFormatter, DayLocator
from datetime import datetime
from pytz import timezone
import pytz
tz = timezone('America/Toronto')
with open('time_dist.pickle', 'rb') as time_dist_pickle:
    time_dist = pickle.load(time_dist_pickle)[1:]
times, dist = zip(*time_dist)
# note that the timestamp from the pickle file already has timezone data
date_times = [datetime.fromtimestamp(x) for x in times]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
days = DayLocator()
hours = HourLocator(interval=8)
formatter = DateFormatter('%b %d')
axes.plot_date(date_times,dist, 'b-')
axes.xaxis.set_major_locator(days)
axes.xaxis.set_minor_locator(hours)
axes.xaxis.set_major_formatter(formatter)
axes.set_xlabel('Date and Time')
axes.set_ylabel('Number of vehicles on duty')
axes.set_title('Number of TTC vehicles on duty from April 27th to May 4th')
plt.show()
Note that the labels indicate the start of the day, so the data here starts the morning of Monday, April 28th and ends shortly after Monday, May 5th. Also note that the small gaps on April 30th can be attributed to a server error.
Even without looking at the exact times, there are clearly many more buses and streetcars on the road during weekday rush hours than on weekends, and far more than at night. Weekdays follow a bimodal distribution, whereas weekends follow a unimodal one. Also, there are consistently more vehicles active during the morning rush hour than the evening rush hour on weekdays.
At the peak of Saturday service, there are only about as many vehicles as during the least busy daytime hours of weekdays. There are even fewer active vehicles during peak hours on Sundays.
What velocities do TTC vehicles travel at, and what are some factors that affect it?
#velocity_collection = db.velocity
#velocities = []
#for idx, velocity_point in enumerate(velocity_collection.find(sort=[('timestamp_orig', ASCENDING)])):
#    if idx % 1000 == 0:
#        print(idx, end='\r')
#    timestamp = velocity_point['timestamp_orig']
#    v = math.sqrt(velocity_point['vx'] ** 2 + velocity_point['vy'] ** 2)
#    velocities.append((timestamp, v))
with open('velocities.pkl', 'rb') as velocities_file:
    velocities = pickle.load(velocities_file)
date_times, dist = zip(*velocities)
del velocities
n, bins, patchs = plt.hist(dist, bins=25, range=(0,20))
t = plt.title('Aggregated TTC Vehicles Velocities')
plt.xlabel('Velocity (m/s)')
plt.ylabel("Number of Vehicle-Time Points")
velocities_pd = pd.DataFrame({'date_times': date_times,
                              'velocity': dist})
print(velocities_pd['velocity'].describe())
print("IQR:", interquartile_range(velocities_pd['velocity']))
print("Pearson's Index: ", pearsons_index(velocities_pd['velocity']))
print("Range: ", data_range(velocities_pd['velocity']))
count    8735310.000000
mean           5.363947
std           19.412805
min            0.000000
25%            2.230395
50%            4.529046
75%            6.747218
max        11713.119951
dtype: float64
IQR: 4.51682350164
Pearson's Index: 0.129023176137
Range: 11713.1199507
Clearly, a vehicle with a velocity of 11,713 m/s makes no sense (unless the TTC has a space program; for comparison, the ISS orbits at roughly 7,700 m/s). Like before, let's take out the outliers.
outlier_b, outlier_t = outlier_range(velocities_pd['velocity'])
velocities_series = pd.Series(
    [x for x in velocities_pd['velocity'] if x >= outlier_b and x <= outlier_t])
n, bins, patchs = plt.hist(velocities_series, bins=25, range=(0,20))
t = plt.title('Aggregated TTC Vehicles Velocities')
plt.xlabel('Velocity (m/s)')
plt.ylabel("Number of Vehicle-Time Points")
print(velocities_series.describe())
print("IQR:", interquartile_range(velocities_series))
print("Pearson's Index: ", pearsons_index(velocities_series))
print("Range: ", data_range(velocities_series))
count    8617369.000000
mean           4.550967
std            3.079863
min            0.000000
25%            2.187578
50%            4.475319
75%            6.643428
max           13.522408
dtype: float64
IQR: 4.45584987475
Pearson's Index: 0.0736866892983
Range: 13.5224079069
As Pearson's index indicates, there is a slight right skew, but not very much. The range of the data (after removing outliers) is surprisingly small: from 0 m/s (0 km/h) to 13.5 m/s (48.6 km/h). Overall, the data is mound-shaped and roughly normal, apart from a strong mode at 0 (likely due to resting vehicles and drivers) and the slight right skew.
How does velocity of TTC vehicles change over time?
velocity_moving_average = ([], [])
i = 0
while i < len(velocities_pd):
    start_point = velocities_pd.iloc[i]
    t_start = start_point['date_times']
    t_end = t_start + 1000 * 60 * 15 # fifteen-minute moving averages
    s = 0
    n = 0
    while i < len(velocities_pd) and velocities_pd['date_times'][i] < t_end:
        this_point = velocities_pd.iloc[i]
        v = this_point['velocity']
        if outlier_b <= v <= outlier_t:
            s += v
            n += 1
        i += 100 # sample every 100th point for speed
    if n != 0:
        avg_vel = s / n
        velocity_moving_average[0].append(t_start)
        velocity_moving_average[1].append(avg_vel)
    i += 1
date_times_gen = [datetime.fromtimestamp(x // 1000) for x in velocity_moving_average[0]]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
days = DayLocator(interval=2)
formatter = DateFormatter('%b %d')
axes.plot_date(date_times_gen,velocity_moving_average[1],'b-')
axes.xaxis.set_major_locator(days)
axes.xaxis.set_major_formatter(formatter)
axes.set_xlabel('Date')
axes.set_ylabel('Average Velocity (m/s)')
axes.set_title('Velocity of TTC Vehicles Measured Over Time')
plt.show()
This clearly shows that the moving-average velocity of vehicles is periodic, but it's hard to read the period off this plot. Let's look specifically at the time of day and ignore the date.
times_gen = [x.time() for x in date_times_gen]
fig, axes = plt.subplots()
fig.set_size_inches(8,6)
axes.plot(times_gen,velocity_moving_average[1], 'b.')
axes.set_xlabel('Time')
axes.set_ylabel('Average Velocity (m/s)')
axes.set_title('Velocity of TTC Vehicles Measured Over a Day')
plt.show()
Interestingly, the TTC's vehicles are at their fastest very late at night. This is likely because there is little traffic and far fewer other TTC vehicles on the road, so they do not have to maintain as much spacing. There are also far fewer riders at these hours, so vehicles make less frequent stops. However, there is also much higher variance, likely due to vehicles running ahead of schedule and making long stops to re-synchronize.
Overall through the day, TTC vehicles appear to be mostly uniform in speed, with an overall median of 4.5 m/s, or approximately 16 km/h. This is remarkably slow, given that most streets have a speed limit between 40 km/h and 80 km/h.
Is there a correlation between velocity and busyness? (Measured by number of active vehicles.)
active_vehicles = db.active_vehicles
bus_vel = ([], [])
for i in range(len(velocity_moving_average[0])):
    ts = velocity_moving_average[0][i] // 1000
    vel = velocity_moving_average[1][i]
    number_active = active_vehicles.find_one({'timestamp': {'$gte': ts}},
                                             sort=[('timestamp', ASCENDING)])
    n = number_active['count']
    bus_vel[0].append(n)
    bus_vel[1].append(vel)
p = plt.plot(bus_vel[0], bus_vel[1], 'b.')
plt.xlabel("Number of Active Vehicles")
plt.ylabel("Average Velocity of Active Vehicles (m/s)")
t = plt.title("Average Velocity of Active Vehicles vs Number of Active Vehicles")
From the scatter plot, it looks like there is a weak negative correlation between the velocity of active vehicles and the number of active vehicles at a given time. Let's verify this by performing regression.
velocity_active_df = pd.DataFrame({"active": bus_vel[0], "vel": bus_vel[1]})
res = ols(y=velocity_active_df['vel'], x = velocity_active_df['active'])
print(res)
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 1273
Number of Degrees of Freedom: 2
R-squared: 0.2501
Adj R-squared: 0.2495
Rmse: 0.6348
F-stat (1, 1271): 423.9162, p-value: 0.0000
Degrees of Freedom: model 1, resid 1271
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.0007     0.0000     -20.59     0.0000    -0.0008    -0.0007
     intercept     5.4518     0.0381     143.14     0.0000     5.3771     5.5264
---------------------------------End of Summary---------------------------------
The coefficient of determination (R-squared) is 0.2501, which indicates a very weak fit, and the negative coefficient on x indicates a weak negative correlation. Equivalently, about 25% of the variance in average velocity is explained by the number of active vehicles in the system.
This is actually an example of a confounding variable: the correlation is likely due to there being more vehicles in the middle of the day, when traffic is also heavier. Time of day drives both quantities.
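This kind of confounding can be illustrated with synthetic data (this check is not part of the original analysis): if time of day drives both the number of active vehicles and the traffic that slows them down, the overall correlation is strong even though it vanishes once the hour is held fixed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
hour = rng.integers(0, 24, size=2000)
daytime = ((hour >= 7) & (hour <= 19)).astype(float)  # the confounder
# both variables depend on daytime, but not directly on each other
active = 800 + 600 * daytime + rng.normal(0, 50, size=2000)
vel = 6.0 - 1.5 * daytime + rng.normal(0, 0.3, size=2000)
df = pd.DataFrame({'hour': hour, 'active': active, 'vel': vel})

# overall, active and vel look strongly negatively correlated...
print(df['active'].corr(df['vel']))
# ...but with the hour held fixed, the correlation (mostly noise) disappears
noon = df[df['hour'] == 12]
print(noon['active'].corr(noon['vel']))
```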
plt.hold(True)
p = plt.plot(bus_vel[0], bus_vel[1], 'b.')
plt.xlabel("Number of Active Vehicles")
plt.ylabel("Average Velocity of Active Vehicles (m/s)")
t = plt.title("Average Velocity of Active Vehicles vs Number of Active Vehicles")
m, b = -0.0007, 5.4518
x = np.arange(0, 2000, 100)
p = plt.plot(x, np.poly1d((m, b))(x), '-k')
plt.hold(False)