Tal Yarkoni (Email | Web | Twitter | GitHub), March 2014
w/ seaborn improvements by Michael Waskom
The first thing we need to do is import ye old faithful Python libraries--pandas, numpy, and matplotlib. Without these, we are nothing! (With them, we're still nothing, but we're a special kind of nothing that can analyze data and generate plots quickly.) We'll also import seaborn to get nice matplotlib defaults for free.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
sns.set(rc={"axes.labelsize": 15})
data = pd.read_csv('Restaurant_Inspection_Scores.csv')
data = data.drop_duplicates() # Get rid of a few duplicate lines
# Create unique ID column that's human-readable--concatenate name and address
data['string_id'] = [x.lower() for x in (data['Restaurant Name'] + '_' + data['Facility ID'].astype(str))]
# ...but use Facility ID for most analyses
data['id'] = data.pop('Facility ID')
# We'll also add a datetime column, which will come in handy later
data['date'] = pd.to_datetime(data['Inspection Date'])
# Also for convenience
restaurants = data.groupby('id')
# Print some stuff
print "Number of inspections:", len(data)
print "\nNumber of restaurants inspected:", len(restaurants)
print "\nColumns:\n", data.dtypes
Number of inspections: 19454 Number of restaurants inspected: 4405 Columns: Restaurant Name object Zip Code int64 Inspection Date object Score int64 Address object Process Description object string_id object id int64 date datetime64[ns] dtype: object
We have nearly 20,000 inspections from over 4,400 distinct establishments. For each inspection, we have (among other things) a health inspection score, an establishment name, an inspection date, and a description of the process (i.e., whether the inspection was routine, or reflected a follow-up visit triggered by an establishment failing an earlier inspection).
Okay, now that we have data, we can start our exploration. First, let's look at the histogram of scores:
def plot_histogram():
data['Score'].hist(bins=50, lw=0)
plt.xlabel('Health inspection score')
plt.ylabel('Count')
plot_histogram()
## Uncomment the next line to print a frequency table as well
# print data['Score'].value_counts().sort_index()
There are several interesting things to note here. One is that there are zero health inspection scores of 98 or 99--presumably because the smallest possible violation results in a loss of at least 3 points. A similar explanation probably accounts for the dip around 95--i.e., it's relatively unusual to getting docked exactly 5 points for a single infraction.
More curious, though, is what happens at the low end of the distribution--specifically around a score of 70, where there appears to be a relatively sharp discontinuity. To see it better, we can zoom in:
plot_histogram()
plt.xlim([60,80])
plt.ylim([0,200])
(0, 200)
That's a rather sharp, and decidedly artificial-looking, break. What could explain it, you ask? Well, it turns out that 70 is the minimum score required to pass inspection. If a restaurant gets a score below that, it fails inspection, which presumably triggers all kinds of unpleasant things--e.g., more paperwork for the inspector, bad PR for the restaurant, and a follow-up (non-routine) inspection one or two weeks later. So one possibility here is that there's some pressure on inspectors--whether explicit or implicit--to avoid giving restaurants failing scores. Alternatively, there are plenty of other more benign explanations. For example, it's possible that the health inspection guidelines explicitly allow inspectors to let restaurants off with a warning. Or perhaps the scoring algorithm isn't just a linear sum of all the violations observed, and there has to be some particularly egregious health code violation in order for a restaurant to receive a score below 70. Or, inspectors may be pragmatically working around weaknesses in the code--e.g., a restaurant may theoretically be able to fail inspection because of a large number of minor infractions, even if no single infraction presents any meaningful health risk to customers. Still, absent an explanation, there's at least some potential cause for concern here, and it's certainly consistent with the data that health inspectors might be systematically grading restaurants more leniently than they should be.
import re
def plot_restaurants_on_map(image_file, extent, cmap='summer', **kwargs):
# Extract latitude and longitude
addresses = restaurants['Address'].agg(['first','count'])
scores = restaurants['Score'].mean()
coords = [x.split('\n')[-1] for x in addresses['first']]
latitude, longitude = zip(*[re.findall('[-0-9.]+', c) for c in coords])
# Load the static underlay
with sns.axes_style("dark"):
fig, ax = plt.subplots(figsize=(12, 10))
im = plt.imread(image_file)
implot = ax.imshow(im, extent=extent)
# Add the restaurant overlay
cmap = mpl.cm.get_cmap(cmap, 5)
p = ax.scatter(longitude, latitude, c=scores, cmap=cmap, **kwargs)
ax.set(xlim=extent[:2], ylim=extent[2:])
cbar = fig.colorbar(p, fraction=0.03)
cbar.set_label('Restaurant\'s mean inspection score', rotation=270, fontsize=14, labelpad=20)
fig.tight_layout()
# Underlay of Austin area--we use a manually exported image from OpenStreetMap
image_file = 'austin.png'
extent = [-98.0901, -97.5140, 30.0964, 30.5132]
plot_restaurants_on_map(image_file, extent, cmap='Purples_r', alpha=0.6)
This map isn't terribly informative; to be perfectly honest, I added it mostly as an excuse to learn how to overlay layers on an image underlay in matplotlib (which turns out to be very straightforward once we've manually exported PNG underlays from OpenStreetMap). With some more work, we could probably generate a heatmap that handles geographic variations in inspection quality more sensibly, though it's unlikely that there are (m)any meaningful differences one could discern at this scale. But if nothing else, it gives us a sense of where all of the restaurants in Austin are located. Not surprisingly, they're mostly massed downtown and/or around the major North-South roads through the city--Congress, Lamar, etc.
Since it's hard to see much of anything at this scale, let's blow things up and take a closer look at downtown.
# Second OSM map--this one only covers downtown.
image_file = 'austin_downtown.png'
extent = [-97.7638, -97.7278, 30.2527, 30.2787]
plot_restaurants_on_map(image_file, extent, cmap='PuBu_r', alpha=1.0, s=5 * restaurants['Score'].count() ** 2)
Here it's much easier to see what's going on. A huge proportion of restaurants are located on Congress Ave or Sixth Street, but it's not obvious that there's any pattern to the distribution of health inspection scores. Just for fun, I've used marker size to index the number of inspections each restaurant has undergone in the past 3 years. At a cursory glance, it looks as though restaurants in the South Lamar/Barton Springs area (bottom left corner) and East Side (right edge) tend to get inspected more than restaurants downtown proper, but that probably just reflects more rapid turnover downtown, where rents are higher.
An important question is whether scores are stable over time within individual restaurants. If a restaurant gets a good score at Time 1, is it also likely to get a good score at Time 2? Because the vast majority of restaurants have been inspected multiple times over the last three years, we can directly answer this question by computing the test-retest correlation across multiple inspections. A quick, though somewhat crude, way to do this is to randomly select two inspections for every restaurant with multiple inspections and compute the correlation. This leaves us with 4074 restaurants, and the resulting scatter plot looks like this:
# Filter for restaurants with > 1 inspection
two_or_more = restaurants.filter(lambda x: x.shape[0] > 1)
print "Number of restaurants with two or more inspections:", two_or_more['id'].nunique()
# Shuffle order and select a random pair for each restaurant
two_or_more = two_or_more.reindex(np.random.permutation(two_or_more.index))
random_pairs = two_or_more.groupby('id', as_index=False).head(2).sort('id')
random_pairs['number'] = np.tile(np.array([1,2]), len(random_pairs)/2)
pairs = random_pairs.pivot(index='id', columns='number', values='Score')
r, p = stats.pearsonr(*pairs.values.T)
# Plot the relationship
f, ax = plt.subplots(figsize=(6, 6))
sns.regplot(pairs[1], pairs[2], x_jitter=2, y_jitter=2,
color="#334477", scatter_kws={"alpha": .05, "s":100})
ax.text(62, 72, "r = %.2f" % r, fontsize=14)
ax.set(xlim=(60, 105), ylim=(60, 105),
xlabel='Score for Inspection 1',
ylabel='Score for Inspection 2');
Number of restaurants with two or more inspections: 4074
The test-retest correlation of ~.5 indicates that there is indeed a fair amount of consistency to the scores. That's reassuring, in that a very low correlation might lead us to worry that the health inspection process itself is unreliable, since it's hard to imagine that there aren't real differences in how mindful different proprietors are of health code--not to mention the fact that some kinds of establishments are likely to be at much higher risk of code violations than others in virtue of the kind of food they serve.
While we don't have any obvious way of identifying the conscientiousness level of individual proprietors from this dataset, we do have the ability to break out establishments by the kind of food they serve, because we have the full name and address of every restaurant. In an ideal world (i.e., one where your humble narrator has an infinite amount of time to allot to this kind of thing), we could query an API somewhere and systematically code every restaurant by type of cuisine. But in this less-than-ideal, time-pressed world, we're going to categorize restaurant type much more crudely. We'll take advantage of the fact that many restaurants use their name to announce the kind of food they serve--witness, for example, "Carino's Italian", "Asia Chinese Restaurant", and "Mi Casa Mexican Restaurant". By grouping together restaurants with the same substring in their names, we can generate the following plot:
def plot_restaurant_score_by_type(types):
""" Takes a list of strings, each defining a group of restaurants that contain that
particular string within their title. """
means, sems, labels = [], [], []
n_types = len(types)
# Generate means, CI/SEM, and labels
for c in types:
stores = data[data['string_id'].str.contains(c)]
unique_stores = stores.groupby('string_id')['Score'].mean()
n_stores = len(unique_stores)
n_inspections = len(stores)
std = unique_stores.std()
means.append(unique_stores.mean())
sems.append(stats.sem(unique_stores))
labels.append('"' + c + '" (%d, %d)' % (n_stores, n_inspections))
# Order by descending score
plot_data = pd.DataFrame({'mean':means, 'sem':sems}, index=labels)
plot_data = plot_data.sort('mean', ascending=True)
# Plot
pal = sns.color_palette("husl", len(plot_data))
f, ax = plt.subplots(figsize=(4, 8))
for y, (label, (mean, sem)) in enumerate(plot_data.iterrows()):
ax.errorbar(mean, y, xerr=sem, fmt='o', color=pal[y])
ax.set_ylim([-1, n_types])
ax.set_yticks(np.arange(len(plot_data)))
ax.set_yticklabels(plot_data.index, rotation=0, horizontalalignment='right')
ax.set_xlabel('Health inspection score', fontsize=14)
ax.set_ylabel('Restaurant name contains...', fontsize=14)
types = ['chin', 'mexic', 'indian', 'thai', 'italian', 'taco', 'sushi|japan',
'pizz', 'korea', 'burger', 'donut', 'coffee', 'bakery', 'ice cream',
'chicken', 'buffet', 'grill', 'bbq|barbe', 'steak']
plot_restaurant_score_by_type(types)
The values in parentheses on the y-axis labels denote the number of unique restaurants and total inspections in each group, respectively. The error bars denote the 95% confidence interval.
The plot largely corroborates what we probably already knew, with perhaps a couple of minor surprises. It's not surprising that establishments primarily serving coffee, ice cream, or pastries tend to do very well on health inspections. Presumably there aren't very many ways you can dangerously contaminate someone's coffee (assuming your barista isn't stirring milk into customers' cups with his or her index finger).
At the other end of the spectrum, the 8 restaurants with the word "buffet" in their name do pretty abysmally. Their average score of 77 is pretty close to the magical failure number of 70. Across 36 different inspections, no "buffet"-containing resturant in Austin has managed to obtain a score higher than 90 in the past 3 years. Which means that the single best "buffet" in Austin is no safer than the average McDonald's (see below). Of course, this conclusion only applies to buffet restaurants that have the word "buffet" in their name. It's entirely possible that the much larger number of all-you-can-eat restaurants not represented in this analysis are paragons of sanitary dining. So I'm not saying you should avoid buffets, period. I'm just... saying.
Also not happy winners in this analysis: restaurants serving ethnic food--with the possible exception of Indian restaurants (though the variance for Indian restaurants is high). Asian restaurants do particularly poorly; for example, "thai"-containing restaurants obtain a score of 83, on average. On a positive note, though, we can see that the 95-odd "burger"-containing establishments in this list--you know, the ones where cooks spend much of their day wading around in buckets of raw meat--tend to have very good scores (perhaps because of the perception that health inspectors are gunning for them, I don't know). So, given a choice between Thai food and burger, health code-wise, you're arguably better off with burger. Of course, in the grand scheme of things, these are not terribly large differences, and the vast majority of the time, you're going to be just fine after eating pretty much anywhere (including even those restaurants that fail inspection).
Since it's easy to bin restaurants by title, something else we can do is look at the performance of major food chains. Here's what that looks like, for selected chains with major representation in Austin:
types = ['starbucks', 'mcdonald', 'subway', 'popeye', '7-eleven', 'domino',
'jamba', 'schlotzsky', 'taco shack', 'burger king', 'wendy',
'panda', 'chick-fil']
plot_restaurant_score_by_type(types)
Good news! It's pretty much impossible to get sick at Jamba Juice or Starbucks. Also, if you're looking for health code-friendly Asian food, Panda Express is your new best friend. And if your primary basis for deciding between drive-thrus is tiny differences in health inspection score, you may want to opt for Wendy's or Burger King over McDonald's. Otherwise, there's nothing terribly interesting here.
So far we've focused on what the restaurants bring to the table (pun intended--sorry), but we can also take a look at circumstantial factors surrounding the inspections themselves. We don't have any way to identify individual inspectors, unfortunately (which would have allowed us to determine how consistently the health code is applied from person to person), but we can look at scores as a function of time--year, month, week, etc. For instance, here's mean health inspection score as a function of month of year:
import calendar
def plot_inspections_by_time(unit, labels, summary=False, **kwargs):
data[unit] = data['date'].map(lambda d: labels[getattr(d, unit)][:3])
if summary:
print "Summary of scores:"
print np.round(data.groupby(unit)['Score'].agg(['mean', 'count']), 1)
f, ax = plt.subplots(figsize=(10, 5))
return sns.pointplot(unit, "Score", data=data, x_order=[n[:3] for n in labels[1:]], **kwargs)
_ = plot_inspections_by_time('month', calendar.month_name, True).set_ylim(82, 96)
Summary of scores: mean count month Apr 91.6 1856 Aug 91.5 1711 Dec 91.6 1478 Feb 91.2 1495 Jan 90.9 1739 Jul 91.2 1512 Jun 91.3 1608 Mar 91.7 1472 May 91.7 1710 Nov 91.9 1651 Oct 91.8 1816 Sep 91.7 1406 [12 rows x 2 columns]
The error bars display the the bootstrapped 95% confidence interval for each point. You could maybe make a case that, say, restaurants perform worse in January (mean = 90.9) than in November (91.9). But it would be a pretty weak case, and would probably have little practical value in any case (note the limited range of the y axis).
Somewhat more interesting is this:
# Plot by weekday
_ = plot_inspections_by_time('dayofweek', calendar.day_name, True).set_xlabel('Day of week')
Summary of scores: mean count dayofweek Fri 92.1 3114 Mon 91.4 3108 Sat 92.9 137 Sun 88.6 12 Thu 91.5 4454 Tue 91.4 4670 Wed 91.3 3959 [7 rows x 2 columns]
Here we see scores plotted by day of the week on which the inspection occurred. Before you get excited, this plot is not saying that restaurant hygiene levels improve progressively through the week before plummeting on Sunday. What it is saying is that many fewer inspections take place on Saturday and Sunday--especially Sunday (only 12 inspections over 3 years!). That's why the error bars for the weekend days are so huge: we can't estimate the actual score very well, because we don't have enough data. We can see this more clearly by plotting the counts directly:
day_names = list(calendar.day_name)
data["weekday"] = data["date"].map(lambda d: day_names[d.weekday()])
sns.barplot("weekday", data=data, x_order=day_names, palette="PuBuGn_d");
This is potentially interesting from a public health standpoint, inasmuch as (a) restaurants do some of their best business on weekends, so ignoring the weekends means inspectors don't have a chance to see what the kitchen looks like when things get hectic, and (b) if you're an unscrupulous proprietor, the fact that inspectors almost never show up on the weekend arguably allows you to worry a little bit less about food sanitation between Friday night and Monday morning. Given that food inspectors probably already have relatively non-standard hours (I imagine many if not most restaurant visits occur after 5 pm), and the element of surprise would seem to be an important one, the infrequency of weekend inspections seems somewhat surprising. But I'm not a public health official, so I imagine there's probably a good reason for this (I mean, aside from just not wanting to pay health inspectors more--which is, I suppose, a pretty good reason).
An obvious question to ask is: do health inspections actually work? That is, do restaurants shape up in the days immediately following a failed inspection? The answer here is (fortunately!) a resounding yes:
# Get pairs of consecutive inspections, with columns for difference in score and days elapsed
new_data = data.copy().sort(['id', 'date'])
new_data['orig_id'] = new_data.index
shifted = new_data[['Score', 'date']].shift(1)
firsts = new_data.groupby('id', sort=False).first()['orig_id']
shifted.loc[firsts] = np.nan
new_data[['shifted_score', 'shifted_date']] = shifted
new_data = new_data.dropna()
new_data['score_diff'] = new_data['Score'] - new_data['shifted_score']
new_data['date_diff'] = (new_data['date'] - new_data['shifted_date']).astype('timedelta64[D]').astype(int)
# Map inspection types to some ugly-ass colors
processes = list(new_data['Process Description'].unique())
colordict = dict(zip(processes, sns.color_palette("muted", 3)))
markers = [] # Track layers for legend
# Loop over inspection types, print stats, add to marker list
print "Correlation between time since last inspection and change in score, by inspection type:"
for kind, color in colordict.items():
tmp = new_data[new_data['Process Description'] == kind]
r = tmp.corr().iloc[0,1]
mean_change = tmp['score_diff'].mean()
print "\t[%s]: r = %.2f, n = %d, mean change in score = %.1f" % (kind, r, len(tmp), mean_change)
p = plt.scatter(tmp['date_diff'], tmp['score_diff'], alpha=0.6, color=color, s=30)
markers.append(p)
plt.xlim([-5, 500]) # Truncate to reasonable limits
plt.ylim([-40, 40])
plt.xlabel("Days since last inspection", fontsize=14)
plt.ylabel("Change in score from last inspection", fontsize=14)
legend = plt.legend(markers, colordict.keys(), frameon=True, fontsize=12)
frame = legend.get_frame()
frame.set_facecolor("0.9")
frame.set_edgecolor("0.3")
Correlation between time since last inspection and change in score, by inspection type: [2nd Follow Up to 50 - 69]: r = -0.82, n = 3, mean change in score = 21.7 [1st Follow Up to Routine of 50 - 69]: r = -0.00, n = 78, mean change in score = 21.4 [Routine Inspection]: r = 0.04, n = 14968, mean change in score = -0.1
Here, for every pair of consecutive inspections in the dataset, we're plotting the number of days elapsed since the previous inspection (x-axis) against the change in score between inspections (y-axis). The cases are coloring by the kind of inspection (i.e., routine inspections vs. follow-up to earlier failed inspection). Note that follow-ups are invariably conducted within about a month of a failed inspection. More importantly, there's almost always a substantial improvement in score for these inspections (mean improvement of > 20 points), which is a good sign--and perhaps not surprising, given that repeated health code violations are a quick way for a restaurant to find its doors closed (a cynical statistician might also grumble about regression to the mean, but we'll let that go).
A related question the plot above lets us answer is whether there's any relationship between time since last inspection and change in score. Intuitively, you might think that maybe restaurants tend to "slip" over time, so that restaurants would do better if they were randomly re-inspected at fairly short intervals. But this doesn't appear to be the case; the correlation between the x and y dimensions for routine inspections is -.04, which is in the predicted direction, but probably not strong enough to write home about. So it doesn't look like more frequent inspections keep restaurants honest--though presumably they do allow inspectors to catch violators more quickly.
The health inspection dataset has its limits; as I mentioned above, the inspection records don't include some potentially useful features, like who carried out each inspection, how long the inspection lasted, what each restaurant got dinged for, and so on. Still, there are plenty of other questions we could ask with the existing data, given enough time and energy (want to contribute? Clone the open data flights repo!). For example:
So, summing up, what have we learned from this exercise? In no particular order: