Notebook

Avoid the buffet

An analysis of Austin restaurant health inspection scores, 2011 - 2014

Tal Yarkoni (Email | Web | Twitter | GitHub), March 2014

w/ seaborn improvements by Michael Waskom

This notebook combines four of my favorite things: Austin, eating out, data mining, and Python (not necessarily in that order). The impetus was a random search through the City of Austin's slick data portal, which provides interactive access to over 300 datasets and views. Included among these is a comprehensive set of restaurant health inspection scores for the last 3 years. You can search the data online here. Or, if you're so inclined, you can download it and play with it yourself. Since I like food, I decided that I'm so inclined. The rest of this post describes, in a rather large number of words, some analyses of this data. At the very end I summarize what I think I've learned about health inspections in Austin, so if you're already bored, you can skip to the bottom. But if you do, you'll miss all the chewy, delicious code and figures!

Preliminaries

The first thing we need to do is import ye old faithful Python libraries--pandas, numpy, and matplotlib. Without these, we are nothing! (With them, we're still nothing, but we're a special kind of nothing that can analyze data and generate plots quickly.) We'll also import seaborn to get nice matplotlib defaults for free.

In [1]:

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
sns.set(rc={"axes.labelsize": 15})

AbracaDATAbra!

Next, we read in the data from CSV and take a look at what we're working with. We also manage to come up with a rather silly section header.

In [2]:

data = pd.read_csv('Restaurant_Inspection_Scores.csv')
data = data.drop_duplicates()  # Get rid of a few duplicate lines

# Create unique ID column that's human-readable--concatenate name and address
data['string_id'] = [x.lower() for x in (data['Restaurant Name'] + '_' + data['Facility ID'].astype(str))]

# ...but use Facility ID for most analyses
data['id'] = data.pop('Facility ID')

# We'll also add a datetime column, which will come in handy later
data['date'] = pd.to_datetime(data['Inspection Date'])

# Also for convenience
restaurants = data.groupby('id')

# Print some stuff
print "Number of inspections:", len(data)
print "\nNumber of restaurants inspected:", len(restaurants)
print "\nColumns:\n", data.dtypes

Number of inspections: 19454

Number of restaurants inspected: 4405

Columns:
Restaurant Name                object
Zip Code                        int64
Inspection Date                object
Score                           int64
Address                        object
Process Description            object
string_id                      object
id                              int64
date                   datetime64[ns]
dtype: object

We have nearly 20,000 inspections from over 4,400 distinct establishments. For each inspection, we have (among other things) a health inspection score, an establishment name, an inspection date, and a description of the process (i.e., whether the inspection was routine, or reflected a follow-up visit triggered by an establishment failing an earlier inspection).

To the plotmobile!

Okay, now that we have data, we can start our exploration. First, let's look at the histogram of scores:

In [3]:

def plot_histogram():
    data['Score'].hist(bins=50, lw=0)
    plt.xlabel('Health inspection score')
    plt.ylabel('Count')

plot_histogram()

## Uncomment the next line to print a frequency table as well
# print data['Score'].value_counts().sort_index()

There are several interesting things to note here. One is that there are zero health inspection scores of 98 or 99--presumably because the smallest possible violation results in a loss of at least 3 points. A similar explanation probably accounts for the dip around 95--i.e., it's relatively unusual to getting docked exactly 5 points for a single infraction.

More curious, though, is what happens at the low end of the distribution--specifically around a score of 70, where there appears to be a relatively sharp discontinuity. To see it better, we can zoom in:

In [4]:

plot_histogram()
plt.xlim([60,80])
plt.ylim([0,200])

Out[4]:

(0, 200)

That's a rather sharp, and decidedly artificial-looking, break. What could explain it, you ask? Well, it turns out that 70 is the minimum score required to pass inspection. If a restaurant gets a score below that, it fails inspection, which presumably triggers all kinds of unpleasant things--e.g., more paperwork for the inspector, bad PR for the restaurant, and a follow-up (non-routine) inspection one or two weeks later. So one possibility here is that there's some pressure on inspectors--whether explicit or implicit--to avoid giving restaurants failing scores. Alternatively, there are plenty of other more benign explanations. For example, it's possible that the health inspection guidelines explicitly allow inspectors to let restaurants off with a warning. Or perhaps the scoring algorithm isn't just a linear sum of all the violations observed, and there has to be some particularly egregious health code violation in order for a restaurant to receive a score below 70. Or, inspectors may be pragmatically working around weaknesses in the code--e.g., a restaurant may theoretically be able to fail inspection because of a large number of minor infractions, even if no single infraction presents any meaningful health risk to customers. Still, absent an explanation, there's at least some potential cause for concern here, and it's certainly consistent with the data that health inspectors might be systematically grading restaurants more leniently than they should be.

In [5]:

import re

def plot_restaurants_on_map(image_file, extent, cmap='summer', **kwargs):
    
    # Extract latitude and longitude
    addresses = restaurants['Address'].agg(['first','count'])
    scores = restaurants['Score'].mean()
    coords = [x.split('\n')[-1] for x in addresses['first']]
    latitude, longitude = zip(*[re.findall('[-0-9.]+', c) for c in coords])
    
    # Load the static underlay
    with sns.axes_style("dark"):
        fig, ax = plt.subplots(figsize=(12, 10))
    im = plt.imread(image_file)
    implot = ax.imshow(im, extent=extent)
    
    # Add the restaurant overlay
    cmap = mpl.cm.get_cmap(cmap, 5)
    p = ax.scatter(longitude, latitude, c=scores, cmap=cmap, **kwargs)
    ax.set(xlim=extent[:2], ylim=extent[2:])

    cbar = fig.colorbar(p, fraction=0.03)
    cbar.set_label('Restaurant\'s mean inspection score', rotation=270, fontsize=14, labelpad=20)
    fig.tight_layout()
    
# Underlay of Austin area--we use a manually exported image from OpenStreetMap
image_file = 'austin.png'
extent = [-98.0901, -97.5140, 30.0964, 30.5132]

plot_restaurants_on_map(image_file, extent, cmap='Purples_r', alpha=0.6)

This map isn't terribly informative; to be perfectly honest, I added it mostly as an excuse to learn how to overlay layers on an image underlay in matplotlib (which turns out to be very straightforward once we've manually exported PNG underlays from OpenStreetMap). With some more work, we could probably generate a heatmap that handles geographic variations in inspection quality more sensibly, though it's unlikely that there are (m)any meaningful differences one could discern at this scale. But if nothing else, it gives us a sense of where all of the restaurants in Austin are located. Not surprisingly, they're mostly massed downtown and/or around the major North-South roads through the city--Congress, Lamar, etc.

Since it's hard to see much of anything at this scale, let's blow things up and take a closer look at downtown.

In [6]:

# Second OSM map--this one only covers downtown.
image_file = 'austin_downtown.png'
extent = [-97.7638, -97.7278, 30.2527, 30.2787]

plot_restaurants_on_map(image_file, extent, cmap='PuBu_r', alpha=1.0, s=5 * restaurants['Score'].count() ** 2)

Here it's much easier to see what's going on. A huge proportion of restaurants are located on Congress Ave or Sixth Street, but it's not obvious that there's any pattern to the distribution of health inspection scores. Just for fun, I've used marker size to index the number of inspections each restaurant has undergone in the past 3 years. At a cursory glance, it looks as though restaurants in the South Lamar/Barton Springs area (bottom left corner) and East Side (right edge) tend to get inspected more than restaurants downtown proper, but that probably just reflects more rapid turnover downtown, where rents are higher.

Would a restaurant on any other day smell as sweet?

An important question is whether scores are stable over time within individual restaurants. If a restaurant gets a good score at Time 1, is it also likely to get a good score at Time 2? Because the vast majority of restaurants have been inspected multiple times over the last three years, we can directly answer this question by computing the test-retest correlation across multiple inspections. A quick, though somewhat crude, way to do this is to randomly select two inspections for every restaurant with multiple inspections and compute the correlation. This leaves us with 4074 restaurants, and the resulting scatter plot looks like this:

In [7]:

# Filter for restaurants with > 1 inspection
two_or_more = restaurants.filter(lambda x: x.shape[0] > 1)
print "Number of restaurants with two or more inspections:", two_or_more['id'].nunique()

# Shuffle order and select a random pair for each restaurant
two_or_more = two_or_more.reindex(np.random.permutation(two_or_more.index))
random_pairs = two_or_more.groupby('id', as_index=False).head(2).sort('id')
random_pairs['number'] = np.tile(np.array([1,2]), len(random_pairs)/2)
pairs = random_pairs.pivot(index='id', columns='number', values='Score')
r, p = stats.pearsonr(*pairs.values.T)

# Plot the relationship
f, ax = plt.subplots(figsize=(6, 6))
sns.regplot(pairs[1], pairs[2], x_jitter=2, y_jitter=2,
            color="#334477", scatter_kws={"alpha": .05, "s":100})

ax.text(62, 72, "r = %.2f" % r, fontsize=14)
ax.set(xlim=(60, 105), ylim=(60, 105),
        xlabel='Score for Inspection 1',
        ylabel='Score for Inspection 2');

Number of restaurants with two or more inspections: 4074

The test-retest correlation of ~.5 indicates that there is indeed a fair amount of consistency to the scores. That's reassuring, in that a very low correlation might lead us to worry that the health inspection process itself is unreliable, since it's hard to imagine that there aren't real differences in how mindful different proprietors are of health code--not to mention the fact that some kinds of establishments are likely to be at much higher risk of code violations than others in virtue of the kind of food they serve.

Avoid the buffet!

While we don't have any obvious way of identifying the conscientiousness level of individual proprietors from this dataset, we do have the ability to break out establishments by the kind of food they serve, because we have the full name and address of every restaurant. In an ideal world (i.e., one where your humble narrator has an infinite amount of time to allot to this kind of thing), we could query an API somewhere and systematically code every restaurant by type of cuisine. But in this less-than-ideal, time-pressed world, we're going to categorize restaurant type much more crudely. We'll take advantage of the fact that many restaurants use their name to announce the kind of food they serve--witness, for example, "Carino's Italian", "Asia Chinese Restaurant", and "Mi Casa Mexican Restaurant". By grouping together restaurants with the same substring in their names, we can generate the following plot:

In [8]:

def plot_restaurant_score_by_type(types):
    """ Takes a list of strings, each defining a group of restaurants that contain that 
    particular string within their title. """
    
    means, sems, labels = [], [], []
    n_types = len(types)
    
    # Generate means, CI/SEM, and labels
    for c in types:
        stores = data[data['string_id'].str.contains(c)]
        unique_stores = stores.groupby('string_id')['Score'].mean()
        n_stores = len(unique_stores)
        n_inspections = len(stores)
        std = unique_stores.std()
        means.append(unique_stores.mean())
        sems.append(stats.sem(unique_stores))
        labels.append('"' + c + '" (%d, %d)' % (n_stores, n_inspections))
    
    # Order by descending score
    plot_data = pd.DataFrame({'mean':means, 'sem':sems}, index=labels)
    plot_data = plot_data.sort('mean', ascending=True)

    # Plot
    pal = sns.color_palette("husl", len(plot_data))
    f, ax = plt.subplots(figsize=(4, 8))
    for y, (label, (mean, sem)) in enumerate(plot_data.iterrows()):
        ax.errorbar(mean, y, xerr=sem, fmt='o', color=pal[y])
    ax.set_ylim([-1, n_types])
    ax.set_yticks(np.arange(len(plot_data)))
    ax.set_yticklabels(plot_data.index, rotation=0, horizontalalignment='right')
    ax.set_xlabel('Health inspection score', fontsize=14)
    ax.set_ylabel('Restaurant name contains...', fontsize=14)
    
types = ['chin', 'mexic', 'indian', 'thai', 'italian', 'taco', 'sushi|japan', 
         'pizz', 'korea', 'burger', 'donut', 'coffee', 'bakery', 'ice cream', 
         'chicken', 'buffet', 'grill', 'bbq|barbe', 'steak']

plot_restaurant_score_by_type(types)

The values in parentheses on the y-axis labels denote the number of unique restaurants and total inspections in each group, respectively. The error bars denote the 95% confidence interval.

The plot largely corroborates what we probably already knew, with perhaps a couple of minor surprises. It's not surprising that establishments primarily serving coffee, ice cream, or pastries tend to do very well on health inspections. Presumably there aren't very many ways you can dangerously contaminate someone's coffee (assuming your barista isn't stirring milk into customers' cups with his or her index finger).

At the other end of the spectrum, the 8 restaurants with the word "buffet" in their name do pretty abysmally. Their average score of 77 is pretty close to the magical failure number of 70. Across 36 different inspections, no "buffet"-containing resturant in Austin has managed to obtain a score higher than 90 in the past 3 years. Which means that the single best "buffet" in Austin is no safer than the average McDonald's (see below). Of course, this conclusion only applies to buffet restaurants that have the word "buffet" in their name. It's entirely possible that the much larger number of all-you-can-eat restaurants not represented in this analysis are paragons of sanitary dining. So I'm not saying you should avoid buffets, period. I'm just... saying.

Also not happy winners in this analysis: restaurants serving ethnic food--with the possible exception of Indian restaurants (though the variance for Indian restaurants is high). Asian restaurants do particularly poorly; for example, "thai"-containing restaurants obtain a score of 83, on average. On a positive note, though, we can see that the 95-odd "burger"-containing establishments in this list--you know, the ones where cooks spend much of their day wading around in buckets of raw meat--tend to have very good scores (perhaps because of the perception that health inspectors are gunning for them, I don't know). So, given a choice between Thai food and burger, health code-wise, you're arguably better off with burger. Of course, in the grand scheme of things, these are not terribly large differences, and the vast majority of the time, you're going to be just fine after eating pretty much anywhere (including even those restaurants that fail inspection).

Panda Express, Austin's shining beacon of sanitary dining

Since it's easy to bin restaurants by title, something else we can do is look at the performance of major food chains. Here's what that looks like, for selected chains with major representation in Austin:

In [9]:

types = ['starbucks', 'mcdonald', 'subway', 'popeye', '7-eleven', 'domino', 
         'jamba', 'schlotzsky', 'taco shack', 'burger king', 'wendy', 
         'panda', 'chick-fil']

plot_restaurant_score_by_type(types)

Good news! It's pretty much impossible to get sick at Jamba Juice or Starbucks. Also, if you're looking for health code-friendly Asian food, Panda Express is your new best friend. And if your primary basis for deciding between drive-thrus is tiny differences in health inspection score, you may want to opt for Wendy's or Burger King over McDonald's. Otherwise, there's nothing terribly interesting here.

And on the sixth and seventh days, the inspectors rested. Especially on the seventh day.

So far we've focused on what the restaurants bring to the table (pun intended--sorry), but we can also take a look at circumstantial factors surrounding the inspections themselves. We don't have any way to identify individual inspectors, unfortunately (which would have allowed us to determine how consistently the health code is applied from person to person), but we can look at scores as a function of time--year, month, week, etc. For instance, here's mean health inspection score as a function of month of year:

In [10]:

import calendar

def plot_inspections_by_time(unit, labels, summary=False, **kwargs):
    data[unit] = data['date'].map(lambda d: labels[getattr(d, unit)][:3])
    if summary:
        print "Summary of scores:"
        print np.round(data.groupby(unit)['Score'].agg(['mean', 'count']), 1)
    f, ax = plt.subplots(figsize=(10, 5))
    return sns.pointplot(unit, "Score", data=data, x_order=[n[:3] for n in labels[1:]], **kwargs)
    
_ = plot_inspections_by_time('month', calendar.month_name, True).set_ylim(82, 96)

Summary of scores:
       mean  count
month             
Apr    91.6   1856
Aug    91.5   1711
Dec    91.6   1478
Feb    91.2   1495
Jan    90.9   1739
Jul    91.2   1512
Jun    91.3   1608
Mar    91.7   1472
May    91.7   1710
Nov    91.9   1651
Oct    91.8   1816
Sep    91.7   1406

[12 rows x 2 columns]

The error bars display the the bootstrapped 95% confidence interval for each point. You could maybe make a case that, say, restaurants perform worse in January (mean = 90.9) than in November (91.9). But it would be a pretty weak case, and would probably have little practical value in any case (note the limited range of the y axis).

Somewhat more interesting is this:

In [11]:

# Plot by weekday
_ = plot_inspections_by_time('dayofweek', calendar.day_name, True).set_xlabel('Day of week')

Summary of scores:
           mean  count
dayofweek             
Fri        92.1   3114
Mon        91.4   3108
Sat        92.9    137
Sun        88.6     12
Thu        91.5   4454
Tue        91.4   4670
Wed        91.3   3959

[7 rows x 2 columns]

Here we see scores plotted by day of the week on which the inspection occurred. Before you get excited, this plot is not saying that restaurant hygiene levels improve progressively through the week before plummeting on Sunday. What it is saying is that many fewer inspections take place on Saturday and Sunday--especially Sunday (only 12 inspections over 3 years!). That's why the error bars for the weekend days are so huge: we can't estimate the actual score very well, because we don't have enough data. We can see this more clearly by plotting the counts directly:

In [12]:

day_names = list(calendar.day_name)
data["weekday"] = data["date"].map(lambda d: day_names[d.weekday()])
sns.barplot("weekday", data=data, x_order=day_names, palette="PuBuGn_d");

This is potentially interesting from a public health standpoint, inasmuch as (a) restaurants do some of their best business on weekends, so ignoring the weekends means inspectors don't have a chance to see what the kitchen looks like when things get hectic, and (b) if you're an unscrupulous proprietor, the fact that inspectors almost never show up on the weekend arguably allows you to worry a little bit less about food sanitation between Friday night and Monday morning. Given that food inspectors probably already have relatively non-standard hours (I imagine many if not most restaurant visits occur after 5 pm), and the element of surprise would seem to be an important one, the infrequency of weekend inspections seems somewhat surprising. But I'm not a public health official, so I imagine there's probably a good reason for this (I mean, aside from just not wanting to pay health inspectors more--which is, I suppose, a pretty good reason).

The unreasonable effectiveness of "clean up, or else"

An obvious question to ask is: do health inspections actually work? That is, do restaurants shape up in the days immediately following a failed inspection? The answer here is (fortunately!) a resounding yes:

In [13]:

# Get pairs of consecutive inspections, with columns for difference in score and days elapsed
new_data = data.copy().sort(['id', 'date'])
new_data['orig_id'] = new_data.index
shifted = new_data[['Score', 'date']].shift(1)
firsts = new_data.groupby('id', sort=False).first()['orig_id']
shifted.loc[firsts] = np.nan
new_data[['shifted_score', 'shifted_date']] = shifted
new_data = new_data.dropna()
new_data['score_diff'] = new_data['Score'] - new_data['shifted_score']
new_data['date_diff'] = (new_data['date'] - new_data['shifted_date']).astype('timedelta64[D]').astype(int)

# Map inspection types to some ugly-ass colors
processes = list(new_data['Process Description'].unique())
colordict = dict(zip(processes, sns.color_palette("muted", 3)))

markers = [] # Track layers for legend

# Loop over inspection types, print stats, add to marker list
print "Correlation between time since last inspection and change in score, by inspection type:"
for kind, color in colordict.items():
    tmp = new_data[new_data['Process Description'] == kind]
    r = tmp.corr().iloc[0,1]
    mean_change = tmp['score_diff'].mean()
    print "\t[%s]: r = %.2f, n = %d, mean change in score = %.1f" % (kind, r, len(tmp), mean_change)
    p = plt.scatter(tmp['date_diff'], tmp['score_diff'], alpha=0.6, color=color, s=30)
    markers.append(p)

plt.xlim([-5, 500])  # Truncate to reasonable limits
plt.ylim([-40, 40])
plt.xlabel("Days since last inspection", fontsize=14)
plt.ylabel("Change in score from last inspection", fontsize=14)
legend = plt.legend(markers, colordict.keys(), frameon=True, fontsize=12)
frame = legend.get_frame()
frame.set_facecolor("0.9")
frame.set_edgecolor("0.3")

Correlation between time since last inspection and change in score, by inspection type:
	[2nd Follow Up to 50 - 69]: r = -0.82, n = 3, mean change in score = 21.7
	[1st Follow Up to Routine of 50 - 69]: r = -0.00, n = 78, mean change in score = 21.4
	[Routine Inspection]: r = 0.04, n = 14968, mean change in score = -0.1

Here, for every pair of consecutive inspections in the dataset, we're plotting the number of days elapsed since the previous inspection (x-axis) against the change in score between inspections (y-axis). The cases are coloring by the kind of inspection (i.e., routine inspections vs. follow-up to earlier failed inspection). Note that follow-ups are invariably conducted within about a month of a failed inspection. More importantly, there's almost always a substantial improvement in score for these inspections (mean improvement of > 20 points), which is a good sign--and perhaps not surprising, given that repeated health code violations are a quick way for a restaurant to find its doors closed (a cynical statistician might also grumble about regression to the mean, but we'll let that go).

A related question the plot above lets us answer is whether there's any relationship between time since last inspection and change in score. Intuitively, you might think that maybe restaurants tend to "slip" over time, so that restaurants would do better if they were randomly re-inspected at fairly short intervals. But this doesn't appear to be the case; the correlation between the x and y dimensions for routine inspections is -.04, which is in the predicted direction, but probably not strong enough to write home about. So it doesn't look like more frequent inspections keep restaurants honest--though presumably they do allow inspectors to catch violators more quickly.

What else?

The health inspection dataset has its limits; as I mentioned above, the inspection records don't include some potentially useful features, like who carried out each inspection, how long the inspection lasted, what each restaurant got dinged for, and so on. Still, there are plenty of other questions we could ask with the existing data, given enough time and energy (want to contribute? Clone the open data flights repo!). For example:

Do health restaurant scores vary systematically across different parts of Austin?
Do health inspectors inspect a geographically random set of restaurants every day, or sample in a spatially predictable way?
To what extent can we predict when a given restaurant is likely to be inspected based on everything else we know (i.e., when the restaurant was last inspected, what other restaurants are being inspected that day, etc.)?
Are scores given to different restaurants on the same day more similar than one would expect? I.e., are there bad or good inspection days?
How well can we predict what score a given restaurant will receive if we (a) train and test on different sets of restaurants (i.e., each restaurant's inspections are all assigned to only the training or the testing set) and (b) train and test on different inspections (i.e., each restaurant has inspections randomly assigned to both the training and testing sets)?

In a nutshell...

So, summing up, what have we learned from this exercise? In no particular order:

Health inspections seem to work. When restaurants get caught with their (employees') hands in the cookie jar, they clean up the mess quickly.
Health inspectors don't seem to work--on Saturdays and Sundays, that is.
Health inspection scores are moderately stable across fairly long time intervals.
Fewer restaurants fail inspection than one would expect, for reasons that are unclear.
It's hard to get food poisoning from coffee and ice cream.
Avoid the buffet!