Let's walk through the process of identifying a data quality problem and then correcting it through NYPL's own interfaces. For the purposes of this experiment, I'll use some data that represents clusters of similar dish names that begin with characters other than ASCII letters …
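A note on the shape of this fixture before we load it: judging from the lookups later in this walkthrough, each key is a normalized "fingerprint" of a dish name and each value is a list of item records. A sketch of the assumed structure (EXAMPLE_SHAPE is my own name; the record shown is one we'll encounter below):

# Assumed shape of the fixture, inferred from the lookups further down:
EXAMPLE_SHAPE = {
    '00 1 crabflakes salad': [
        {'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'],
         'menus_appeared': 1,
         'times_appeared': 1,
         'name': 'Crabflakes Salad 1.00',
         'page_uri': ['http://menus.nypl.org/menu_pages/63015'],
         'dish_uri': 'http://menus.nypl.org/dishes/361169'},
    ],
}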
SOURCE_FILE = '/Users/libraries/Code/menus-site/data/dishes/fixtures_by_alpha/_nonascii-fixture_data-reshaped_with_items.json'
import json

with open(SOURCE_FILE, 'r') as infile:
    OUTLIER_DATA = json.loads(infile.read())

len(OUTLIER_DATA.keys())
31631
import re

def starts_with_number(obj):
    return re.match(r'[0-9]', obj) is not None

NUMBER_OUTLIER_DATA = {k: v for k, v in OUTLIER_DATA.items() if starts_with_number(k)}
From previous exploration of the data, we know that some dishes are listed in the database even though they have a value of 0 for times_appeared. Let's filter those out.
def find_nonnull_appearances(key):
    resultlist = NUMBER_OUTLIER_DATA[key]
    filtered_resultlist = [n for n in resultlist if n['times_appeared'] != 0]
    return (key, filtered_resultlist)

APPEARS_1_OR_MORE = {k: v for k, v in NUMBER_OUTLIER_DATA.items() if find_nonnull_appearances(k)[1] != []}
len(APPEARS_1_OR_MORE.keys())
30671
Already in this set, I see two patterns of errors that I would like to investigate further. Let's start by investigating the double zeros …
DOUBLE_0_START = {k: v for k,v in APPEARS_1_OR_MORE.items() if k.startswith('00')}
len(DOUBLE_0_START.keys())
685
By filtering for fingerprint values that start with numbers, selecting a particular pattern of "errors" from within that set, and eliminating dishes that appear 0 times, we come out with a much smaller set of possible test cases. Let's take a random sample of these …
import random

TEST_CASES = dict(random.sample(list(DOUBLE_0_START.items()), 20))
print(TEST_CASES.keys())
dict_keys(['00 1 2 champagne heidsieck pints', '00 2 cologne d from grunhauser leiden moselle', '00 1 crabflakes salad', '00 1 dinner', '00 1 charge person room service', '00 28 beetesauce gedämpft in lachsschnitte mangold petersilienkartoffeln rote', '00 19 duck pie s shepherd', '00 3 arlequin de mignon veau', '00 1825 2 and pale pts qts sherry yriarte', '00 1847 3 and chateau claret lafitte pts qts', '00 1836 3 imported in judge m madeira rich s story', '00 1 fresh hawaiian pineapple', '00 1844 3 and chateau claret margaux pts qts', '00 3 chicken fresh killed milkfed roast', '00 1 25 bacon fried remoulade scallops with', '00 14 20 and beans fava lobster radicchio risotto savory with', '00 58 99 balthazar de fruits grand le mer plateaux', '00 1 50 box cigarettes franc large per size smallsize', '00 1841 3 bottled castle chateau claret lafitte', '00 10 and cheese cream nova salmon scotia smoked'])
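The shape of these keys ('00 1 crabflakes salad' for "Crabflakes Salad 1.00") suggests they were produced by something like OpenRefine's fingerprint clustering method: lowercase the name, replace punctuation with spaces, then sort and deduplicate the tokens. That's an assumption on my part, but a minimal sketch reproduces the keys we see:

import re
import unicodedata

def fingerprint(name):
    """A guess at the key-generation scheme: lowercase, strip punctuation,
    then sort and deduplicate the tokens (the OpenRefine "fingerprint")."""
    normalized = unicodedata.normalize('NFKD', name.lower().strip())
    normalized = re.sub(r'[^\w\s]', ' ', normalized)
    return ' '.join(sorted(set(normalized.split())))

fingerprint('Crabflakes Salad 1.00')
'00 1 crabflakes salad'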
TEST_CASES['00 1 crabflakes salad']
[{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'], 'menus_appeared': 1, 'times_appeared': 1, 'name': 'Crabflakes Salad 1.00', 'page_uri': ['http://menus.nypl.org/menu_pages/63015'], 'dish_uri': 'http://menus.nypl.org/dishes/361169'}]
Here's the image link in the first of those item pages: http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:bcb737e9-be35-80f1-e040-e00a180630eb&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.rotate=0&svc.region=2306,1222,438,2085&svc.scale=950,200
IMG_URL = 'http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:bcb737e9-be35-80f1-e040-e00a180630eb&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.rotate=0&svc.region=2306,1222,438,2085&svc.scale=950,200'
from PIL import Image
import requests

img_req = requests.get(IMG_URL, stream=True)
if img_req.status_code == 200:
    with open('/tmp/menu_data_downloads/menu_slice.jpg', 'wb') as savefile:
        for chunk in img_req.iter_content():
            savefile.write(chunk)

im = Image.open('/tmp/menu_data_downloads/menu_slice.jpg')
# copied from http://nbviewer.ipython.org/gist/deeplook/5162445
from io import BytesIO
from IPython.core import display

def display_pil_image(im):
    """Displayhook function for PIL Images, rendered as PNG."""
    b = BytesIO()
    im.save(b, format='png')
    data = b.getvalue()
    ip_img = display.Image(data=data, format='png', embed=True)
    return ip_img._repr_png_()

# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)
im
Now we can indeed see the problem — the price has been tacked onto the end of the dish name
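This suggests a whole class of fixes: any name ending in something that parses as a price probably belongs to this error pattern. A rough sketch (the regex is my own guess at what counts as a trailing price, not anything NYPL defines):

PRICE_SUFFIX = re.compile(r'\s+\d+\.\d{2}$')

def strip_price_suffix(name):
    """Remove a trailing price like ' 1.00' from a dish name."""
    return PRICE_SUFFIX.sub('', name)

strip_price_suffix('Crabflakes Salad 1.00')
'Crabflakes Salad'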
import os
Now, let's watch what happens through the API when we manually go and correct this error …
First, how many total dishes do we have at time t0?
payload = {"token" : os.environ['MENUS_API_KEY']}
dish_count = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats = json.loads(dish_count.content.decode())['stats']
print(stats)
{'count': 410997}
Now let's see the current API responses for the dish above …
DISH = TEST_CASES['00 1 crabflakes salad'][0]
print(DISH)
{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'], 'menus_appeared': 1, 'times_appeared': 1, 'name': 'Crabflakes Salad 1.00', 'page_uri': ['http://menus.nypl.org/menu_pages/63015'], 'dish_uri': 'http://menus.nypl.org/dishes/361169'}
import datetime
import time
target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t0 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t0.status_code)
resp_t0 = json.loads(req_t0.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t0 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:22:00.220319
200
print(json.dumps(resp_t0, indent=2))
{ "last_appeared": 1933, "first_appeared": 1933, "links": [ { "href": "http://menus.nypl.org/api/dishes", "rel": "index" }, { "href": "http://menus.nypl.org/api/dishes/361169/menus", "rel": "menus" } ], "times_appeared": 1, "highest_price": null, "description": null, "menus_appeared": 1, "id": 361169, "lowest_price": null, "name": "Crabflakes Salad 1.00" }
Now, I'm going to go to http://menus.nypl.org/menu_items/944194/edit and manually update the value to see what happens when "Crabflakes Salad 1.00" becomes "Crabflakes Salad"
target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t1 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t1.status_code)
resp_t1 = json.loads(req_t1.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t1 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:25:28.373109
404
So, now the same API request we executed a few minutes ago seems to 404 …
print(json.dumps(resp_t1, indent=2))
{ "error": "Dish Not Found" }
Let's go look at the webpage that loads after we make our edit…
from IPython.display import HTML
HTML('<iframe src=http://menus.nypl.org/menu_pages/63015?useformat=mobile width=700 height=600></iframe>')
Now "Crabflakes Salad" links to http://menus.nypl.org/dishes/289940
… Let's check that out via the API
NEW_URI = 'http://menus.nypl.org/dishes/289940'
target_path = re.split('/', NEW_URI, maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t2 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t2.status_code)
resp_t2 = json.loads(req_t2.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t2 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/289940
2014-10-11 14:44:49.779602
200
print(json.dumps(resp_t2, indent=2))
{ "last_appeared": 1933, "first_appeared": 1933, "links": [ { "href": "http://menus.nypl.org/api/dishes", "rel": "index" }, { "href": "http://menus.nypl.org/api/dishes/289940/menus", "rel": "menus" } ], "times_appeared": 60, "highest_price": "$3.25", "description": null, "menus_appeared": 60, "id": 289940, "lowest_price": "$0.90", "name": "Crabflakes Salad" }
dish_count_t2 = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats_t2 = json.loads(dish_count_t2.content.decode())['stats']
print(stats_t2)
{'count': 410997}
No "new" dish has appeared, so it looks like this has quickly gotten lumped in with other appearances of "Crabflakes Salad"? Obviously, I can't go back in time to check but I can inspect the same information from the latest published data dump …
import pandas as pd
!ls /tmp/menu_data_downloads/2014_10_01/
Dish.csv Menu.csv MenuItem.csv MenuPage.csv
OCT_1_DATA_DF = pd.read_csv('/tmp/menu_data_downloads/2014_10_01/Dish.csv', index_col='id')
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 361169]
| id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|
| 361169 | Crabflakes Salad 1.00 | NaN | 1 | 1 | 1933 | 1933 | 0 | 0 |
There's our original entry that we just changed …
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 289940]
| id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|
| 289940 | Crabflakes Salad | NaN | 59 | 59 | 1933 | 1933 | 0.9 | 3.25 |
Yep! Now we can see the difference — in the static version of the data there were only 59 appearances of "Crabflakes Salad" but the API now returns "times_appeared": 60
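This kind of drift between the static dump and the live API is exactly what a curation workflow needs to watch for. A hedged sketch of such a check (compare_dump_to_api is my own name; it just restates the two lookups we performed above):

def compare_dump_to_api(dish_id, dump_df, payload):
    """Compare times_appeared in the static dump against the live API
    for one dish, flagging ids that have since disappeared."""
    uri = 'http://api.menus.nypl.org/dishes/{0}'.format(dish_id)
    resp = requests.get(uri, params=payload)
    if resp.status_code == 404:
        return (dish_id, 'missing from API')
    live = json.loads(resp.content.decode())
    dumped = int(dump_df.loc[dish_id, 'times_appeared'])
    return (dish_id, dumped, live['times_appeared'])

compare_dump_to_api(289940, OCT_1_DATA_DF, payload)  # -> (289940, 59, 60), per the lookups above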
This makes tracking and managing updates a bit more complicated because instead of:
from IPython.display import Image as IPythonImage  # aliased to avoid shadowing PIL's Image

embed1 = IPythonImage('menu_data_updates1.png')
embed1
We get:
embed2 = IPythonImage('menu_data_updates2.png')
embed2