Let's walk through the process of identifying a data quality problem and then correcting it through NYPL's own interfaces. For the purposes of this experiment, I'll use some data that represents clusters of similar dish names that begin with characters other than ASCII letters …
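A note on the shape of this fixture before we load it: judging from the lookups later in this walkthrough, each key is a normalized "fingerprint" of a dish name and each value is a list of item records. A sketch of the assumed structure (EXAMPLE_SHAPE is my own name; the record shown is one we'll encounter below):

# Assumed shape of the fixture, inferred from the lookups further down:
EXAMPLE_SHAPE = {
    '00 1 crabflakes salad': [
        {'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'],
         'menus_appeared': 1,
         'times_appeared': 1,
         'name': 'Crabflakes Salad 1.00',
         'page_uri': ['http://menus.nypl.org/menu_pages/63015'],
         'dish_uri': 'http://menus.nypl.org/dishes/361169'},
    ],
}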
SOURCE_FILE = '/Users/libraries/Code/menus-site/data/dishes/fixtures_by_alpha/_nonascii-fixture_data-reshaped_with_items.json'
import json

with open(SOURCE_FILE, 'r') as infile:
    OUTLIER_DATA = json.loads(infile.read())

len(OUTLIER_DATA.keys())
31631
import re

def starts_with_number(obj):
    return re.match(r'[0-9]', obj) is not None

NUMBER_OUTLIER_DATA = {k: v for k, v in OUTLIER_DATA.items() if starts_with_number(k)}
From previous exploration of the data, we know that some dishes are listed in the database even though they have a value of 0 for times_appeared. Let's filter those out.
def find_nonnull_appearances(key):
    resultlist = NUMBER_OUTLIER_DATA[key]
    filtered_resultlist = [n for n in resultlist if n['times_appeared'] != 0]
    return (key, filtered_resultlist)

APPEARS_1_OR_MORE = {k: v for k, v in NUMBER_OUTLIER_DATA.items() if find_nonnull_appearances(k)[1] != []}
len(APPEARS_1_OR_MORE.keys())
30671
Already in this set, I see two patterns of errors that I would like to investigate further. Let's start by investigating the double zeros …
DOUBLE_0_START = {k: v for k,v in APPEARS_1_OR_MORE.items() if k.startswith('00')}
len(DOUBLE_0_START.keys())
685
By filtering for fingerprint values that start with numbers, selecting a particular pattern of "errors" from within that set, and eliminating dishes that appear 0 times, we come out with a much smaller set of possible test cases. Let's take a random sample of these …
import random

TEST_CASES = dict(random.sample(list(DOUBLE_0_START.items()), 20))
print(TEST_CASES.keys())
dict_keys(['00 1 2 champagne heidsieck pints', '00 2 cologne d from grunhauser leiden moselle', '00 1 crabflakes salad', '00 1 dinner', '00 1 charge person room service', '00 28 beetesauce gedämpft in lachsschnitte mangold petersilienkartoffeln rote', '00 19 duck pie s shepherd', '00 3 arlequin de mignon veau', '00 1825 2 and pale pts qts sherry yriarte', '00 1847 3 and chateau claret lafitte pts qts', '00 1836 3 imported in judge m madeira rich s story', '00 1 fresh hawaiian pineapple', '00 1844 3 and chateau claret margaux pts qts', '00 3 chicken fresh killed milkfed roast', '00 1 25 bacon fried remoulade scallops with', '00 14 20 and beans fava lobster radicchio risotto savory with', '00 58 99 balthazar de fruits grand le mer plateaux', '00 1 50 box cigarettes franc large per size smallsize', '00 1841 3 bottled castle chateau claret lafitte', '00 10 and cheese cream nova salmon scotia smoked'])
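The shape of these keys ('00 1 crabflakes salad' for "Crabflakes Salad 1.00") suggests they were produced by something like OpenRefine's fingerprint clustering method: lowercase the name, replace punctuation with spaces, then sort and deduplicate the tokens. That's an assumption on my part, but a minimal sketch reproduces the keys we see:

import re
import unicodedata

def fingerprint(name):
    """A guess at the key-generation scheme: lowercase, strip punctuation,
    then sort and deduplicate the tokens (the OpenRefine "fingerprint")."""
    normalized = unicodedata.normalize('NFKD', name.lower().strip())
    normalized = re.sub(r'[^\w\s]', ' ', normalized)
    return ' '.join(sorted(set(normalized.split())))

fingerprint('Crabflakes Salad 1.00')
'00 1 crabflakes salad'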
TEST_CASES['00 1 crabflakes salad']
[{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'], 'menus_appeared': 1, 'times_appeared': 1, 'name': 'Crabflakes Salad 1.00', 'page_uri': ['http://menus.nypl.org/menu_pages/63015'], 'dish_uri': 'http://menus.nypl.org/dishes/361169'}]
Here's the image link in the first of those item pages: http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:bcb737e9-be35-80f1-e040-e00a180630eb&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.rotate=0&svc.region=2306,1222,438,2085&svc.scale=950,200
IMG_URL = 'http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:bcb737e9-be35-80f1-e040-e00a180630eb&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.rotate=0&svc.region=2306,1222,438,2085&svc.scale=950,200'
from PIL import Image
import requests

img_req = requests.get(IMG_URL, stream=True)
if img_req.status_code == 200:
    with open('/tmp/menu_data_downloads/menu_slice.jpg', 'wb') as savefile:
        for chunk in img_req.iter_content():
            savefile.write(chunk)

im = Image.open('/tmp/menu_data_downloads/menu_slice.jpg')
# copied from http://nbviewer.ipython.org/gist/deeplook/5162445
from io import BytesIO
from IPython.core import display

def display_pil_image(im):
    """Displayhook function for PIL Images, rendered as PNG."""
    b = BytesIO()
    im.save(b, format='png')
    data = b.getvalue()
    ip_img = display.Image(data=data, format='png', embed=True)
    return ip_img._repr_png_()

# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)
im
Now we can indeed see the problem — the price has been tacked onto the end of the dish name
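This suggests a whole class of fixes: any name ending in something that parses as a price probably belongs to this error pattern. A rough sketch (the regex is my own guess at what counts as a trailing price, not anything NYPL defines):

PRICE_SUFFIX = re.compile(r'\s+\d+\.\d{2}$')

def strip_price_suffix(name):
    """Remove a trailing price like ' 1.00' from a dish name."""
    return PRICE_SUFFIX.sub('', name)

strip_price_suffix('Crabflakes Salad 1.00')
'Crabflakes Salad'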
import os
Now, let's watch what happens through the API when we manually go and correct this error …
First, how many total dishes do we have at time t0?
payload = {"token" : os.environ['MENUS_API_KEY']}
dish_count = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats = json.loads(dish_count.content.decode())['stats']
print(stats)
{'count': 410997}
Now let's see the current API responses for the dish above …
DISH = TEST_CASES['00 1 crabflakes salad'][0]
print(DISH)
{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'], 'menus_appeared': 1, 'times_appeared': 1, 'name': 'Crabflakes Salad 1.00', 'page_uri': ['http://menus.nypl.org/menu_pages/63015'], 'dish_uri': 'http://menus.nypl.org/dishes/361169'}
import datetime
import time
target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t0 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t0.status_code)
resp_t0 = json.loads(req_t0.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t0 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:22:00.220319
200
print(json.dumps(resp_t0, indent=2))
{ "last_appeared": 1933, "first_appeared": 1933, "links": [ { "href": "http://menus.nypl.org/api/dishes", "rel": "index" }, { "href": "http://menus.nypl.org/api/dishes/361169/menus", "rel": "menus" } ], "times_appeared": 1, "highest_price": null, "description": null, "menus_appeared": 1, "id": 361169, "lowest_price": null, "name": "Crabflakes Salad 1.00" }
Now, I'm going to go to http://menus.nypl.org/menu_items/944194/edit and manually update the value to see what happens when "Crabflakes Salad 1.00" becomes "Crabflakes Salad"
target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t1 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t1.status_code)
resp_t1 = json.loads(req_t1.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t1 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:25:28.373109
404
So, now the same API request we executed a few minutes ago seems to 404 …
print(json.dumps(resp_t1, indent=2))
{ "error": "Dish Not Found" }
Let's go look at the webpage that loads after we make our edit…
from IPython.display import HTML
HTML('<iframe src=http://menus.nypl.org/menu_pages/63015?useformat=mobile width=700 height=600></iframe>')
Now "Crabflakes Salad" links to http://menus.nypl.org/dishes/289940
… Let's check that out via the API
NEW_URI = 'http://menus.nypl.org/dishes/289940'
target_path = re.split('/', NEW_URI, maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])
req_t2 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t2.status_code)
resp_t2 = json.loads(req_t2.content.decode())
time.sleep(0.5)
#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t2 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/289940
2014-10-11 14:44:49.779602
200
print(json.dumps(resp_t2, indent=2))
{ "last_appeared": 1933, "first_appeared": 1933, "links": [ { "href": "http://menus.nypl.org/api/dishes", "rel": "index" }, { "href": "http://menus.nypl.org/api/dishes/289940/menus", "rel": "menus" } ], "times_appeared": 60, "highest_price": "$3.25", "description": null, "menus_appeared": 60, "id": 289940, "lowest_price": "$0.90", "name": "Crabflakes Salad" }
dish_count_t2 = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats_t2 = json.loads(dish_count_t2.content.decode())['stats']
print(stats_t2)
{'count': 410997}
No "new" dish has appeared, so it looks like this has quickly gotten lumped in with other appearances of "Crabflakes Salad"? Obviously, I can't go back in time to check but I can inspect the same information from the latest published data dump …
import pandas as pd
!ls /tmp/menu_data_downloads/2014_10_01/
Dish.csv Menu.csv MenuItem.csv MenuPage.csv
OCT_1_DATA_DF = pd.read_csv('/tmp/menu_data_downloads/2014_10_01/Dish.csv', index_col='id')
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 361169]
| id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|
| 361169 | Crabflakes Salad 1.00 | NaN | 1 | 1 | 1933 | 1933 | 0 | 0 |
There's our original entry that we just changed …
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 289940]
| id | name | description | menus_appeared | times_appeared | first_appeared | last_appeared | lowest_price | highest_price |
|---|---|---|---|---|---|---|---|---|
| 289940 | Crabflakes Salad | NaN | 59 | 59 | 1933 | 1933 | 0.9 | 3.25 |
Yep! Now we can see the difference — in the static version of the data there were only 59 appearances of "Crabflakes Salad" but the API now returns "times_appeared": 60
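This kind of drift between the static dump and the live API is exactly what a curation workflow needs to watch for. A hedged sketch of such a check (compare_dump_to_api is my own name; it just restates the two lookups we performed above):

def compare_dump_to_api(dish_id, dump_df, payload):
    """Compare times_appeared in the static dump against the live API
    for one dish, flagging ids that have since disappeared."""
    uri = 'http://api.menus.nypl.org/dishes/{0}'.format(dish_id)
    resp = requests.get(uri, params=payload)
    if resp.status_code == 404:
        return (dish_id, 'missing from API')
    live = json.loads(resp.content.decode())
    dumped = int(dump_df.loc[dish_id, 'times_appeared'])
    return (dish_id, dumped, live['times_appeared'])

compare_dump_to_api(289940, OCT_1_DATA_DF, payload)  # -> (289940, 59, 60), per the lookups above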
This makes tracking and managing updates a bit more complicated because instead of:
from IPython.display import Image as IPythonImage  # aliased to avoid shadowing PIL's Image

embed1 = IPythonImage('menu_data_updates1.png')
embed1
We get:
embed2 = IPythonImage('menu_data_updates2.png')
embed2