import pandas as pd
%pylab inline
Populating the interactive namespace from numpy and matplotlib
restaurant_ratings = pd.read_csv('dumps/munged_results.csv')
The goal of this project is to observe and, if one exists, model the relationship between a restaurant's sanitary rating from New York City's Department of Health and Mental Hygiene (DOHMH) and its FourSquare rating. The data and models developed for this project could also be used to identify any relationships between a restaurant's cuisine and its rating.
The primary predictors for this project will be a restaurant's sanitary rating and its cuisine. Not all records in the dataset have FourSquare ratings.
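Because some records lack a FourSquare rating, it helps to check how many are missing before modeling. The following is a minimal sketch on a toy frame; the rows and the names `toy`, `n_missing`, and `rated` are made up for illustration and are not taken from the real dump.

```python
import numpy as np
import pandas as pd

# Toy stand-in for restaurant_ratings; values are invented for illustration.
toy = pd.DataFrame({
    "DBA": ["A", "B", "C", "D"],
    "GRADE": ["A", "B", "A", "C"],
    "foursquare_rating": [8.1, np.nan, 6.4, np.nan],
})

# Count the rows that have no FourSquare rating.
n_missing = int(toy["foursquare_rating"].isnull().sum())

# Keep only the rows that carry a rating and can be used for modeling.
rated = toy.dropna(subset=["foursquare_rating"])

print(n_missing, len(rated))
```

Whether to drop or impute the unrated rows is a modeling choice; dropping is simply the most direct option.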
Features in this dataset include:
I chose these features because I believe that for certain cuisines, customers are willing to forgive lower restaurant ratings.
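One way to probe this belief is to compare the mean FourSquare rating per cuisine and sanitary grade. The sketch below uses invented numbers and reuses the column names from the dataset; `toy`, `mean_by_group`, and `drop` are hypothetical names, and the figures say nothing about the real data.

```python
import pandas as pd

# Invented ratings for two cuisines at two sanitary grades.
toy = pd.DataFrame({
    "CUISINE DESCRIPTION": ["Pizza", "Pizza", "Chinese", "Chinese"],
    "GRADE": ["A", "B", "A", "B"],
    "foursquare_rating": [8.0, 7.5, 8.0, 6.0],
})

# Mean FourSquare rating for each (cuisine, grade) pair, one column per grade.
mean_by_group = (
    toy.groupby(["CUISINE DESCRIPTION", "GRADE"])["foursquare_rating"]
    .mean()
    .unstack("GRADE")
)

# How much the mean rating falls when the grade goes from A to B, per cuisine.
drop = mean_by_group["A"] - mean_by_group["B"]
print(drop)
```

A smaller drop for a cuisine would be consistent with customers being more forgiving for it, though a comparison of means alone does not establish that.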
restaurant_ratings.columns.values
array(['Unnamed: 0', 'CAMIS', 'DBA', 'BORO', 'BUILDING', 'STREET', 'ZIPCODE', 'PHONE', 'CUISINE DESCRIPTION', 'INSPECTION DATE', 'ACTION', 'VIOLATION CODE', 'VIOLATION DESCRIPTION', 'CRITICAL FLAG', 'SCORE', 'GRADE', 'GRADE DATE', 'RECORD DATE', 'INSPECTION TYPE', 'foursquare_id', 'foursquare_rating', 'foursquare_category', 'foursquare_num_of_users'], dtype=object)
restaurant_ratings['foursquare_rating'].hist()
<matplotlib.axes.AxesSubplot at 0x106f8e210>
subset_ratings = restaurant_ratings[0:50]
grade_dummy_features = pd.get_dummies(restaurant_ratings['GRADE'])
restaurant_ratings = restaurant_ratings.join(grade_dummy_features)
cuisine_dummary_features = pd.get_dummies(restaurant_ratings['CUISINE DESCRIPTION'])
foursquare_cuisine_dummy_features = pd.get_dummies(restaurant_ratings['foursquare_category'])
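The cells above join only the grade dummies. If the cuisine dummies are joined as well, column names from different sources could collide, so a `prefix` is one possible safeguard. This is a sketch on a toy frame, not the notebook's own method; `toy`, `grade_dummies`, `cuisine_dummies`, `encoded`, and the prefixes are assumptions.

```python
import pandas as pd

# Toy stand-in with the two categorical columns used above.
toy = pd.DataFrame({
    "GRADE": ["A", "B", "A"],
    "CUISINE DESCRIPTION": ["Pizza", "Chinese", "Pizza"],
})

# One indicator column per category; the prefix keeps the names distinct.
grade_dummies = pd.get_dummies(toy["GRADE"], prefix="grade")
cuisine_dummies = pd.get_dummies(toy["CUISINE DESCRIPTION"], prefix="cuisine")

# Join on the shared row index, as the notebook does for the grade dummies.
encoded = toy.join(grade_dummies).join(cuisine_dummies)
print(list(encoded.columns))
```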
In preparing the data for this project, a few things had to be done:
import sklearn
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
train_test_split??
rr_train, rr_test = train_test_split(restaurant_ratings)
rr_train
array([[9710, 40372618, 'BURGER KING', ..., 0.0, 0.0, 0.0], [525727, 50016018, 'DON PANCHO VILLA RESTAURANT', ..., 0.0, 1.0, 0.0], [509346, 50005216, 'BROOKLYN NIGHTS', ..., 0.0, 0.0, 0.0], ..., [483871, 50000100, 'NSE FIFTY SIX', ..., 0.0, 0.0, 0.0], [519932, 50011045, 'CHINA WOK KING CORP', ..., 0.0, 0.0, 0.0], [286575, 41436866, "MCDONALD'S", ..., 0.0, 0.0, 0.0]], dtype=object)
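The split above uses the defaults, so the partition changes on every run. A reproducible variant might look like the sketch below, assuming a recent scikit-learn, where `train_test_split` lives in `sklearn.model_selection` and returns DataFrames when given one. The toy frame and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for restaurant_ratings with eight rows.
toy = pd.DataFrame({
    "SCORE": list(range(8)),
    "foursquare_rating": [float(i) for i in range(8)],
})

# Hold out 25% of the rows; a fixed random_state makes the split repeatable.
train, test = train_test_split(toy, test_size=0.25, random_state=0)

print(len(train), len(test))
```

Keeping the result as a DataFrame preserves the column names, which makes it easier to select features later than with a raw object array.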