As pointed out by [this](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0) New York Times article, data wrangling, or "data janitor work", takes from 50 to 80 percent of the time spent on a data science project. After downloading a sample of the São Paulo (Brazil) area, where I live, and starting to clean it, I believe I understand what they mean by that.
import os
import pprint

# helper module with the cleaning and reshaping functions used throughout this post
import data_wrangling as reshape
from data_wrangling import *
reload(reshape);
The first thing I had to do was adapt the code provided in the classes to my data. For instance, the street type (Street, Avenue and so on) is located at the beginning of the street name in Brazil, not at the end as in the sample we used in the course. Also, as in the sample provided, the street type is usually abbreviated.
As streets in Brazil are usually named after people, another common feature of street names is a "title name" right after the street type, as described in the documentation of the São Paulo street names data set provided by the Center for Metropolitan Studies* of São Paulo (CMS). For example, in "Captain so-and-so Street", "Captain" would be the title name. Like street types, title names are usually abbreviated as well.
*A Brazilian think tank that researches the role of public policy in reducing poverty and inequality.
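To make this structure concrete, here is a minimal illustrative sketch (not the actual code in data_wrangling.py) of a regular expression that isolates the street type at the beginning of a name:

```python
# -*- coding: utf-8 -*-
import re

# in Brazil the street type comes first; a possible title name comes right after it
street_re = re.compile(r'^(?P<type>\S+)\s+(?P<rest>.+)$', re.UNICODE)

match = street_re.match(u'Av. Prof. Lúcio Martins Rodrigues')
if match:
    print match.group('type')  # Av.  -> should be expanded to "Avenida"
    print match.group('rest')  # Prof. Lúcio Martins Rodrigues ("Prof." is an abbreviated title name)
```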
Let's start exploring the data set. First, I checked the size of the file I had to deal with:
# CMS street names data set (DBF), used at the end of this post
fr_dbf = "/Users/ucaiado/Documents/temp/LOG2014_CEM_RMSP.DBF"
# OSM extract of the São Paulo area
filename = '/Users/ucaiado/Dropbox/NEUTRINO/ALGO/DATA/sao-paulo_brazil.osm'
print "{} MB".format(os.stat(filename).st_size/1024**2)
315 MB
Then, I checked the tags found in the XML file, some potential problems, and the number of unique users who contributed to the data set, to get a feel for what I would have to do.
%time reshape.initial_tests(filename)
# of tags: {'bounds': 1, 'member': 38702, 'nd': 1826215, 'node': 1475441, 'osm': 1, 'relation': 4516, 'tag': 584178, 'way': 198120}
# of potential problems on the data: {'lower': 554488, 'lower_colon': 28594, 'other': 1091, 'problemchars': 5}
# of unique users that contributed to this map: 1189
CPU times: user 48 s, sys: 464 ms, total: 48.4 s
Wall time: 48.5 s
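The tag counting inside initial_tests can be pictured with a sketch like the one below, which streams the 315 MB file with cElementTree's iterparse; the actual function in data_wrangling.py also checks tag keys for problematic characters and collects the unique user ids:

```python
import xml.etree.cElementTree as ET
from collections import defaultdict

def count_tags(s_filename):
    # stream the XML file and count how many times each tag name appears
    d_tags = defaultdict(int)
    for _, elem in ET.iterparse(s_filename):
        d_tags[elem.tag] += 1
        elem.clear()  # release memory as we go; the file is ~315 MB
    return dict(d_tags)
```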
To correct the street names, I had to create one dictionary of abbreviations for street types and another for title names. The following street names are some examples of what was corrected in the OSM file:
Av Jacú Pessego / Nova Trabalhadores => Avenida Jacú Pessego
AV PEDROSO DE MORAIS => Avenida Pedroso De Morais
Avenida Pres. Arthur Bernardes => Avenida Presidente Arthur Bernardes
Al. Santos => Alameda Santos
Av. Prof. Lúcio Martins Rodrigues, travessas 4 e 5 => Avenida Professor Lúcio Martins Rodrigues
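The correction step itself can be sketched as follows, assuming one dictionary for street types and another for title names (d_types and d_titles below are small illustrative subsets, not the full dictionaries from data_wrangling.py, and the sketch ignores details such as stripping suffixes like ", travessas 4 e 5"):

```python
# -*- coding: utf-8 -*-
# illustrative subsets of the abbreviation dictionaries
d_types = {u'av': u'Avenida', u'av.': u'Avenida', u'al.': u'Alameda', u'r.': u'Rua'}
d_titles = {u'pres.': u'Presidente', u'prof.': u'Professor', u'dr.': u'Doutor'}

def update_street_name(s_name):
    # expand the street type (first word) and a possible title name (second word),
    # then title-case the result
    l_words = s_name.split()
    if l_words and l_words[0].lower() in d_types:
        l_words[0] = d_types[l_words[0].lower()]
    if len(l_words) > 1 and l_words[1].lower() in d_titles:
        l_words[1] = d_titles[l_words[1].lower()]
    return u' '.join(l_words).title()

print update_street_name(u'AV PEDROSO DE MORAIS')  # Avenida Pedroso De Morais
```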
Another problem was misspelled CITY names. There are 15 names that were written incorrectly in the data set. To correct those, I adopted an approach similar to the one I used for street and title names: I iterated through the data set over and over until I came up with a dictionary of wrong words (in this case, wrong names). It can be checked in the data_wrangling.py file in my GitHub repository, as well as the dictionaries used to correct the street names. Here are some of the city names that were corrected:
são paulo - sp => São Paulo
são paulo/sp => São Paulo
guaruja => Guarujá
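The city fixes follow the same idea, using a small mapping built from the wrong names found while iterating over the data (the mapping below is an illustrative excerpt, not the full dictionary):

```python
# -*- coding: utf-8 -*-
# illustrative excerpt of the wrong name => right name mapping
d_city_fixes = {u'são paulo - sp': u'São Paulo',
                u'são paulo/sp': u'São Paulo',
                u'guaruja': u'Guarujá'}

def update_city_name(s_name):
    # fall back to the original name when no correction is known
    return d_city_fixes.get(s_name.strip().lower(), s_name)
```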
Finally, I reshaped the whole OSM file so it could be inserted into MongoDB. Here are some statistics about the process:
%time reshape.process_map(filename, pretty = False)
# of documents created: 1673561
# of different street names: 1533
# of docs where the street name was fixed: 6310
# of docs where the city name was fixed: 5048
CPU times: user 1min 27s, sys: 6.39 s, total: 1min 34s
Wall time: 1min 34s
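The documents created by process_map have the shape queried later in this post: address fields grouped under "address", creation metadata under "created", and top-level fields such as "amenity". Below is a minimal sketch of how a single XML element could be reshaped into such a document; the helper name shape_element and the details are an assumption about the approach, not the actual code in data_wrangling.py:

```python
CREATED = ['version', 'changeset', 'timestamp', 'user', 'uid']

def shape_element(elem):
    # elem is an xml.etree.cElementTree element obtained from iterparse;
    # turn a <node> or <way> element into a document such as
    # {"type": "node", "created": {"uid": ...}, "address": {"street": ..., "city": ...}, ...}
    if elem.tag not in ('node', 'way'):
        return None
    node = {'type': elem.tag, 'created': {}, 'address': {}}
    for k, v in elem.attrib.items():
        if k in CREATED:
            node['created'][k] = v
        else:
            node[k] = v
    for tag in elem.iter('tag'):
        k, v = tag.attrib['k'], tag.attrib['v']
        if k == 'addr:street':
            node['address']['street'] = v  # street name fixes are applied here
        elif k == 'addr:city':
            node['address']['city'] = v    # city name fixes are applied here
        elif ':' not in k:
            node[k] = v                    # e.g. the "amenity" field used in the queries below
    if not node['address']:
        del node['address']
    return node
```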
I used the following command to insert the data into MongoDB. This method was quite fast:
mongoimport --db udacity --collection osm --file /path/to/the/file.json
After putting the data in MongoDB, I checked whether the import worked correctly. First, I connected to my local database:
from pymongo import MongoClient
client=MongoClient('localhost:27017')
db=client.udacity
Then, I checked whether the city names were correct. Forty different city names were expected:
pipeline=[
{"$group": {"_id": None, "cities":{"$addToSet":"$address.city"}}},
{"$unwind": "$cities"},
{"$group": { "_id": "Total City Names in the data", "count": { "$sum": 1 } } }
]
aggregate(db.osm, pipeline)
[{u'_id': u'Total City Names in the data', u'count': 40}]
Afterward, I computed the number of different street names in the database (1533 expected):
pipeline=[
{"$group": {"_id": None, "street":{"$addToSet":"$address.street"}}},
{"$unwind": "$street"},
{"$group": { "_id": "Total number of different street names in the data", "count": { "$sum": 1 } } }
]
pprint.pprint(aggregate(db.osm, pipeline)[0])
{u'_id': u'Total number of different street names in the data', u'count': 1533}
Lastly, just to make sure that the OSM file was correctly processed by my code, I counted the number of unique users shown in the database (1189 expected):
pipeline=[
{"$match": {"type": "node"}},
{"$group": {"_id": None, "user":{"$addToSet":"$created.uid"}}},
{"$unwind": "$user"},
{"$group": { "_id": "Number of unique user in the data", "count": { "$sum": 1 } } }
]
pprint.pprint(aggregate(db.osm, pipeline)[0])
{u'_id': u'Number of unique user in the data', u'count': 1189}
As I showed earlier, there are 40 different cities in the database. The way the unique street names were counted before is not correct for this data set, since the same street name in different cities is counted as a single street when, in fact, they are different streets. To correct that, I had to group the street names by city before counting them.
pipeline=[
{"$group": {"_id": "$address.city", "street":{"$addToSet":"$address.street"}}},
{"$unwind": "$street"},
{"$group": { "_id": "Number of unique street names in the data, taking into account the city name",
"count": { "$sum": 1 } } }
]
pprint.pprint(aggregate(db.osm, pipeline))
[{u'_id': u'Number of unique street names in the data, taking into account the city name', u'count': 1734}]
The number of streets, 1734, already seems small for 40 cities, given that one of them, São Paulo, is among the biggest cities on the planet. Continuing the analysis, I computed the top 5 amenities in each city, querying it directly from MongoDB. It was tricky, as I did not find a way to "pivot table" the data directly in MongoDB with a single query. I solved it in the following way. First, I had to aggregate the data to create the structure needed for this analysis:
pipeline=[
{"$match":{"amenity":{"$exists":1}}},
{"$match":{"address.city":{"$exists":1}}},
{"$group": {"_id":{"city":"$address.city", "amenity":"$amenity"}, "count": {"$sum":1}}},
{"$sort" : {"count" : -1}},
{"$project": {"city": '$_id.city', 'amenity': '$_id.amenity', 'count':'$count'}},
{"$group":{'_id':"$city", 'result':{'$push':{"amenity":"$amenity","count":"$count"}}}},
{"$out": "osm_aux"}
]
x=aggregate(db.osm, pipeline)
I used two $group stages in the above pipeline*. In the first, I counted the documents using the city name and the amenity as keys. In the second, after reshaping that output with $project, I grouped again using just the city as the key, while pushing each amenity and its count into a result list.
The $out stage then wrote this output to a new collection called "osm_aux". Below, I updated that collection to trim each result list to its first 5 elements; because the counts were pushed in descending order, keeping the first 5 elements preserves the top 5 amenities per city.
* I took this approach from [here](http://stackoverflow.com/questions/25711657/mongodb-limit-array-within-aggregate-query)
my_collection=db.osm_aux
my_collection.update({},{"$push" : {"result": {"$each": [], "$slice":5}}},multi=True)
{u'n': 29, u'nModified': 7, u'ok': 1, 'updatedExisting': True}
Finally, here are the top 5 amenities by city:
for x in my_collection.find():
print "city: " + (x["_id"])
pprint.pprint(x["result"])
print ""
city: Arujá
[{u'amenity': u'Restaurant', u'count': 1}]

city: Jacareí
[{u'amenity': u'Place_Of_Worship', u'count': 1}]

city: Barueri
[{u'amenity': u'Bank', u'count': 2}]

city: Santos
[{u'amenity': u'Pharmacy', u'count': 3}, {u'amenity': u'Restaurant', u'count': 2}, {u'amenity': u'Bank', u'count': 2}, {u'amenity': u'Place_Of_Worship', u'count': 1}, {u'amenity': u'Fast_Food', u'count': 1}]

city: Mauá
[{u'amenity': u'School', u'count': 1}, {u'amenity': u'Place_Of_Worship', u'count': 1}]

city: Itaquaquecetuba
[{u'amenity': u'Fast_Food', u'count': 1}]

city: Ferraz De Vasconcelos
[{u'amenity': u'Place_Of_Worship', u'count': 1}, {u'amenity': u'School', u'count': 1}]

city: São Vicente
[{u'amenity': u'Hospital', u'count': 1}]

city: Itú
[{u'amenity': u'Place_Of_Worship', u'count': 1}]

city: Cubatão
[{u'amenity': u'Place_Of_Worship', u'count': 1}]

city: Jundiaí
[{u'amenity': u'Grave_Yard', u'count': 1}, {u'amenity': u'Fuel', u'count': 1}, {u'amenity': u'Place_Of_Worship', u'count': 1}, {u'amenity': u'Clinic', u'count': 1}, {u'amenity': u'Townhall', u'count': 1}]

city: Ibiúna
[{u'amenity': u'Public_Building', u'count': 1}]

city: Carapicuíba
[{u'amenity': u'College', u'count': 1}]

city: Osasco
[{u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'School', u'count': 1}]

city: Francisco Morato
[{u'amenity': u'School', u'count': 1}]

city: Mogi Das Cruzes
[{u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'Cafe', u'count': 1}]

city: São Caetano Do Sul
[{u'amenity': u'College', u'count': 2}, {u'amenity': u'Restaurant', u'count': 1}, {u'amenity': u'Pharmacy', u'count': 1}, {u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'School', u'count': 1}]

city: São Bernardo Do Campo
[{u'amenity': u'Parking', u'count': 94}, {u'amenity': u'School', u'count': 92}, {u'amenity': u'Fuel', u'count': 61}, {u'amenity': u'Restaurant', u'count': 61}, {u'amenity': u'Bank', u'count': 55}]

city: Embu Das Artes
[{u'amenity': u'Bank', u'count': 4}, {u'amenity': u'Parking', u'count': 4}, {u'amenity': u'Arts_Centre', u'count': 1}, {u'amenity': u'Place_Of_Worship', u'count': 1}]

city: Guarulhos
[{u'amenity': u'Fuel', u'count': 3}, {u'amenity': u'School', u'count': 2}, {u'amenity': u'University', u'count': 2}, {u'amenity': u'Place_Of_Worship', u'count': 2}, {u'amenity': u'Fire_Station', u'count': 1}]

city: Mairiporã
[{u'amenity': u'Parking', u'count': 1}]

city: São José Dos Campos
[{u'amenity': u'Community_Centre', u'count': 2}, {u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'Place_Of_Worship', u'count': 1}, {u'amenity': u'Cafe', u'count': 1}, {u'amenity': u'Dentist', u'count': 1}]

city: Cotia
[{u'amenity': u'Place_Of_Worship', u'count': 2}, {u'amenity': u'Pharmacy', u'count': 2}, {u'amenity': u'Bank', u'count': 1}, {u'amenity': u'Fuel', u'count': 1}]

city: Suzano
[{u'amenity': u'Fuel', u'count': 1}]

city: Itanhaém
[{u'amenity': u'Townhall', u'count': 1}]

city: Diadema
[{u'amenity': u'Fuel', u'count': 16}, {u'amenity': u'Place_Of_Worship', u'count': 13}, {u'amenity': u'Bank', u'count': 12}, {u'amenity': u'School', u'count': 8}, {u'amenity': u'Restaurant', u'count': 7}]

city: Santo André
[{u'amenity': u'Bank', u'count': 28}, {u'amenity': u'School', u'count': 9}, {u'amenity': u'Post_Office', u'count': 8}, {u'amenity': u'Restaurant', u'count': 5}, {u'amenity': u'Place_Of_Worship', u'count': 5}]

city: São Paulo
[{u'amenity': u'Restaurant', u'count': 97}, {u'amenity': u'School', u'count': 70}, {u'amenity': u'Fuel', u'count': 57}, {u'amenity': u'Bank', u'count': 50}, {u'amenity': u'Parking', u'count': 49}]

city: Taboão Da Serra
[{u'amenity': u'Fuel', u'count': 1}, {u'amenity': u'School', u'count': 1}]
Summing up just the top 5 amenities by city, something curious came to the fore:
pipeline=[
{"$unwind": "$result"},
{"$group": {"_id":"$_id", "sum": {"$sum":"$result.count"}}},
{"$sort" : {"sum" : -1}},
{"$limit": 10}
]
x=aggregate(db.osm_aux, pipeline)
pprint.pprint(x)
[{u'_id': u'S\xe3o Bernardo Do Campo', u'sum': 363},
 {u'_id': u'S\xe3o Paulo', u'sum': 323},
 {u'_id': u'Diadema', u'sum': 56},
 {u'_id': u'Santo Andr\xe9', u'sum': 55},
 {u'_id': u'Guarulhos', u'sum': 10},
 {u'_id': u'Embu Das Artes', u'sum': 10},
 {u'_id': u'Santos', u'sum': 9},
 {u'_id': u'S\xe3o Caetano Do Sul', u'sum': 6},
 {u'_id': u'Cotia', u'sum': 6},
 {u'_id': u'S\xe3o Jos\xe9 Dos Campos', u'sum': 6}]
The largest top-5 amenity counts are not in São Paulo but in São Bernardo do Campo, which is near São Paulo. São Bernardo is not a small city, but it is certainly much smaller than São Paulo. To draw valid conclusions from this data, I believe I would need a data set that is more complete and more evenly proportioned across the cities. Just to point out, the most frequent amenity in São Bernardo was parking.
Lastly, as the new collection has already fulfilled its mission, I can simply discard it.
my_collection.drop()
To support what I have claimed about the size of the database (recall that the total number of unique street names across the 40 cities in MongoDB was 1734), I used the streets data set provided by the CMS to count distinct street names just in São Paulo and São Bernardo:
%time set_aux=count_streets_in_dbf(fr_dbf)
i_saopaulo=0
i_saobernardo=0
for x in set_aux:
    if x.split("|")[1].strip() == 'sao paulo': i_saopaulo += 1
    if x.split("|")[1].strip() == 'sao bernardo do campo': i_saobernardo += 1
CPU times: user 30.1 s, sys: 844 ms, total: 31 s Wall time: 31 s
print "Unique Street Names in São Paulo: {}\nUnique Street Names in São Bernardo: {}".format(i_saopaulo,i_saobernardo)
Unique Street Names in São Paulo: 43103
Unique Street Names in São Bernardo: 3289
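For reference, a helper like count_streets_in_dbf can be sketched with the dbfread package, building a set of "street|city" keys so that the same street name in two different cities counts twice. The column names below (NOMELOG, MUNICIPIO) and the encoding are assumptions about the CMS file, not verified field names:

```python
from dbfread import DBF  # assumes the dbfread package is used to read the CMS .DBF file

def count_streets_in_dbf(s_path):
    # build a set of "street|city" keys so the same street name in
    # different cities is counted separately
    set_streets = set()
    for record in DBF(s_path, encoding='latin-1'):
        # NOMELOG and MUNICIPIO are assumed column names for street and city;
        # city names are assumed to come unaccented in the file (e.g. 'sao paulo')
        s_key = u'{}|{}'.format(record['NOMELOG'], record['MUNICIPIO'])
        set_streets.add(s_key.strip().lower())
    return set_streets
```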
As I expected, the OSM file for the São Paulo area has far too few records. São Paulo city alone has more than 20 times as many unique street names as were computed from the OSM data set covering all 40 cities.
I believe this incompleteness in addresses can be a hurdle to new contributions, given that, besides contributing locations to the map, such as coffee shops and stores, the user would also need to contribute the addresses, something people often have no idea about. This problem can end up preventing apps from using this database to provide their services, apps that could otherwise encourage contributions of new data.
A way to overcome this issue would be to cross-reference the database with other sources to fill in the missing addresses, such as the CMS streets data set used above. However, matching different data sets can be really challenging because open databases usually contain many misspelled street names, given that the Portuguese language has many accentuation and spelling rules that people often bypass. If I had to do that, I would probably replace the special characters with their "non-special"* counterparts (ç by c, á by a) and then create a dictionary of "wrong words => right words" before matching the existing data points.
* A useful package to achieve that.
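Here is a minimal sketch of that character replacement using only the standard library (third-party packages such as unidecode do the same job more thoroughly):

```python
# -*- coding: utf-8 -*-
import unicodedata

def strip_accents(s_text):
    # decompose accented characters and drop the combining marks,
    # so u'ç' becomes 'c' and u'á' becomes 'a'
    s_nfkd = unicodedata.normalize('NFKD', s_text)
    return s_nfkd.encode('ascii', 'ignore')

print strip_accents(u'Avenida Jacú Pessego')  # Avenida Jacu Pessego
```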
I believe the OpenStreetMap data is satisfactory for the purpose of this data wrangling exercise, but too incomplete to draw statistical conclusions from. I suggested that other data sets could be used to feed more addresses into OpenStreetMap.org, and that this could encourage others to contribute more locations.