name = "2015-10-26-007"
title = "NLTK, maps, and 007"
%matplotlib inline
import os
from datetime import datetime
from IPython.core.display import HTML
with open('creative_commons.txt', 'r') as f:
    html = f.read()
hour = datetime.utcnow().strftime('%H:%M')
comments="true"
date = '-'.join(name.split('-')[:3])
slug = '-'.join(name.split('-')[3:])
metadata = dict(title=title,
                date=date,
                hour=hour,
                comments=comments,
                slug=slug,
                name=name)
markdown = """Title: {title}
date: {date} {hour}
comments: {comments}
slug: {slug}
{{% notebook {name}.ipynb cells[2:] %}}
""".format(**metadata)
content = os.path.abspath(os.path.join(os.getcwd(),
                                       os.pardir,
                                       os.pardir,
                                       '{}.md'.format(name)))
with open('{}'.format(content), 'w') as f:
    f.writelines(markdown)
html = '''
<small>
<p> This post was written as an IPython notebook.
It is available for <a href='https://ocefpaf.github.com/python4oceanographers/downloads/notebooks/%s.ipynb'>download</a>
or as a static <a href='https://nbviewer.ipython.org/url/ocefpaf.github.com/python4oceanographers/downloads/notebooks/%s.ipynb'>html</a>.</p>
<p></p>
%s''' % (name, name, html)
The Natural Language Toolkit (NLTK) is one of those amazing pieces of software that surprise us just by the fact that they work. Just Google it and be amazed!
Recently I found the module geograpy, which extracts geographical information (countries, regions, and cities) from text using NLTK.
So why not write a post that combines something I always wanted to try (NLTK) with something I like (maps)? First let's take a look at how geograpy works.
import geograpy
text = "Paris is the city of love!"
places = geograpy.get_place_context(text=text)
places
<geograpy.places.PlaceContext at 0x7f26bd1d53d0>
Cool! We get a PlaceContext object. Let's explore this object.
places.cities
[u'Paris']
places.countries
[u'France', u'United States', u'Canada']
Anyone reading the phrase above knows I meant the muddy tidal-flat city of Lutetia and not the other two cities named Paris in the USA and Canada. That is a limitation of automated text parsers: they cannot interpret context the way humans do. Still, that is pretty amazing!
This gets even more complicated when the city name is British*. You'll get hits all over the Empire!
* Or maybe worse for Spanish names, if geograpy could handle languages other than English. Portuguese would be OK though: they made a habit of naming things after local features. There is no New Lisbon anywhere in Brazil.
city = "Victoria"
text = "How many countries with a city named {} can you find?".format(city)
places = geograpy.get_place_context(text=text)
countries = places.countries
print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
Found 12 countries for the city Victoria: United States, Canada, Seychelles, United Kingdom, Malta, Romania, Malaysia, Mexico, Chile, Argentina, Trinidad and Tobago, Panama
OK. I cheated by choosing Victoria. Let's try a more British-like name.
city = "Richmond"
text = "How many countries with a city named {} can you find?".format(city)
places = geograpy.get_place_context(text=text)
countries = places.countries
print('Found {} countries for the city {}:\n{}'.format(len(countries), city, ', '.join(countries)))
Found 4 countries for the city Richmond: United Kingdom, United States, Australia, Canada
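To see where the extra countries come from, here is a minimal sketch of a name-only lookup. It assumes (I have not checked geograpy's internals) that a city name is matched against a table of known cities with no regard for context. `TOY_GAZETTEER` and `countries_for` are hypothetical names, and the table only repeats the results shown above.

```python
# Hypothetical toy gazetteer: city name -> countries with a city of that name.
# The entries repeat the outputs above; this is not geograpy's real database.
TOY_GAZETTEER = {
    "Paris": ["France", "United States", "Canada"],
    "Richmond": ["United Kingdom", "United States", "Australia", "Canada"],
}


def countries_for(text, gazetteer=TOY_GAZETTEER):
    """Return every country that has a city whose name appears in `text`."""
    found = []
    # Crude tokenizer: strip the punctuation used in the examples, split on spaces.
    for word in text.replace("!", " ").replace("?", " ").split():
        for country in gazetteer.get(word, []):
            if country not in found:
                found.append(country)
    return found


print(countries_for("Paris is the city of love!"))
# → ['France', 'United States', 'Canada']
```

A lookup like this has no way of knowing which Paris the sentence is about, so every match is reported.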
I guess that this is enough to make the point that there will be plenty of false positives. With that in mind let's try something more challenging.
The new Bond movie is coming up and, as a fan, I am excited to see it. While I wait for the movie, let's parse all of Ian Fleming's books and find out in how many places around the world 007 has used his license to kill.
I will explain the code in the cells below using bad Bond puns.
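def utf8toascii(text):
    # Python 2: decode the UTF-8 bytes, then drop every non-ASCII character.
    return text.decode("utf-8").encode("ascii", "ignore")


def read_in_chunks(file_object, chunk_size=2048):
    # Lazily yield the file in pieces of at most `chunk_size` bytes.
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data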
import os
import cPickle as pickle

if not os.path.exists('./data/books.pickle'):
    import geograpy
    from glob import glob

    books = dict()
    for book in glob('*.txt'):
        countries = []
        with open(book) as f:
            for chunk in read_in_chunks(f):
                try:
                    chunk = utf8toascii(chunk)
                except UnicodeDecodeError:
                    # The chunk could not be decoded (e.g. a multi-byte
                    # character split at the chunk boundary); skip it.
                    continue
                p = geograpy.get_place_context(text=chunk)
                countries.extend(p.countries)
        book_name = book.split('.txt')[0]
        books.update({book_name: countries})
    with open('./data/books.pickle', 'wb') as f:
        pickle.dump(books, f)
else:
    with open('./data/books.pickle', 'rb') as f:
        books = pickle.load(f)
%matplotlib inline
import pandas as pd
import numpy as np
from collections import Counter

dfs = []
for book, countries in books.items():
    book = book.split('-ian_fleming')[0]
    # One column per book: how often each country was found in it.
    labels, values = zip(*Counter(countries).items())
    dfs.append(pd.DataFrame(np.array(values), index=labels, columns=[book]))
df = pd.concat(dfs, axis=1)
# Total mentions per country, summed over all books.
all_books = df.T.sum().sort_index()
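The counting step above boils down to one `Counter` per book plus a grand total per country. Here is a minimal standard-library sketch of the same idea; the book names and mention lists in `toy_books` are made up purely for illustration.

```python
from collections import Counter

# Made-up mention lists, only to illustrate the counting step.
toy_books = {
    "book_a": ["Russia", "France", "Russia"],
    "book_b": ["Russia", "Jamaica"],
}

# One Counter per book, like one DataFrame column per book.
per_book = {book: Counter(countries) for book, countries in toy_books.items()}

# Sum the per-book counts into a total per country.
totals = Counter()
for counts in per_book.values():
    totals.update(counts)

print(sorted(totals.items()))
# → [('France', 1), ('Jamaica', 1), ('Russia', 3)]
```

The pandas version gives the same totals and keeps the per-book breakdown in a table, which is handy for later plotting.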
if not os.path.exists('./data/positions.pickle'):
    import time
    from geopy import GeoNames
    from geopy.exc import GeocoderTimedOut

    positions = dict()
    # `username` must already hold a GeoNames account name.
    geolocator = GeoNames(username=username)
    for country in df.index:
        while True:
            try:
                position = geolocator.geocode(country)
            except GeocoderTimedOut:
                # The service timed out; wait and retry the same country.
                time.sleep(5)
                continue
            break
        if position:
            location = [position.latitude, position.longitude]
            positions.update({country: location})
        else:
            print("Could not get position for {}".format(country))
    with open('./data/positions.pickle', 'wb') as f:
        pickle.dump(positions, f)
else:
    with open('./data/positions.pickle', 'rb') as f:
        positions = pickle.load(f)
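The geocoding loop above retries forever whenever a request fails. A bounded variant might look like the sketch below. `geocode_with_retry` and `flaky_geocode` are hypothetical helpers, not part of geopy, and the returned coordinates are dummy values; the `sleep` argument exists only so the example runs without actually waiting.

```python
import time


def geocode_with_retry(geocode, query, retries=3, wait=5,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call `geocode(query)`, retrying up to `retries` times on failure."""
    for attempt in range(retries):
        try:
            return geocode(query)
        except exceptions:
            if attempt == retries - 1:
                raise  # Out of attempts: let the caller see the error.
            sleep(wait)


# Hypothetical stand-in for a geocoder that fails twice before answering.
calls = {"n": 0}


def flaky_geocode(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("timed out")
    return (0.0, 0.0)  # Dummy coordinates, not a real position.


result = geocode_with_retry(flaky_geocode, "Russia", sleep=lambda seconds: None)
print(result, calls["n"])
```

With a real geocoder one would pass `geolocator.geocode` as the first argument and narrow `exceptions` to the timeout error, so that problems like a wrong username fail right away instead of being retried.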
import folium

mapa = folium.Map(tiles="Cartodb dark_matter", location=[0, 0], zoom_start=2)
for country, location in positions.items():
    times = int(all_books[country])
    popup = "{} was mentioned {} times.".format(country, times)
    mapa.simple_marker(location=location, popup=popup,
                       marker_icon="ok",
                       marker_color="orange",
                       clustered_marker=True)
mapa
I am pretty sure Uruguay was not mentioned 52 times in Ian Fleming's novels. On the other hand, 139 mentions of Russia sounds about right ;-)
HTML(html)