Identifying Location Trends in Foursquare

In [7]:
from IPython.display import HTML
HTML('<iframe src="http://en.wikipedia.org/wiki/Foursquare" width=100% height=350></iframe>')
Out[7]:

Task:

  • Find trending locations on Foursquare by comparing some metrics at t1 to t0
  • Find metrics that could be useful
  • Map the data

Initialization

First, you have to find your access tokens to use the Foursquare API with reasonable rate limits.

If you have an access token, you can use that, otherwise register an app an use the client id and secret for the following steps. https://de.foursquare.com/developers/register

In [1]:
import foursquare
import pandas as pd

#ACCESS_TOKEN = ""
#client = foursquare.Foursquare(access_token=ACCESS_TOKEN)

CLIENT_ID = ""
CLIENT_SECRET = ""
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
  1. Starting at Munich Marienplatz as seed venue
  2. Fetching the 5 next venues (from API nextvenues, see: https://developer.foursquare.com/docs/venues/nextvenues)
  3. For each of the 5 venues fetch the 5 next venues
  4. Repeat until saturation (no new locations)
In [2]:
bbox = [11.109872,47.815652,12.068588,48.397136] # bounding box for Munich
# bbox = [13.088400,52.338120,13.761340,52.675499] # bounding box for Berlin
#bbox = [5.866240,47.270210,15.042050,55.058140] # bounding box for Germany
In [436]:
new_crawl = [] # list of locations to be crawled
done = [] # list of crawled locations
links = [] # list of tuples that represent links between locations
venues = pd.DataFrame() # dictionary of locations id => meta-data on location

Set seed values for Marienplatz, Airport and Central Station.

Depth is the number of recursive crawling processes.

In [438]:
to_crawl = ["4ade0ccef964a520246921e3", "4cbd1bfaf50e224b160503fc", "4b0674e2f964a520f4eb22e3"]
depth = 25
In [439]:
for i in range(depth):
    new_crawl = []
    print "Step " + str(i) + ": " + str(len(venues)) + " locations and " + str(len(links)) + " links. " + str(len(to_crawl)) + " venues to go."
    for v in to_crawl:
        if v not in venues:
            res = client.venues(v)
            venues = venues.append(pd.DataFrame({"name":res["venue"]["name"],"users":res["venue"]["stats"]["usersCount"],
            "checkins":res["venue"]["stats"]["checkinsCount"], "lat":res["venue"]["location"]["lat"], 
            "lng":res["venue"]["location"]["lng"]}, index=[v]))
        next_venues = client.venues.nextvenues(v)
        for nv in next_venues['nextVenues']['items']:
            if ((nv["location"]["lat"] > bbox[1]) & (nv["location"]["lat"] < bbox[3]) & 
                (nv["location"]["lng"] > bbox[0]) & (nv["location"]["lng"] < bbox[2])):
                if nv["id"] not in venues:
                    venues = venues.append(pd.DataFrame({"name":nv["name"],"users":nv["stats"]["usersCount"],
                    "checkins":nv["stats"]["checkinsCount"], "lat":nv["location"]["lat"], 
                    "lng":nv["location"]["lng"]}, index=[nv["id"]]))
                if (nv["id"] not in done) & (nv["id"] not in to_crawl) & (nv["id"] not in new_crawl):
                    new_crawl.append(nv["id"])
                links.append((v, nv["id"]))
        done.append(v)
    to_crawl = new_crawl
Step 0: 0 locations and 0 links. 3 venues to go.
Step 1: 13 locations and 10 links. 8 venues to go.
Step 2: 59 locations and 48 links. 19 venues to go.
Step 3: 167 locations and 137 links. 14 venues to go.
Step 4: 240 locations and 196 links. 18 venues to go.
Step 5: 327 locations and 265 links. 22 venues to go.
Step 6: 425 locations and 341 links. 25 venues to go.
Step 7: 556 locations and 447 links. 31 venues to go.
Step 8: 720 locations and 580 links. 31 venues to go.
Step 9: 876 locations and 705 links. 39 venues to go.
Step 10: 1046 locations and 836 links. 27 venues to go.
Step 11: 1134 locations and 897 links. 19 venues to go.
Step 12: 1201 locations and 945 links. 12 venues to go.
Step 13: 1232 locations and 964 links. 5 venues to go.
Step 14: 1250 locations and 977 links. 8 venues to go.
Step 15: 1286 locations and 1005 links. 11 venues to go.
Step 16: 1329 locations and 1037 links. 4 venues to go.
Step 17: 1351 locations and 1055 links. 7 venues to go.
Step 18: 1377 locations and 1074 links. 3 venues to go.
Step 19: 1388 locations and 1082 links. 1 venues to go.
Step 20: 1391 locations and 1084 links. 0 venues to go.
Step 21: 1391 locations and 1084 links. 0 venues to go.
Step 22: 1391 locations and 1084 links. 0 venues to go.
Step 23: 1391 locations and 1084 links. 0 venues to go.
Step 24: 1391 locations and 1084 links. 0 venues to go.

Generating the network

We're importing networkx to build the network out of our crawled venues (= nodes) and links between them.

In [440]:
venues = venues.reset_index().drop_duplicates(cols='index',take_last=True).set_index('index')
venues.head()
Out[440]:
checkins lat lng name users
index
4cbd1bfaf50e224b160503fc 244992 48.352599 11.780992 München Flughafen "Franz Josef Strauß" (MUC) 91098
4b0674e2f964a520f4eb22e3 91099 48.140547 11.555772 München Hauptbahnhof 19551
4b56f6eef964a520ec2028e3 1913 48.137558 11.579466 Augustiner am Platzl 1533
4bbc6329afe1b7136d4d304b 2979 48.135282 11.576350 Biergarten am Viktualienmarkt 1691
4b335a36f964a520d51825e3 3406 48.134848 11.575170 Der Pschorr 2409
In [441]:
labels = venues["name"].to_dict()
In [442]:
import networkx as nx
G = nx.DiGraph()
G.add_nodes_from(venues.index)
for f,t in links:
    G.add_edge(f, t)
In [443]:
nx.info(G)
Out[443]:
'Name: \nType: DiGraph\nNumber of nodes: 307\nNumber of edges: 1084\nAverage in degree:   3.5309\nAverage out degree:   3.5309'

Calculate some useful metrics:

  • Betweenness Centrality: Number of shortest paths in the network that pass through this node
  • Page Rank: Random Walk through city
In [444]:
pagerank = nx.pagerank(G,alpha=0.9)
betweenness = nx.betweenness_centrality(G)

venues['pagerank'] = [pagerank[n] for n in venues.index]
venues['betweenness'] = [betweenness[n] for n in venues.index]

Plot some stats

In [445]:
%pylab inline
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111)
venues.sort('users', inplace=True)
venues.set_index('name')[-20:].users.plot(kind='barh')
ax.set_ylabel('Location')
ax.set_xlabel('Users')
ax.set_title('Top 20 Locations by Users')
plt.show()
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['f']
`%matplotlib` prevents importing * from pylab and numpy
In [446]:
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111)
venues.sort('checkins', inplace=True)
venues.set_index('name')[-20:].checkins.plot(kind='barh')
ax.set_ylabel('Location')
ax.set_xlabel('Checkins')
ax.set_title('Top 20 Locations by Checkins')
plt.show()
In [447]:
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111)
venues.sort('pagerank', inplace=True)
venues.set_index('name')[-20:].pagerank.plot(kind='barh')
ax.set_ylabel('Location')
ax.set_xlabel('Pagerank')
ax.set_title('Top 20 Locations by Pagerank')
plt.show()
In [448]:
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111)
venues.sort('betweenness', inplace=True)
venues.set_index('name')[-20:].betweenness.plot(kind='barh')
ax.set_ylabel('Location')
ax.set_xlabel('Pagerank')
ax.set_title('Top 20 Locations by Betweenness Centrality')
plt.show()

Visualize the network

In [449]:
fig = plt.figure(figsize=(16, 9), dpi=150)
graph_pos=nx.spring_layout(G)
nodesize = [10000*n for n in pagerank.values()]
nx.draw_networkx_nodes(G,graph_pos,node_size=nodesize, alpha=0.5, node_color='blue')
nx.draw_networkx_edges(G,graph_pos,width=0.8, alpha=0.4,edge_color='blue')
nx.draw_networkx_labels(G, graph_pos, labels=labels, font_size=10, font_family='Arial')
plt.axis('off')
plt.show()

Finally, save the network for further analysis e.g. in Gephi

In [450]:
nx.write_graphml(G, "./fs_loc_muc_l_jul14.graphml")

Trend Research with location data

Loading historical location data gathered with the same method in May 2014

In [261]:
F = nx.read_graphml("fs_loc_muc_l_may14.graphml")
In [262]:
pagerank_old = nx.pagerank(F,alpha=0.9,max_iter=200)
betweenness_old = nx.betweenness_centrality(F)

venues_old['pagerank'] = [pagerank_old[n] for n in venues_old.index]
venues_old['betweenness'] = [betweenness_old[n] for n in venues_old.index]
In [158]:
fig = plt.figure(figsize=(16, 9), dpi=150)
graph_pos=nx.spring_layout(F)
nodesize = [10000*n for n in pagerank_old.values()]
nx.draw_networkx_nodes(F,graph_pos,node_size=nodesize, alpha=0.5, node_color='red')
nx.draw_networkx_edges(F,graph_pos,width=0.8, alpha=0.4,edge_color='red')
nx.draw_networkx_labels(F, graph_pos, labels=venues_old["name"].to_dict(), font_size=10, font_family='Arial')
plt.axis('off')
plt.show()