Visualization using Bokeh and MPLD3¶

In this lab, you are going to use Bokeh and MPLD3 to make some interactive visualizations. MPLD3 is a Matplotlib-based API built on top of D3.js. Bokeh is a self-contained tool that uses JavaScript to provide interactivity in a browser. If you want to understand their magic and create highly customized visualizations, we encourage you to learn some D3 using this free ebook.

Bokeh Overview¶

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. You describe a plot through its Python API, and Bokeh renders it in the browser. It also supports the IPython notebook, so we can easily embed graphics in our notebook pages.

To use it, you first need to install its Python package:

sudo pip install bokeh

After that, we can import the bokeh library and enable the notebook support. You should be able to see "BokehJS successfully loaded." after you run the following code.

In [ ]:

from bokeh.plotting import *
output_notebook()

Multidimensional data visualization¶

Then we can begin to play with some data. We first load some sample data about the performance of some cars. The data is a pandas DataFrame, and we can use .head() to see what is inside.

In [ ]:

from bokeh.sampledata.autompg import autompg
autompg.head(10)

Next we transform the data into a format that Bokeh can use. ColumnDataSource() returns an object that holds the columns in a dict-like structure. You can use help() to see what's inside.

In [ ]:

source = ColumnDataSource(autompg.to_dict("list"))
#help(source)
#help(source.data)
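
To make the shape of that data concrete, here is a minimal sketch in plain Python (no Bokeh needed) of the column-oriented layout that autompg.to_dict("list") produces: one key per column, each mapping to a list of values. The three rows below are made-up placeholders rather than values from the real dataset, and rows_to_columns is a hypothetical helper used only for illustration.

```python
# Hypothetical rows standing in for a few lines of the autompg table.
rows = [
    {"mpg": 18.0, "cyl": 8, "yr": 70},
    {"mpg": 26.0, "cyl": 4, "yr": 72},
    {"mpg": 31.5, "cyl": 4, "yr": 77},
]

def rows_to_columns(records):
    """Turn a list of row dicts into a dict of equal-length column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

columns = rows_to_columns(rows)
print(columns["mpg"])
```

Each glyph call can then refer to a column simply by its key, which is why the plotting code only needs column names such as "yr" and "mpg".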

Using the data, we can create some visualizations. First we need to configure the plot. We need to set the height and width of our plot, and choose the tools we would like to provide within the visualization.

In [ ]:

plot = figure(plot_width=400, plot_height=400, title="MPG by Year",  tools="pan,wheel_zoom,box_zoom,box_select,reset")

With the data and the configuration, we can make a scatterplot using the circle() command. Here we need to provide two columns of the data to make the 2-D scatterplot. Since the data is already formatted, we only need to use the column name as the indicator.

Also, we can use an extra column to encode the size of each circle. Here we use the "cyl" column. Therefore, for each circle in the graphics, its x,y location will be determined by "yr" and "mpg", while its size is determined by "cyl". The color will always be blue.

In [ ]:

plot.circle("yr", "mpg", size="cyl", color="blue", source=source)
show(plot)

We have already created a nice scatterplot, but a single plot is not much fun for such multidimensional data. Here we can create a grid of scatterplots and interactively explore the relationships among the columns.

In [ ]:

fig1 = figure(plot_width=300, plot_height=300, title="MPG by Year", tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig1.circle("yr", "mpg", color="blue", source=source)
fig2 = figure(plot_width=300, plot_height=300, title="HP vs. Displacement", tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig2.circle("hp", "displ", color="green", source=source)
fig3 = figure(plot_width=300, plot_height=300, title="MPG vs. Displacement", tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig3.circle("mpg", "displ", size="cyl", line_color="red", fill_color=None, source=source)
p = gridplot([[fig1, fig2], [None, fig3]])
show(p)

TODO: Use the selection tool to explore the data. For example, what's the distribution of MPG when you select different clusters of data in the "HP vs. Displacement" figure?
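
Because all three figures share the same source, a selection made in one plot is a set of row indices that applies to every plot. The sketch below mimics that idea offline in plain Python: it picks rows by a condition on "hp" and "displ" and summarizes "mpg" for just those rows. The column values and the thresholds are made up for illustration, and select_indices is a hypothetical helper, not part of Bokeh.

```python
from statistics import mean

# Made-up columns in the same dict-of-lists shape used by the source.
data = {
    "hp":    [130, 165, 95, 88, 46, 150],
    "displ": [307, 350, 113, 97, 97, 318],
    "mpg":   [18.0, 15.0, 24.0, 27.0, 26.0, 14.0],
}

def select_indices(columns, predicate):
    """Return row indices whose (hp, displ) pair satisfies the predicate."""
    pairs = zip(columns["hp"], columns["displ"])
    return [i for i, (hp, displ) in enumerate(pairs) if predicate(hp, displ)]

# Imitate a box selection around the high-power, large-displacement cluster.
selected = select_indices(data, lambda hp, displ: hp > 120 and displ > 300)
selected_mpg = [data["mpg"][i] for i in selected]

print(selected, mean(selected_mpg))
```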

Here you can make your own plots using other dimensions of the dataset.

In [ ]:

fig4 = figure(plot_width=300, plot_height=300, tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig5 = figure(plot_width=300, plot_height=300, tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig6 = figure(plot_width=300, plot_height=300, tools="pan,wheel_zoom,box_zoom,box_select,reset")
fig7 = figure(plot_width=300, plot_height=300, tools="pan,wheel_zoom,box_zoom,box_select,reset")
# include some other plots here

TODO: Try to make more plots to explore the dataset using different data columns, and briefly describe your findings. For example, which columns may have high correlation?
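
If you want to back up a visual impression with a number, the Pearson correlation coefficient is one common choice. The following is a minimal plain-Python sketch; pearson is a hypothetical helper written for this lab, and the three short lists are made-up values, not the real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values; with the real data you could pass two columns
# of the DataFrame, e.g. autompg["hp"] and autompg["displ"].
hp = [130, 165, 95, 88, 46, 150]
displ = [307, 350, 113, 97, 97, 318]
mpg = [18.0, 15.0, 24.0, 27.0, 26.0, 14.0]

print(round(pearson(hp, displ), 2))
print(round(pearson(hp, mpg), 2))
```

Values near +1 or -1 suggest a strong linear relationship, while values near 0 suggest little linear relationship.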

Geodata visualization¶

Next we'll explore a geographic dataset. You'll need to download the sample dataset first.

In [ ]:

from bokeh.sampledata import download
download()

Then we import two datasets. The us_counties dataset has the boundary of each county in the US; each boundary is represented as a polygon, using longitude and latitude as the coordinates. The unemployment dataset has the unemployment rate for each county. The two datasets use the same key for each county.

In [ ]:

from bokeh.sampledata import us_counties, unemployment
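# dict.items() cannot be indexed in Python 3, so take the first item with next(iter(...))
print(next(iter(us_counties.data.items())))
print(next(iter(unemployment.data.items())))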

Using the data, we would like to explore the unemployment rates in California. We start by extracting the boundary data for the counties inside CA. Then we look up the unemployment rate for each county using the county_id, and scale it into an integer that can be mapped to a color.

In [ ]:

from collections import OrderedDict

county_xs=[
    us_counties.data[county_id]['lons'] for county_id in us_counties.data
    if us_counties.data[county_id]['state'] == 'ca'
]
county_ys=[
    us_counties.data[county_id]['lats'] for county_id in us_counties.data
    if us_counties.data[county_id]['state'] == 'ca'
]

# D3 category10 colors
colors =["#1f77b4","#ff7f0e","#2ca02c","#d62728","#9467bd","#8c564b","#e377c2","#7f7f7f","#bcbd22"]

# light-to-dark encoding (try it)
# colors = ["#f7fcfd","#e5f5f9","#ccece6","#99d8c9","#66c2a4","#41ae76","#238b45","#006d2c","#00441b"]


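rates = []
names = []
for county_id in us_counties.data:
    if us_counties.data[county_id]['state'] != 'ca':
        continue
    # Append the name first so names and rates stay the same length,
    # even when a county has no unemployment entry
    names.append(us_counties.data[county_id]['name'])
    try:
        rates.append(unemployment.data[county_id])
    except KeyError:
        rates.append(0)
min_rate = min(rates)
max_rate = max(rates)
# One color per percentage point above the minimum, clamped to the last color.
# Lists (not map objects) are used so the code also works in Python 3.
idx = [min(int(rate - min_rate), 8) for rate in rates]
county_colors = [colors[i] for i in idx]
alldata = ColumnDataSource({"name": names, "rate": rates})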
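
The index computation above assigns one color per whole percentage point above the minimum and clamps everything beyond the ninth bucket to the last color. As a hedged alternative, the sketch below stretches the buckets over the full observed range so that the last color always corresponds to the maximum rate. Both functions and the sample rates are illustrative assumptions, not part of Bokeh.

```python
def clamp_index(rate, min_rate, n_colors=9):
    """Scheme used above: one bucket per percentage point, clamped to the last color."""
    return min(int(rate - min_rate), n_colors - 1)

def scaled_index(rate, min_rate, max_rate, n_colors=9):
    """Alternative: spread the buckets evenly between min_rate and max_rate."""
    if max_rate == min_rate:
        return 0
    fraction = (rate - min_rate) / (max_rate - min_rate)
    return min(int(fraction * n_colors), n_colors - 1)

sample_rates = [5.0, 7.5, 12.0, 27.0]  # made-up unemployment rates in percent
lo, hi = min(sample_rates), max(sample_rates)

clamped = [clamp_index(r, lo) for r in sample_rates]
scaled = [scaled_index(r, lo, hi) for r in sample_rates]
print(clamped)
print(scaled)
```

The clamped scheme keeps the buckets easy to read (one point per color), while the scaled scheme uses the whole palette regardless of how wide the range is.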

Then we can use the patches() command to plot a map. The patches() function draws polygons; it takes two arrays, x and y, that give the locations of the vertices. The fill_color option gives each polygon a color. We also need to pass some other data (including the name and unemployment rate of each county) using the "source" option.

You can use the output_file() and save() commands to save the figure into a local HTML file. (This works for other kinds of plots as well.)

In [ ]:

fig8 = figure(plot_width=500, plot_height=500, title="California Unemployment 2009")

fig8.patches(county_xs, county_ys, fill_color=county_colors, fill_alpha=0.7,
        line_color="white", line_width=0.5, source=alldata)


#output_file("california.html")
#save()
show(fig8)

Here we used the D3 category10 colors to encode the unemployment level, which is good for categorical data but not as good for numerical data. Try the color gradient suggested in the code above. You can also go to websites like ColorBrewer 2.0 to pick out some nice color gradients and use them to encode the data.
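
If you would rather generate a gradient than copy one, a simple approach is to interpolate linearly between two end colors in RGB space. The sketch below does this in plain Python, using the first and last colors of the commented-out light-to-dark palette as end points. The helper names are hypothetical, and note that a plain RGB interpolation is only an approximation: it will not reproduce the hand-tuned intermediate steps of a curated palette.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to a tuple of three integers."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert a tuple of three integers to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)

def gradient(start, end, steps):
    """Return `steps` colors (steps >= 2) evenly spaced from start to end."""
    s, e = hex_to_rgb(start), hex_to_rgb(end)
    result = []
    for k in range(steps):
        t = k / (steps - 1)
        mixed = tuple(round(a + (b - a) * t) for a, b in zip(s, e))
        result.append(rgb_to_hex(mixed))
    return result

palette = gradient("#f7fcfd", "#00441b", 9)
print(palette)
```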

Overview of MPLD3¶

MPLD3 is a Python library that brings matplotlib into the browser. Its Python backend converts a regular matplotlib figure into JSON, and a D3-based JavaScript frontend renders that JSON as SVG. It also makes the original static matplotlib figures more interactive.

In order to use it, we first need to install the python package using:

sudo pip install mpld3

Then we can start to use it in the python notebook. If we want to plot figures inside the notebook, remember to use the inline option and call the enable_notebook() function.

In [ ]:

%matplotlib inline 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

import mpld3
from mpld3 import plugins, utils
mpld3.enable_notebook()

Here we continue to use the MPG dataset from the previous section. We can construct a matrix of scatterplots from 3 columns ("mpg", "hp", "displ") and use "yr" to encode the color.

Also, we can use the LinkedBrush plugin to get a linked selection similar to the one we had with Bokeh.

In [ ]:

fig, ax = plt.subplots(3, 3, sharex="col", sharey="row", figsize=(8, 8))
fig.subplots_adjust(left=0.05, right=0.95, bottom=0.05, top=0.95,
                    hspace=0.1, wspace=0.1)

name=["mpg","hp","displ"]
data=autompg
#data=autompg.head(100)
for i in range(3):
    for j in range(3):
        points = ax[2 - i, j].scatter(data[name[j]], data[name[i]],
                                      c=data["yr"],s=40, alpha=0.6)

# Here we connect the linked brush plugin
plugins.connect(fig, plugins.LinkedBrush(points))

mpld3.display(fig)

Scatterplot with tooltips¶

By using a D3 based browser rendering engine, our visualization supports interaction with the user. With the help of the PointLabelTooltip plugin, we can create a popup when the user mouses over a data point. Here, the labels will indicate what to show in the popup. For now we only display the data id in the popup.

For more information about the plugins in MPLD3, you can read its documentation to see how to create your own plugins.

In [ ]:

# 'axisbg' was removed from newer matplotlib releases; 'facecolor' is its replacement
fig, ax = plt.subplots(subplot_kw=dict(facecolor='#EEEEEE'))

N=len(data)
scatter = ax.scatter(data["hp"],
                     data["mpg"],
                     c=data["yr"],
                     s=data["cyl"]*10,
                     alpha=0.3,
                     cmap=plt.cm.jet)
ax.grid(color='white', linestyle='solid')

ax.set_title("Scatter Plot (with tooltips!)", size=20)

labels = ['point {0}'.format(i + 1) for i in range(N)]
#labels = [data["name"][i] for i in range(N)]
tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
mpld3.plugins.connect(fig, tooltip)

mpld3.display(fig)

TODO: Showing the data id is not very interesting. We actually have a column with the name of each car. Change the above code to show that information when you mouse over a data point.
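
PointLabelTooltip expects one label string per plotted point, in the same order as the data. As a small sketch of how such a list can be assembled from several columns, the snippet below formats one label per row in plain Python. The rows are made-up placeholders (not taken from the autompg data) and make_labels is a hypothetical helper.

```python
# Made-up columns; with the real data these would come from the DataFrame.
rows = {
    "name": ["car A", "car B", "car C"],
    "mpg": [18.0, 26.0, 31.5],
    "yr": [70, 72, 77],
}

def make_labels(columns):
    """Build one tooltip string per row from the name, mpg and yr columns."""
    return [
        "{} ({} mpg, yr {})".format(name, mpg, yr)
        for name, mpg, yr in zip(columns["name"], columns["mpg"], columns["yr"])
    ]

labels = make_labels(rows)
print(labels[0])
```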

Exploring the magic behind it¶

When calling the Python API, many things happen behind the scenes. The plot routines first write the figure out as JSON and then use that JSON for rendering. You can save this JSON mid-level representation to a local file to see what's inside.

In [ ]:

mpld3.save_json(fig,"viz.json")
mpld3.save_html(fig,"viz.html")
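
A saved JSON file can be large, so it helps to look at its outline before reading it in full. The helper below is a minimal sketch that lists the keys and value types of a parsed JSON document down to a chosen depth. summarize is a hypothetical helper, and the toy document is for illustration only; it is not mpld3's actual schema.

```python
import json

def summarize(node, depth=0, max_depth=2):
    """Return indented lines describing the keys and value types of parsed JSON."""
    lines = []
    if isinstance(node, dict) and depth < max_depth:
        for key, value in node.items():
            lines.append("{}{}: {}".format("  " * depth, key, type(value).__name__))
            lines.extend(summarize(value, depth + 1, max_depth))
    return lines

# Toy document for illustration only; it is NOT mpld3's actual schema.
toy = json.loads('{"width": 400, "axes": [1, 2], "data": {"points": [[0, 1]]}}')
summary = summarize(toy)
print("\n".join(summary))

# With the file saved above you could instead run:
# with open("viz.json") as f:
#     print("\n".join(summarize(json.load(f))))
```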

And now let's briefly explore the magic in the rendering engine, D3. D3 is a library that helps you manipulate DOM objects (especially SVG) in HTML pages. In the following code, we will manually create three circles and set their x,y location as well as their radius based on the data array [1,2,3].

In [ ]:

%%html
<div id="canvas"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
    var svg=d3.select("#canvas").append("svg").attr("width",400).attr("height",200)
    svg.selectAll("circle").data([1,2,3])
    .enter().append("circle")
    .attr("cx",function(d){return d*100})
    .attr("cy",100)
    .attr("r",function(d){return d*20})
    .attr("fill","red")
    
</script>
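
To see what this data binding amounts to, the following plain-Python sketch builds the SVG markup that corresponds to the D3 calls above: one circle element per datum, with cx and r computed from the value. It only illustrates the resulting structure; circle_markup is a hypothetical helper and this is not how D3 itself is implemented.

```python
data = [1, 2, 3]

def circle_markup(d):
    """Build the <circle> element that corresponds to datum d."""
    return '<circle cx="{}" cy="100" r="{}" fill="red"></circle>'.format(d * 100, d * 20)

svg = '<svg width="400" height="200">{}</svg>'.format(
    "".join(circle_markup(d) for d in data)
)
print(svg)
```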

This static SVG is not very exciting, but we can animate it using the powerful transition() function. This will smoothly interpolate each attribute from its initial state to its final state.

In [ ]:

%%html
<script>
    d3.selectAll("circle")
    .transition().duration(2000)
    .attr("r",function(d){return d*10})
    .attr("cx",function(d){return 50+50*d})
</script>
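
The interpolation behind such a transition can be sketched in a few lines of plain Python. For a progress value t between 0 and 1, each attribute is a blend of its start and end values; the start values come from the circles created earlier (r = d*20, cx = d*100) and the end values from the transition above. This sketch assumes a constant rate of change for simplicity, whereas D3 may also apply an easing curve to the timing. lerp and circle_state are hypothetical helpers.

```python
def lerp(start, end, t):
    """Linear interpolation between start and end for t in [0, 1]."""
    return start + (end - start) * t

def circle_state(d, t):
    """Radius and x position of the circle bound to datum d at progress t."""
    return {
        "r": lerp(d * 20, d * 10, t),
        "cx": lerp(d * 100, 50 + 50 * d, t),
    }

# Progress t = elapsed_ms / 2000 for the 2000 ms transition above.
for t in (0.0, 0.5, 1.0):
    print(t, [circle_state(d, t) for d in (1, 2, 3)])
```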

And finally, we can bind some event handlers to make it interactive. After running the following code, a circle changes color when you move the mouse over it and changes back when the mouse leaves.

In [ ]:

%%html
<script>
    d3.selectAll("circle")
    .on("mouseover",function(d){d3.select(this).attr("fill","blue")})
    .on("mouseout",function(d){d3.select(this).attr("fill","red")})
</script>

These are only some of D3's animation features. You can learn more using this nice ebook and the wiki on the D3 website. D3 provides a lot of flexibility, but it also takes a lot of effort, and the end result (SVG/JavaScript hybrid code) is often hard to read. We suggest you use high-level libraries like Bokeh and MPLD3, which already have a very rich set of glyphs and interactions.