#!/usr/bin/env python
# coding: utf-8

# # Visualization with Matplotlib and Seaborn
# 
# We previously looked at an introduction to NumPy and Pandas, which are the two core Python libraries for handling data. Here we'll look at Matplotlib and Seaborn as visualization tools.
# 
# Numpy is a core package for handling array-based data, and Pandas builds on NumPy to provide operations which work quickly on labeled data. Analogously, matplotlib is a core package which makes use of NumPy arrays for visualization, and Seaborn builds on Matplotlib to provide operations which work on Pandas data.
# 
# Thus, we'll start with a quick intro to the basics of Matplotlib before moving on to show how Seaborn makes your life easier.

# ## Matplotlib
# 
# The [matplotlib](http://matplotlib.org) library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the section.
# 
# Just as we typically use the shorthand `np` for Numpy, we will use `plt` for the `matplotlib.pyplot` module where the easy-to-use plotting functions reside (the library also contains a rich object-oriented architecture that we don't have the space to discuss here):

# In[1]:


from __future__ import print_function, division

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# Here we'll go through just the basics of using matplotlib to create visualizations in Python.

# ### The ``plot`` command:
# 
# Basic plots can be created with the ``plt.plot`` command:

# In[2]:


plt.plot(np.random.rand(100));


# #### Plotting a function: $f(x) = \sin(x)$
# 
# Above we called ``plt.plot(y)``; we can also call ``plt.plot(x, y)``:

# In[3]:


x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
plt.plot(x, y);


# #### Titles, labels, etc.
# 
# The ``plt.plot()`` function is a real workhorse: you can use it for line plots, scatter plots, plotting of multiple lines at one time, etc. By adding other ``plt`` functions, you can add other plot elements as well.
# 
# Here is how you can make a simple plot of $\sin(x)$ and $\sin(x^2)$ for $x \in [0, 2\pi]$ with labels and a grid (we use the semicolon in the last line to suppress the display of some information that is unnecessary right now):

# In[4]:


y2 = np.sin(x**2)
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();


# #### Controlling lines and markers
# 
# You can control the style, color and other properties of the markers, for example:

# In[5]:


x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y, linewidth=3, color='green');


# In[6]:


plt.plot(x, y, 'o', markersize=6, color='r');


# There is much more that can be done with the simple ``plt.plot`` function; for help, take a look at the documentation using IPython's ``?`` functionality:

# In[7]:


get_ipython().run_line_magic('pinfo', 'plt.plot')


# ### Other plot types
# 
# Other plot types can be made using other top-level matplotlib commands:
# 
# #### Errorbars
# 
# We will now see how to create a few other common plot types, such as a simple error plot:

# In[8]:


# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)

# example variable error bar values
yerr = 0.1 + 0.2*np.sqrt(x)
xerr = 0.1 + yerr

# First illustrate basic pyplot interface, using defaults where possible.
plt.figure()
plt.errorbar(x, y, xerr=0.2, yerr=0.4, fmt='o')
plt.title("Simplest errorbars, 0.2 in x, 0.4 in y");


# #### Log plots
# 
# Logarithmic scales can be used by calling ``plt.semilogx``, ``plt.semilogy``, and ``plt.loglog``:

# In[9]:


x = np.linspace(-5, 5)
y = np.exp(-x**2)
plt.semilogy(x, y);


# #### Histograms
# 
# A histogram can be created using the ``plt.hist(x, bins)`` function.
# Here we'll also add some labels and grid lines to the histogram:

# In[10]:


mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
# This will put a text fragment at the position given:
plt.text(55, .027, r'$\mu=100,\ \sigma=15$', fontsize=14)
plt.axis([40, 160, 0, 0.03])
plt.grid(True)


# #### Two-dimensional arrays & images
# 
# Two-dimensional data and images can be plotted using ``plt.imshow``:

# In[11]:


plt.imshow(np.random.rand(5,10), interpolation='nearest', cmap='Blues');


# Images can be shown in very flexible ways. For example, it can even handle RGB tuples at each point:

# In[12]:


img = plt.imread('images/stoplight.png')
img.shape


# In[13]:


plt.imshow(img)
plt.xticks([])
plt.yticks([]);


# #### Subplots
# 
# Let's create some subplots and plot the R, G, B channels in the image separately

# In[14]:


fig, ax = plt.subplots(1, 4, figsize=(10,6),
                       subplot_kw=dict(xticks=[], yticks=[]))

for i, cmap in enumerate(['Red', 'Green', 'Blue']):
    ax[i].imshow(img[:,:,i], cmap=cmap + 's_r')
    ax[i].set_title(cmap)
ax[3].imshow(img)
ax[3].set_title('RGB');


# #### Simple 3d plotting
# 
# Matplotlib is mostly designed for two-dimensional plots, but it also has some three-dimensional plotting capabilities.
# 
# For 3D plots, you must import the 3D toolkit as follows:

# In[15]:


from mpl_toolkits.mplot3d import Axes3D


# One this has been done, you can create 3d axes with the `projection='3d'` keyword to `add_subplot`:
# 
#     fig = plt.figure()
#     fig.add_subplot(...,
#                     projection='3d')
# 
# Here is a simple 3D surface plot:

# In[16]:


from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib import cm

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='rainbow',
        linewidth=0, antialiased=False)
ax.set_zlim3d(-1.01, 1.01);


# ### Finding More: The Matplotlib Gallery
# 
# There is much, much more that matplotlib can do: we've just scratched the surface here. For more info, check out the [matplotlib documentation](), and especially the [matplotlib gallery](http://matplotlib.sourceforge.net/gallery.html)
# 
# One of the most useful ways to learn how to use matplotlib is to search the gallery for a plot that you're interested in, and then load the code using IPython's ``%load`` magic. Then you can run, modify, and experiment with the code:

# In[17]:


# %load http://matplotlib.org/mpl_examples/pie_and_polar_charts/polar_scatter_demo.py


# ## Seaborn
# 
# Matplotlib is a useful tool, but it leaves much to be desired. There are several valid complaints about matplotlib that often come up:
# 
# - Matplotlib's defaults are not exactly the best choices. It was based off of MatLab circa 1999, and this shows.
# - Matplotlib is relatively low-level. Doing sophisticated statistical visualization is possible, but often requires a *lot* of boilerplate code.
# - Matplotlib is not designed for use with Pandas dataframes. In order to visualize data from a Pandas dataframe, you must extract each series and often concatenate these series' together into the right format.
# 
# The answer to these problems is [seaborn](http://stanford.edu/~mwaskom/software/seaborn/). Seaborn provides an API on top of matplotlib which uses sane plot & color defaults, uses simple functions for common statistical plot types, and which integrates with the functionality provided by Pandas dataframes.
# 
# Let's take a look at seaborn in action. We'll start by importing seaborn, which by convention is imported as ``sns``.
# 
# We can set the seaborn style as the default matplotlib style by calling ``sns.set()``: after doing this, even simple matplotlib plots will look much better.
# Let's look at a before and after:

# In[18]:


x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x), x, np.cos(x));


# In[19]:


import seaborn as sns
sns.set(color_codes=True)
plt.plot(x, np.sin(x), x, np.cos(x));


# Ah, much better!

# ### Exploring Seaborn Plots
# 
# The main idea of Seaborn is that it can create complicated plot types from Pandas data with relatively simple commands.
# 
# Let's take a look at a few of the datasets and plot types available in seaborn. Note that all o the following *could* be done using raw matplotlib commands (this is, in fact, what seaborn does under the hood) but the seaborn API is much more convenient.

# #### Histograms, KDE, and Densities
# 
# Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables.
# Matplotlib provides simple tools to make this happen:

# In[20]:


data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])

for col in 'xy':
    plt.hist(data[col], normed=True, alpha=0.5)


# Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation:

# In[21]:


for col in 'xy':
    sns.kdeplot(data[col], shade=True)


# Histograms and KDE can be combined using ``distplot``:

# In[22]:


sns.distplot(data['x']);


# If we pass the two variables to ``kdeplot``, we will get a bivariate visualization of the data:

# In[23]:


sns.kdeplot(data["x"], data["y"]);


# We can see the joint distribution and the marginal distributions together using ``sns.jointplot``.
# For this plot, we'll set the style to a white background:

# In[24]:


with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='kde');


# There are other parameters which can be passed to ``jointplot``: for example, we can use a hexagonally-based histogram instead:

# In[25]:


with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='hex')


# #### Pairplots
# 
# When you generalize joint plots to data sets of larger dimensions, you end up with *pair plots*. This is very useful for exploring correlations between multi-dimensional data, when you'd like to plot all pairs of values against each other.
# 
# We'll demo this with the well-known *iris* dataset, which lists measurements of petals and sepals of three iris species:

# In[26]:


iris = sns.load_dataset("iris")
iris.head()


# Visualizing the multi-dimensional relationships among the samples is as easy as calling ``sns.pairplot``:

# In[27]:


sns.pairplot(iris, hue='species');


# #### Faceted Histograms
# 
# Sometimes the best way to view data is via histograms of subsets. Seaborn's ``FacetGrid`` makes this extremely simple.
# We'll take a look at some data which shows the amount that restaurant staff receive in tips based on various indicator data:

# In[28]:


tips = sns.load_dataset('tips')
tips.head()


# In[29]:


tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']

grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15), color="g");


# #### Factor Plots
# 
# Factor plots can be used to visualize this data as well. This allows you to view the distribution of a parameter within bins defined by any other parameter:

# In[30]:


with sns.axes_style(style='ticks'):
    g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");


# ### Joint Distributions
# 
# Similar to the pairplot we saw above, we can use ``sns.jointplot`` to show the joint distribution between different datasets, along with the associated marginal distributions:

# In[31]:


with sns.axes_style('white'):
    sns.jointplot("total_bill", "tip", data=tips, kind='hex')


# The joint plot can even do some automatic kernel density estimation and regression:

# In[32]:


sns.jointplot("total_bill", "tip", data=tips, kind='reg');


# #### Bar Plots
# 
# Time series can be plotted using ``sns.factorplot``:

# In[33]:


planets = sns.load_dataset('planets')
planets.head()


# In[34]:


with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=2, kind="count", color="b")
    g.set_xticklabels(step=5)


# We can learn more by looking at the **method** of discovery of each of these planets:

# In[35]:


with sns.axes_style('white'):
    methods = planets["method"].value_counts().index
    g = sns.factorplot("year", col='method', col_wrap=2, data=planets, kind="count",
                       size=2, aspect=3, palette="Purples_d",
                       order=range(2001, 2015), col_order=methods)
    g.set_ylabels('Number of discoveries')


# For more information on plotting with Seaborn, see the [seaborn documentation](http://stanford.edu/~mwaskom/software/seaborn), the [seaborn gallery](http://stanford.edu/~mwaskom/software/seaborn/examples/index.html), and the official [seaborn tutorial](http://stanford.edu/~mwaskom/software/seaborn/tutorial.html).

# ## Example: Exploring New York City Marathon Data
# 
# *Download this data at https://www.dropbox.com/s/tfy7ygsih7go37j/NYCMresults_2008.csv*
# 
# *Move the file into the ``data`` directory to use it below*

# In[36]:


nyc_data = pd.read_csv('data/NYCMresults_2008.csv')
nyc_data.head()


# In[37]:


nyc_data.dtypes


# We see that Pandas assumed the first row was column labels. Also, we see that the times are of dtype "object".
# Let's fix both of these by providing a list of column names, and by providing a converter for the times:

# In[38]:


def convert_time(s):
    h, m, s = map(int, s.split(':'))
    return pd.datetools.timedelta(hours=h, minutes=m, seconds=s)

nyc_data = pd.read_csv('data/NYCMresults_2008.csv',
                       names=['first', 'last', 'age', 'gender', 'split', 'final'],
                       converters={'split':convert_time, 'final':convert_time})
nyc_data.head()


# That looks much better. For the purpose of our seaborn utilities, let's add columns which give the times in seconds:

# In[39]:


nyc_data['split_sec'] = nyc_data['split'].astype(int) / 1E9
nyc_data['final_sec'] = nyc_data['final'].astype(int) / 1E9


# In[40]:


with sns.axes_style('white'):
    g = sns.jointplot("split_sec", "final_sec", nyc_data, kind='hex', stat_func=None)
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')


# The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
# 
# Let's create another column in the data, the split fraction, which tells whether someone did a negative split or positive split:

# In[41]:


nyc_data['split_frac'] = 1 - 2 * nyc_data['split_sec'] / nyc_data['final_sec']
nyc_data.head()


# Where this split difference is less than zero, the person negative-split the race by that fraction.
# Let's do a distribution plot of this split fraction:

# In[42]:


sns.distplot(nyc_data['split_frac'], kde=False);
plt.axvline(0, color="k", linestyle="--");


# In[43]:


sum(nyc_data.split_frac < 0)


# There were 240 people who negative-split their race.
# 
# Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a `PairGrid`:

# In[44]:


g = sns.PairGrid(nyc_data,
                 x_vars=['age', 'split_sec', 'final_sec'],
                 y_vars=['split_frac'],
                 hue='gender', palette=['r', 'b'], size=4)
g.map(plt.scatter, marker='.')
g.add_legend();


# In[45]:


sns.kdeplot(nyc_data.split_frac[nyc_data.gender=='M'], label='men', color='b', shade=True)
sns.kdeplot(nyc_data.split_frac[nyc_data.gender=='W'], label='women', color='r', shade=True)
plt.xlabel('split_frac');


# The interesting thing here is that there are many more men than women who are running close to an even split!
# This almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss-out what's going on by looking at the distributions as a function of age.
# 
# A nice way to compare distributions is to use a *Violin Plot*

# In[46]:


def age_range(age_min, age_max):
    return (nyc_data.age >= age_min) & (nyc_data.age < age_max)

sns.violinplot("gender", "split_frac", data=nyc_data, palette=["b", "r"]);


# This is yet another way to view the distributions among men and women.
# 
# Let's look a little deeper, and compare these violin plots as a function of age. We'll start by creating a new column in the array which specifies the decade of age that each person is in:

# In[47]:


nyc_data['age_dec'] = nyc_data.age.map(lambda age: 10 * (age // 10))
nyc_data.head()


# In[48]:


sns.violinplot("age_dec", "split_frac", hue="gender", data=nyc_data,
               split=True, inner="quartile", palette=["b", "r"]);


# Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s-50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
# 
# Also surprisingly, the 80-year-old women seem to out-perform *everyone* in terms of their split time. I'm not sure how to explain that.
# 
# Back to the men with fast second-halfs: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use ``lmplot``, which will automatically fit a linear regression to the data:

# In[49]:


g = sns.lmplot('final_sec', 'split_frac', col='gender', data=nyc_data,
               markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");


# Apparently the people with fast splits are the elite runners who are finishing within ~15000 sec, or about 4 hours. People slower than that are much less likely to have a fast second split.
# 
# I would hypothesize that you could describe the distribution of runners with a two-component Gaussian distribution: there are the *elite* runners who are in shape and have fast splits, and there are the *casual* runners who are less in shape and tend to tire out more. When we get to *Unsupervised Machine Learning*, we'll have a chance to test this theory out.

# In[ ]: