
In [11]:

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Hierarchical clustering
^^^^^^^^^^^^^^^^^^^^^^^

When you have many measurements for a set of samples, it can be helpful to visualize "clusters" in the data. As an example, we'll use a few players' stats from the 2007-2008 NBA season. The argument list of `clusteredheatmap` is very long, so we'll break it down step by step.

In [12]:

import pandas as pd
import seaborn as sns

# data = pd.read_csv('http://datasets.flowingdata.com/ppg2008.csv', index_col=0)
data = pd.read_csv('/Users/olga/Dropbox/ipython/seaborn/ppg2008.csv.bak', index_col=0)
data.index = data.index.map(lambda x: x.strip())

# label source:https://en.wikipedia.org/wiki/Basketball_statistics
labels = ['Games', 'Minutes', 'Points', 'Field goals made',
          'Field goal attempts', 'Field goal percentage', 'Free throws made',
          'Free throws attempts', 'Free throws percentage',
          'Three-pointers made', 'Three-point attempt',
          'Three-point percentage', 'Offensive rebounds', 'Defensive rebounds',
          'Total rebounds', 'Assists', 'Steals', 'Blocks', 'Turnover',
          'Personal foul']
data.columns = labels

Let's see what this looks like by default. `clusteredheatmap` plots the heatmap and also returns the row and column dendrograms, so you can use the dendrogram order later. These are dendrograms created by `scipy.cluster.hierarchy.dendrogram`, so they're just `dict`s with the keys `'leaves'` (list of integer indices in the clustered order), `'ivl'` (the leaf labels in the same order; by default these are the same as `'leaves'`, but as strings instead of integers), `'color_list'` (color of each branch), `'icoord'` (coordinates of the branches, based on the indices), and `'dcoord'` (coordinates of the branches, based on the depth).

In [13]:

row_dendrogram, col_dendrogram = sns.clusteredheatmap(data)
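
To make the structure of these dictionaries concrete, here is a minimal sketch that calls SciPy directly on random toy data (not the NBA table). The variable names `toy`, `tree` and `reordered` are made up for illustration, and `no_plot=True` is used so nothing is drawn.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 3))   # 5 samples, 3 measurements (random toy data)

# no_plot=True returns the dictionary without drawing anything
tree = dendrogram(linkage(toy, method='average'), no_plot=True)

print(sorted(tree.keys()))
print(tree['leaves'])   # integer positions of the samples in clustered order
print(tree['ivl'])      # the same order, as string labels

# Reorder the original rows to match the dendrogram
reordered = toy[tree['leaves']]
```

The `'leaves'` list is usually the entry you want when reusing the clustered order elsewhere, for example to reorder another table the same way.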

The default heatmap looks fine, but the values of "Games" and "Minutes" are so much larger than the others that they dominate the color scale. Let's standardize each stat so they're all comparable. While we're at it, let's transpose the data so the plot is wider than it is tall and fits a little better on the screen.

In [14]:

data_normalized = data

# Standardize the mean and variance within a stat, so different stats are comparable
# (This is the same as changing all the columns to Z-scores, so divide by the
# standard deviation, not the variance)
data_normalized = (data_normalized - data_normalized.mean())/data_normalized.std()

# Rescale each stat so its range (max - min) is 1; every value then lies between -1 and 1
data_normalized = (data_normalized)/(data_normalized.max() - data_normalized.min())

data_normalized = data_normalized.T

# Can use a semicolon after the command to suppress output of the row_dendrogram and col_dendrogram.
sns.clusteredheatmap(data_normalized);
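
The two scaling steps above can be checked on a toy table. This is a minimal sketch with made-up numbers (the `toy` frame is not the NBA data). It shows that after the second step each column spans exactly 1, and that whatever per-column divisor is used in the first step cancels out, so the result equals `(x - mean) / (max - min)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stats table (made-up numbers, not the NBA data)
toy = pd.DataFrame({'Games': [79.0, 81.0, 60.0, 72.0],
                    'Points': [30.2, 28.4, 19.1, 22.5]},
                   index=['A', 'B', 'C', 'D'])

# Step 1: Z-score each column (pandas' std() uses ddof=1 by default)
z = (toy - toy.mean()) / toy.std()

# Step 2: divide by each column's range
scaled = z / (z.max() - z.min())

# Equivalent one-step form: the standard deviation cancels out
direct = (toy - toy.mean()) / (toy.max() - toy.min())

column_ranges = scaled.max() - scaled.min()
print(column_ranges)
print(np.allclose(scaled, direct))
```

Because the columns are centered on their means rather than on their midpoints, the values generally do not reach -1 or 1 exactly; they only stay inside that interval.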

Great! Now we can compare players' performances across multiple statistics, and see which statistics seem anticorrelated as a group.

Saving Figures
^^^^^^^^^^^^^^

This looks good! But how do we save figures? In matplotlib, you can get the current figure instance with `plt.gcf()`.

**VERY IMPORTANT**: When saving the figure, make sure to specify `bbox_inches='tight'`, otherwise your heatmap will be cut off :( Also, `plt.tight_layout()` is not your friend here; it will fail because of the complicated figure layout.

In [15]:

import matplotlib.pyplot as plt
sns.clusteredheatmap(data_normalized);
fig = plt.gcf()
fig.savefig('clusteredheatmap_bbox_tight.png', bbox_inches='tight')

Tidy data
^^^^^^^^^

Next, you may want to use a `tidy` dataframe (as in the rest of seaborn) instead of a 2-dimensional `samples x features` type dataframe. In that case, just supply `pivot_kws` to say how the tidy dataframe should be pivoted into a 2D dataframe.

In [16]:

tidy_df = pd.melt(data_normalized.reset_index(), id_vars='index')
tidy_df.head()

Out[16]:

	index	variable	value
0	Games	Dwyane Wade	0.143158
1	Minutes	Dwyane Wade	0.233535
2	Points	Dwyane Wade	0.718308
3	Field goals made	Dwyane Wade	0.595714
4	Field goal attempts	Dwyane Wade	0.561296

5 rows × 3 columns

In [17]:

pivot_kws = dict(index='index', columns='variable', values='value')
tidy_df.pivot(**pivot_kws).head()

Out[17]:

variable	Al Harrington	Al Jefferson	Allen Iverson	Amare Stoudemire	Andre Iguodala	Antawn Jamison	Ben Gordon	Brandon Roy	Carmelo Anthony	Caron Butler	Chauncey Billups	Chris Bosh	Chris Paul	Corey Maggette	Danny Granger	David West	Deron Williams	Devin Harris	Dirk Nowitzki	Dwight Howard
index
Assists	-0.256042	-0.235208	0.118958	-0.193542	0.150208	-0.203958	-0.047708	0.129375	-0.047708	0.046042	0.264792	-0.141458	0.743958	-0.214375	-0.120625	-0.162292	0.712708	0.316875	-0.151875	-0.256042	...
Blocks	-0.106429	0.393571	-0.177857	0.179286	-0.070714	-0.106429	-0.106429	-0.106429	-0.070714	-0.106429	-0.142143	0.143571	-0.177857	-0.142143	0.286429	0.107857	-0.106429	-0.142143	0.072143	0.822143	...
Defensive rebounds	0.050130	0.387792	-0.261558	0.180000	0.011169	0.257922	-0.222597	-0.144675	0.089091	-0.014805	-0.248571	0.348831	0.024156	0.011169	-0.014805	0.244935	-0.261558	-0.209610	0.361818	0.660519	...
Field goal attempts	0.061296	0.329815	-0.123889	-0.170185	-0.179444	0.172407	0.005741	0.089074	0.218704	0.024259	-0.327593	0.042778	0.015000	-0.327593	0.292778	0.098333	-0.133148	-0.077593	0.376111	-0.327593	...
Field goal percentage	-0.155276	0.136181	-0.265829	0.347236	0.015578	-0.009548	-0.074874	0.050754	-0.135176	-0.084925	-0.260804	0.085930	0.166332	-0.044724	-0.115075	0.010553	0.005528	-0.160302	0.045729	0.513065	...

5 rows × 50 columns

In [18]:

sns.clusteredheatmap(tidy_df, pivot_kws=pivot_kws);
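
The round trip between the wide and tidy layouts can be seen on a tiny table. This is a minimal sketch using only pandas, with made-up values and hypothetical names (`wide`, `tidy`, `restored`); it reuses the same `pivot_kws` keys as above.

```python
import pandas as pd

# Small wide table: rows are stats, columns are players (made-up values)
wide = pd.DataFrame({'Player A': [0.1, -0.2], 'Player B': [0.4, 0.3]},
                    index=['Assists', 'Blocks'])

# Wide -> tidy: one row per (stat, player) pair
tidy = pd.melt(wide.reset_index(), id_vars='index')

# Tidy -> wide again, using the same keywords as pivot_kws above
pivot_kws = dict(index='index', columns='variable', values='value')
restored = tidy.pivot(**pivot_kws)

print(tidy)
print(restored)
```

Note that `pivot` sorts the row and column labels, so the restored table matches the original here only because the toy labels were already in sorted order.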

Titles
^^^^^^

To add a title, just add the argument `title`. You can also adjust the font size with `title_fontsize`.

In [19]:

sns.clusteredheatmap(data_normalized, title='2008 NBA Stats', title_fontsize=24);

Figure and label size
^^^^^^^^^^^^^^^^^^^^^

By default, the figure is sized based on the input dataframe (`figsize = (data.shape[0]*0.5, data.shape[0]*0.5)`, capped at a maximum height and width of 40). You can change the figure size with the keyword `figsize`. You can also change the size of the xticklabels and yticklabels via `labelsize`. Unfortunately, at this time you must change both of them together.

In [20]:

sns.clusteredheatmap(data_normalized, figsize=(10, 5), labelsize=10);

Log-scale
^^^^^^^^^

If your data values tend to differ by orders of magnitude, you can both calculate the linkage on the log-transformed data and color the data on a log scale instead of a linear scale with `color_scale="log"`.

In [21]:

sns.clusteredheatmap(data, color_scale='log');

Clustering data with NAs
^^^^^^^^^^^^^^^^^^^^^^^^

You may have noticed that the above command didn't plot the dendrogram. That's because `scipy.cluster.hierarchy.linkage` can only calculate the linkage of complete matrices, without NAs (in this case, the NAs appear wherever the value was 0, since the log of 0 is undefined). You can still plot a clustered heatmap by supplying the original values as a `data_na_ok` dataframe and making sure that `data` itself has no values that could turn into NAs.

In [22]:

sns.clusteredheatmap(data.replace(0, 0.0001), color_scale='log', data_na_ok=data);
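
The reason for the zero replacement can be reproduced with NumPy and SciPy alone. This is a minimal sketch on made-up counts (the names `counts`, `logged` and `patched` are hypothetical); it assumes nothing about how `clusteredheatmap` handles the log transform internally. Whether `linkage` raises on non-finite input depends on the SciPy version, so the sketch records the outcome instead of assuming it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up counts containing zeros (not the NBA data)
counts = np.array([[10.0, 0.0, 100.0],
                   [1.0, 5.0, 1000.0],
                   [0.0, 50.0, 10.0],
                   [20.0, 2.0, 1.0]])

# log10(0) is -inf, so the log-transformed matrix is not finite everywhere
with np.errstate(divide='ignore'):
    logged = np.log10(counts)
has_non_finite = not np.isfinite(logged).all()

# Recent SciPy versions reject such input; record what happens here
try:
    linkage(logged, method='average', metric='euclidean')
    linkage_failed = False
except ValueError:
    linkage_failed = True

# Replacing zeros with a small positive number keeps every log finite
patched = np.where(counts == 0, 0.0001, counts)
row_linkage = linkage(np.log10(patched), method='average', metric='euclidean')
print(has_non_finite, linkage_failed, row_linkage.shape)
```

The replacement value matters: a very small number becomes a very negative log, which can dominate the distances, so pick something sensible for the scale of your data.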

Linkage method and metric customizations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default `linkage_method` is `"average"` and the default `metric` is `"euclidean"`. Any `linkage_method` that is valid for `scipy.cluster.hierarchy.linkage`, or any `metric` that is valid for `scipy.spatial.distance.pdist`, is accepted. If you have pre-calculated your own linkage matrix, check out the section "Row and column customizations" for how to specify it.

In [23]:

# Wacky example of custom linkage method and metric
sns.clusteredheatmap(data_normalized, linkage_method='complete', metric='hamming');
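
To see how the method and the metric fit together in SciPy, here is a minimal sketch on random toy data. It only uses `scipy.cluster.hierarchy.linkage` and `scipy.spatial.distance.pdist`; the `'cityblock'` metric is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
samples = rng.normal(size=(6, 4))   # 6 observations, 4 features (random toy data)

# Let linkage compute the distances itself...
direct = linkage(samples, method='complete', metric='cityblock')

# ...or compute the condensed distance matrix first and pass it in
distances = pdist(samples, metric='cityblock')
two_step = linkage(distances, method='complete')

print(direct.shape)
print(np.allclose(direct, two_step))
```

The metric decides how far apart two observations are, and the linkage method decides how distances between whole clusters are derived from those pairwise distances.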

Heatmap customizations
^^^^^^^^^^^^^^^^^^^^^^

We use the matplotlib function `pcolormesh` to plot the heatmap itself (it's faster for big matrices than `pcolor`, and `imshow` is better for images because it somewhat rasterizes them). Here you can specify the minimum value plotted via `"vmin"`, and a different colormap via `"cmap"`. This is also where you change the fontsize of the text labels.

In [24]:

import matplotlib as mpl
sns.clusteredheatmap(data_normalized, pcolormesh_kws={'linewidth': 0.1, 'vmin': 0, 'cmap': mpl.cm.Greens});
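
The effect of these keywords can be tried on `pcolormesh` directly. This is a minimal sketch on a random toy matrix, assuming matplotlib is installed; the `Agg` backend is selected only so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(-1, 1, size=(5, 8))   # random toy matrix

fig, ax = plt.subplots()
# Same keywords as in pcolormesh_kws above; values below vmin share the lightest color
mesh = ax.pcolormesh(values, linewidth=0.1, vmin=0, cmap='Greens')
fig.colorbar(mesh, ax=ax, orientation='horizontal')
color_limits = mesh.get_clim()
plt.close(fig)
```

Setting `vmin=0` on data that contains negative values hides everything below zero, which is useful when only the positive side is of interest.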

Row and column customizations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The keyword arguments within `col_kws` and `row_kws` are powerful but can be confusing. They are:

* `linkage_matrix`. Default `None`. If you have already calculated a linkage matrix (e.g. through simulations), then supply it here. For example, if you have the column linkage matrix, do `col_kws={'linkage_matrix': col_linkage_matrix}`.
* `cluster`. Default `True`. Whether or not to cluster this dimension. Maybe you have the rows in a particular order and only want to cluster the columns; then do `row_kws={'cluster': False}`.
* `label`. Default `True`. Whether or not to label this dimension. It can also be an iterable (list or pandas Index or Series) to relabel a dimension.
* `side_colors`. Default `None`, otherwise a list of colors. Whether or not you want to add a color label to that dimension.

Here's an example of a few of these working together. Some of them are the defaults, but they are included to illustrate how to specify multiple keyword arguments.

In [25]:

colors = sns.color_palette('Set2', n_colors=6)

def stat_to_label(stat):
    if stat in ('Games', 'Minutes', 'Points'):
        return colors[0]
    if stat.startswith('Field'):
        return colors[1]
    if stat.startswith('Free'):
        return colors[2]
    if stat.startswith('Three'):
        return colors[3]
    if 'rebounds' in stat:
        return colors[4]
    else:
        return colors[5]
    
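# Build a plain list of colors (Index.map would turn the RGB tuples into a MultiIndex)
stat_colors = [stat_to_label(stat) for stat in data_normalized.index]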

row_labels = data_normalized.index.map(lambda x: 'relabel {}'.format(x))
sns.clusteredheatmap(data_normalized, row_kws={'side_colors': stat_colors, 'label': row_labels},
                     col_kws={'cluster': False});
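
For the `linkage_matrix` option in the list above, a linkage matrix can be pre-calculated with SciPy. This is a minimal sketch on a random toy table (the names `table`, `col_linkage_matrix` and `row_linkage_matrix` are made up); the method and metric are chosen to match the defaults stated earlier.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
# Random toy table: 4 stats (rows) x 6 players (columns)
table = pd.DataFrame(rng.normal(size=(4, 6)),
                     index=['stat{}'.format(i) for i in range(4)],
                     columns=['player{}'.format(j) for j in range(6)])

# linkage clusters the rows of its input, so transpose to cluster the columns
col_linkage_matrix = linkage(table.T.values, method='average', metric='euclidean')
row_linkage_matrix = linkage(table.values, method='average', metric='euclidean')

print(col_linkage_matrix.shape)
print(row_linkage_matrix.shape)

# Passed on as described in the list above, e.g.
# col_kws={'linkage_matrix': col_linkage_matrix}
```

A linkage matrix for `n` items always has `n - 1` rows and 4 columns, which is a quick sanity check that rows and columns have not been mixed up.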

Colorbar customization
^^^^^^^^^^^^^^^^^^^^^^

You can customize the appearance of the colorbar with `colorbar_kws`, specifically:

* `'fontsize'`. Default 14. Size of the ticklabels.
* `'label'`. Default `"values"`. Label on the colorbar.
* `'orientation'`. Default `'horizontal'`. Whether the colorbar is oriented horizontally or vertically. The heatmap was designed for the colorbar to be horizontal, so vertical orientation is not recommended.
* Anything else accepted by `plt.colorbar`.

In [26]:

sns.clusteredheatmap(data_normalized, colorbar_kws={'orientation': 'vertical', 'label': 'Normalized stat', 'fontsize': 10});

Fastcluster
^^^^^^^^^^^

Scipy's clustering module is rather slow, especially if you have more than 1000 rows or columns. If you have more than 1000, `clusteredheatmap` will attempt to import `fastcluster` (available via `pip install fastcluster`) and use the linkage function there. If you want to always use fastcluster, pass `use_fastcluster=True`.
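
The fallback behaviour described here can be sketched as follows. This is an illustration only, not the actual implementation: `compute_linkage` and `size_threshold` are hypothetical names, and the sketch assumes that `fastcluster.linkage` accepts the same `method` and `metric` arguments as SciPy's `linkage`.

```python
import numpy as np
from scipy.cluster import hierarchy

def compute_linkage(values, method='average', metric='euclidean',
                    use_fastcluster=False, size_threshold=1000):
    """Hypothetical helper: prefer fastcluster for big inputs, else use SciPy."""
    if use_fastcluster or values.shape[0] > size_threshold:
        try:
            import fastcluster
            return fastcluster.linkage(values, method=method, metric=metric)
        except ImportError:
            pass  # fastcluster is not installed; fall back to SciPy below
    return hierarchy.linkage(values, method=method, metric=metric)

rng = np.random.default_rng(2)
small = rng.normal(size=(10, 3))   # random toy data
result = compute_linkage(small)
forced = compute_linkage(small, use_fastcluster=True)
print(result.shape, forced.shape)
```

Importing `fastcluster` lazily inside the function keeps it an optional dependency, so the code still runs when the package is missing.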