Hierarchical clustering
^^^^^^^^^^^^^^^^^^^^^^^

When you have measurements in many categories for a set of samples, it can be helpful to visualize "clusters" in the data. As an example, we'll use a few players' stats from the 2007-2008 NBA season. The argument list of `clusteredheatmap` is very long, so we'll break it down step by step.
Let's see what this looks like by default. `clusteredheatmap` plots the heatmap and also returns the row and column dendrograms, so you can reuse the dendrogram order later. These dendrograms are created by `scipy.cluster.hierarchy.dendrogram`, so they're just `dict`s with the keys `'leaves'` (list of integer indices in the clustered order), `'ivl'` (the leaf labels in the same order; when no labels are given, these are the same indices as strings), `'color_list'` (color of each branch), `'icoord'` (coordinates of the branches along the leaf axis) and `'dcoord'` (coordinates of the branches along the distance axis, i.e. the merge heights).
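To get a feel for what these dictionaries contain, here is a minimal sketch that calls scipy directly rather than `clusteredheatmap`. The toy data is random (not the NBA stats), and `no_plot=True` just skips the drawing.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy data: 5 samples x 3 features (random values, not the NBA stats)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Average linkage on Euclidean distances between the rows
Z = linkage(X, method="average", metric="euclidean")

# no_plot=True returns the dendrogram dict without drawing anything
dendro = dendrogram(Z, no_plot=True)

print(sorted(dendro.keys()))
print(dendro["leaves"])  # integer row indices in the clustered order
print(dendro["ivl"])     # the same order, as string labels

# Reuse the dendrogram order to reorder the rows of the data
X_clustered = X[dendro["leaves"]]
```

The `'leaves'` list is usually the part you want later, since it can index straight back into the original rows.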
This looks fine, but the values of "Games" and "Minutes" are much larger than the others, so let's standardize the data across all of the measurements to make them comparable. While we're at it, let's transpose the data so the plot is wider than it is tall and fits a little better on the screen.
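As a rough sketch of that preprocessing in plain pandas, assuming "standardize" means z-scoring each statistic (subtract the column mean, divide by the column standard deviation). The player names and numbers below are made up for illustration.

```python
import pandas as pd

# Toy samples x features frame (made-up values, not the real stats)
df = pd.DataFrame(
    {
        "Games": [82, 75, 60, 48],
        "Minutes": [3000, 2500, 1800, 900],
        "Points": [28.3, 19.1, 7.4, 4.2],
    },
    index=["player_a", "player_b", "player_c", "player_d"],
)

# z-score each column so every statistic has mean 0 and unit variance
standardized = (df - df.mean()) / df.std()

# Transpose so statistics are rows and players are columns (wider than tall)
wide = standardized.T
print(wide.round(2))
```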
Great! Now we can compare players' performances across multiple statistics, and see which statistics seem anticorrelated as a group.

Saving Figures
^^^^^^^^^^^^^^

This looks good! But how do we save figures? In matplotlib, you can get the current figure instance with `plt.gcf()`.

**VERY IMPORTANT**: When saving the figure, make sure to specify `bbox_inches='tight'`, otherwise your heatmap will be cut off :( Also, `plt.tight_layout()` is not your friend here: it will fail because of the complicated figure layout.
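Here's a minimal sketch of the saving step using only matplotlib. A plain `pcolormesh` stands in for the clustered heatmap, and the output path in a temporary directory is just a placeholder.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in plot; in practice this would be the clustered heatmap
fig, ax = plt.subplots()
ax.pcolormesh(np.arange(12).reshape(3, 4))

# Grab the current figure and save it with a tight bounding box
fig = plt.gcf()
out_path = os.path.join(tempfile.mkdtemp(), "heatmap.png")
fig.savefig(out_path, bbox_inches="tight")
plt.close(fig)
```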
Tidy data
^^^^^^^^^

Next, you may want to use a tidy dataframe (as in the rest of seaborn) instead of a 2-dimensional `samples x features` dataframe. In that case, just supply `pivot_kws` describing how to pivot it into a 2D dataframe.
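To make the pivot step concrete, here is a small sketch in plain pandas. It assumes `pivot_kws` holds the keyword arguments of `DataFrame.pivot` (`index`, `columns`, `values`), which isn't spelled out above; the column names and values are made up.

```python
import pandas as pd

# Tidy (long-form) data: one row per (sample, feature) observation
tidy = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2", "s3", "s3"],
        "feature": ["f1", "f2", "f1", "f2", "f1", "f2"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Assumed shape of pivot_kws: the arguments DataFrame.pivot accepts
pivot_kws = dict(index="sample", columns="feature", values="value")

# The resulting 2D samples x features frame
data2d = tidy.pivot(**pivot_kws)
print(data2d)
```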
Titles
^^^^^^

To add a title, just pass the argument `title`. You can also adjust the font size with `title_fontsize`.
Figure and label size
^^^^^^^^^^^^^^^^^^^^^

By default, the figure is sized based on the input dataframe (`figsize = (data.shape[1]*0.5, data.shape[0]*0.5)`, with the width and height each capped at 40). You can change the figure size with the keyword `figsize`. You can also change the size of the xticklabels and yticklabels via `labelsize`. Unfortunately, at this time both must be changed together.
Log-scale
^^^^^^^^^

If your data values differ by orders of magnitude, you can both calculate the linkage on the log-transformed data and color the data on a log scale instead of a linear scale with `color_scale="log"`.
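The two halves of that idea can be sketched with scipy and matplotlib directly. This is not what `color_scale="log"` does internally (that isn't shown here), just an illustration; base 10 and the toy values are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm
from scipy.cluster.hierarchy import linkage

# Strictly positive toy data spanning several orders of magnitude
data = np.array(
    [
        [1.0, 10.0, 100.0],
        [2.0, 20.0, 500.0],
        [1000.0, 5.0, 3.0],
        [800.0, 8.0, 1.0],
    ]
)

# 1) Calculate the linkage on log-transformed values
row_linkage = linkage(np.log10(data), method="average", metric="euclidean")

# 2) Color the raw values on a log scale
fig, ax = plt.subplots()
mesh = ax.pcolormesh(data, norm=LogNorm(vmin=data.min(), vmax=data.max()))
fig.colorbar(mesh, ax=ax)
plt.close(fig)
```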
Clustering data with NAs
^^^^^^^^^^^^^^^^^^^^^^^^

You may have noticed that the above command didn't plot the dendrogram. That's because `scipy.cluster.hierarchy.linkage` can only calculate linkage on complete matrices without NAs (in this case, the NAs appear where the value was 0). You can still plot a clustered heatmap by supplying a `data_na_ok` dataframe, as long as `data` itself contains no NAs.
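Here's a rough sketch of the underlying idea in plain scipy and pandas: calculate the linkage on a complete matrix, then apply the resulting order to the frame that still has its NAs. Filling with 0 is only one possible choice and is an assumption here, as are the toy values.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage

# Frame with missing values: fine to display, but not to cluster
data_na_ok = pd.DataFrame(
    [
        [1.0, np.nan, 3.0],
        [1.1, 2.0, np.nan],
        [9.0, 8.0, 7.0],
        [np.nan, 8.5, 7.2],
    ],
    index=list("abcd"),
    columns=list("xyz"),
)

# Complete version used only for the linkage (fill value is an assumption)
data = data_na_ok.fillna(0)

# Cluster the rows of the complete matrix
Z = linkage(data.values, method="average", metric="euclidean")
order = dendrogram(Z, no_plot=True)["leaves"]

# Reorder the NA-containing frame by the clustered order for plotting
clustered = data_na_ok.iloc[order]
print(clustered)
```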
Linkage method and metric customizations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default `linkage_method` is `"average"` and the default `metric` is `"euclidean"`. Any valid `linkage_method` for `scipy.cluster.hierarchy.linkage` or `metric` for `scipy.spatial.distance.pdist` is accepted. If you have pre-calculated your own linkage matrix, check out the section "Row and column customizations" for how to specify it.
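For reference, this is roughly what those two settings mean at the scipy level. The sketch uses random toy data and picks `"complete"` and `"cityblock"` as arbitrary alternatives to the defaults.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))  # 6 observations, 4 features

# The defaults described above
Z_default = linkage(X, method="average", metric="euclidean")

# A different linkage method and distance metric
Z_alt = linkage(X, method="complete", metric="cityblock")

# Equivalent two-step version: pairwise distances first, then linkage
distances = pdist(X, metric="cityblock")
Z_two_step = linkage(distances, method="complete")

print(np.allclose(Z_alt, Z_two_step))
```

Each linkage matrix has one row per merge, so `n` observations give `n - 1` rows and 4 columns.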
Heatmap customizations
^^^^^^^^^^^^^^^^^^^^^^

We use the matplotlib function `pcolormesh` to plot the heatmap itself (it's faster for big matrices than `pcolor`, and `imshow` is better suited to images because it somewhat rasterizes them). Here you can specify the minimum value plotted via `"vmin"`, and a different colormap via `"cmap"`. This is also where you change the fontsize of the text labels.
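To show what `vmin` and `cmap` do, here is a minimal sketch that calls `pcolormesh` directly on random toy data. The particular values (`-1.0`, `"RdBu_r"`) are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(size=(4, 6))

fig, ax = plt.subplots()
# vmin clips the low end of the color scale; cmap picks the colormap
mesh = ax.pcolormesh(values, vmin=-1.0, cmap="RdBu_r")
fig.colorbar(mesh, ax=ax)
plt.close(fig)
```

Anything below `vmin` is drawn with the lowest color of the colormap, which helps when a few very negative values would otherwise dominate the scale.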
Row and column customizations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The keyword arguments within `col_kws` and `row_kws` are powerful but can be confusing. They are:

* `'linkage_matrix'`. Default `None`. If you have already calculated a linkage matrix (e.g. through simulations), then supply it here. For example, if you have the column linkage matrix, do `col_kws={'linkage_matrix': col_linkage_matrix}`.
* `'cluster'`. Default `True`. Whether or not to cluster this dimension. Maybe you have the rows in a particular order and only want to cluster the columns; then do `row_kws={'cluster': False}`.
* `'label'`. Default `True`. Whether or not to label this dimension. Otherwise, can be an iterable (list or pandas Index or Series) to relabel a dimension.
* `'side_colors'`. Default `None`, otherwise a list of colors. Whether or not you want to add a color label to that dimension.

Here's an example of a few of these working together. Some of them are the defaults but are there just for illustrative purposes on how to specify multiple keyword arguments.
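A sketch of how these dictionaries might be put together is below. The dictionary keys come from the list above; the toy data, the colors and the variable names are made up, and the actual `clusteredheatmap` call is left out because its full signature isn't shown here.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
data = pd.DataFrame(
    rng.normal(size=(5, 4)),
    index=["r%d" % i for i in range(5)],
    columns=["c%d" % i for i in range(4)],
)

# Pre-calculate the column linkage: the columns are the observations here,
# so cluster the transposed matrix
col_linkage_matrix = linkage(data.values.T, method="average", metric="euclidean")

# Columns: use the pre-calculated linkage and add one side color per column
col_kws = {
    "linkage_matrix": col_linkage_matrix,
    "side_colors": ["red", "red", "blue", "blue"],
}

# Rows: keep the given order, and keep the default labels
row_kws = {"cluster": False, "label": True}
```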
Colorbar customization
^^^^^^^^^^^^^^^^^^^^^^

You can customize the appearance of the colorbar with `colorbar_kws`, specifically:

* `'fontsize'`. Default 14. Size of the ticklabels.
* `'label'`. Default `"values"`. Label on the colorbar.
* `'orientation'`. Default `'horizontal'`. Whether the colorbar is oriented horizontally or vertically. The heatmap was designed for the colorbar to be horizontal, so vertical orientation is not recommended.
* Anything else accepted by `plt.colorbar`.
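In plain matplotlib terms, those defaults correspond to something like the sketch below. How `colorbar_kws` is applied internally isn't shown here, so treat this mapping as an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
mesh = ax.pcolormesh(np.arange(20.0).reshape(4, 5))

# Horizontal colorbar with a label and 14-point tick labels
cbar = fig.colorbar(mesh, ax=ax, orientation="horizontal")
cbar.set_label("values")
cbar.ax.tick_params(labelsize=14)
plt.close(fig)
```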
Fastcluster
^^^^^^^^^^^

Scipy's clustering module is rather slow, especially if you have more than 1000 rows or columns. If you have more than 1000, `clusteredheatmap` will attempt to import `fastcluster` (available via `pip install fastcluster`) and use the linkage function there. If you want to always use fastcluster, do `use_fastcluster=True`.
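The fallback logic described above could look roughly like this. `choose_linkage` is a hypothetical helper written for illustration, not the library's actual code, and falling back to scipy when `fastcluster` isn't installed is an assumption.

```python
import numpy as np
from scipy.cluster import hierarchy


def choose_linkage(n_observations, use_fastcluster=False, threshold=1000):
    """Return fastcluster's linkage for big inputs if available, else scipy's.

    Hypothetical helper for illustration only.
    """
    if use_fastcluster or n_observations > threshold:
        try:
            import fastcluster
            return fastcluster.linkage
        except ImportError:
            pass  # fastcluster isn't installed, so fall back to scipy
    return hierarchy.linkage


rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))

# Small input, so this picks scipy's linkage
linkage_func = choose_linkage(X.shape[0])
Z = linkage_func(X, method="average", metric="euclidean")
```

Both functions take the same `method` and `metric` arguments, so the rest of the code doesn't need to know which one was picked.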