In two previous posts (here and here), I showed how to do occupancy analysis with the beginnings of a Python based version of Hillmaker. The example is based on data from a hospital short stay unit.
In this short tutorial, I'll show how to use rmagic
from within an IPython notebook so that we can make occupancy plots using ggplot2
. In particular, we want to create a grid of occupancy histograms with one grid axis being patient type and the other axis being day of week. We want to make sure that the plots are ordered Sunday, Monday, ..., Saturday and NOT in alphabetical order (Friday, Monday, ..., Wednesday).
The first part of such an analysis leads to a table we call the "by datetime" table. At the end of Part 1 of this tutorial series, we ended up with a csv file called bydate_shortstay_csv.csv. Let's read it in and take a look at it.
You can find the data and the .ipynb
file in my hselab-tutorials github repo. Clone or download a zip.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)
line_width has been deprecated, use display.width instead (currently both are identical)
## Read sample data set and convert string dates to datetimes
bydatetime_df = pd.read_csv('data/bydate_shortstay_csv.csv',parse_dates=['datetime'])
bydatetime_df.head()
category | datetime | arrivals | binofday | binofweek | dayofweek | departures | occupancy | |
---|---|---|---|---|---|---|---|---|
0 | IVT | 1996-01-07 00:00:00 | 0 | 0 | 288 | 6 | 0 | 0.0 |
1 | IVT | 1996-01-07 00:30:00 | 0 | 1 | 289 | 6 | 1 | 0.5 |
2 | IVT | 1996-01-07 01:00:00 | 0 | 2 | 290 | 6 | 0 | 0.0 |
3 | IVT | 1996-01-07 01:30:00 | 0 | 3 | 291 | 6 | 0 | 0.0 |
4 | IVT | 1996-01-07 02:00:00 | 0 | 4 | 292 | 6 | 0 | 0.0 |
5 rows × 8 columns
Let's say that we are interested in average occupancy by day of week (ignoring time of day) by patient type (category). Since each day consists of the same number of equally sized time bins, we can use the Pandas groupby
function to easily compute mean occupancy by day of week.
meanocc_df = bydatetime_df.groupby(['category','dayofweek'])['occupancy'].mean()
meanocc_df
category dayofweek ART 0 1.841435 1 1.700810 2 1.770544 3 1.868866 4 2.144907 5 0.000000 6 0.000000 CAT 0 2.562963 1 2.055556 2 2.181134 3 2.136053 4 2.540683 5 0.620833 6 0.278312 IVT 0 8.670891 1 7.403183 2 7.350810 3 7.557870 4 9.250231 5 0.969792 6 0.283494 MYE 0 2.056250 1 1.873669 2 1.841030 3 1.826505 4 2.022743 5 0.147280 6 0.091293 OTH 0 1.455093 1 1.174363 2 1.124248 3 1.147454 4 1.270544 5 0.000000 6 0.000000 Total 0 16.586632 1 14.207581 2 14.267766 3 14.536748 4 17.229109 5 1.737905 6 0.653098 Name: occupancy, dtype: float64
Even better than just the means would be occupancy histograms by patient type and day of week. That will require a little more work. On the way we'll see how to access individual levels of a hierarchical index, how to push data into an R workspace and use ggplot2 to make nice plots.
Let's start by computing average occupancy by category by date (i.e. roll up over the time bins).
meanoccdate = bydatetime_df.groupby(['category',bydatetime_df['datetime'].map(lambda x: x.date())])['occupancy'].mean()
meanoccdate.head()
category datetime ART 1996-01-07 0.000000 1996-01-08 1.732639 1996-01-09 1.532639 1996-01-10 1.541667 1996-01-11 1.872222 Name: occupancy, dtype: float64
Now we can use this to create a grid of histograms of occupancy by category and day of week. To make our life a little easier, let's add a string day of week column to this new dataframe (yes, the output of the groupby
is a DataFrame
or Series
). This will require us to access the datetime level of the hierarchical index of our Series
object, meanoccdate. What does the index look like for one specific row in meanoccdate?
meanoccdate.index[0]
('ART', datetime.date(1996, 1, 7))
It's a tuple with the index values for each level of the hierarchy. We want the 2nd element in the tuple.
meanoccdate.index[0][1]
datetime.date(1996, 1, 7)
Ok, we've got that part. Now, how to convert a Timestamp
(equivalently, a datetime
), to its day of week label? Reading the datetime docs leads to the strftime
function for creating formatted string representations of datetimes.
ts = pd.Timestamp('6/26/2014')
ts.strftime('%a')
'Thu'
Great. Now just put it all together.
meanoccdate_df = pd.DataFrame(meanoccdate)
meanoccdate_df['dayofweek'] = meanoccdate_df.index.map(lambda x: pd.Timestamp(x[1]).strftime('%a'))
meanoccdate_df.head()
occupancy | dayofweek | ||
---|---|---|---|
category | datetime | ||
ART | 1996-01-07 | 0.000000 | Sun |
1996-01-08 | 1.732639 | Mon | |
1996-01-09 | 1.532639 | Tue | |
1996-01-10 | 1.541667 | Wed | |
1996-01-11 | 1.872222 | Thu |
5 rows × 2 columns
meanoccdate_df.tail()
occupancy | dayofweek | ||
---|---|---|---|
category | datetime | ||
Total | 1996-03-27 | 15.801389 | Wed |
1996-03-28 | 13.538194 | Thu | |
1996-03-29 | 18.146528 | Fri | |
1996-03-30 | 1.436111 | Sat | |
1996-03-31 | 0.227083 | Sun |
5 rows × 2 columns
meanoccdate_df['occupancy']['Total']
datetime 1996-01-07 0.871528 1996-01-08 17.759028 1996-01-09 15.146528 1996-01-10 14.095139 1996-01-11 16.148611 1996-01-12 16.236111 1996-01-13 1.586806 1996-01-14 0.725694 1996-01-15 18.181250 1996-01-16 16.000000 1996-01-17 14.304861 1996-01-18 16.816667 1996-01-19 16.011111 1996-01-20 2.631944 1996-01-21 0.682639 ... 1996-03-17 0.753472 1996-03-18 17.765278 1996-03-19 11.013194 1996-03-20 12.238194 1996-03-21 13.309722 1996-03-22 15.611806 1996-03-23 0.915972 1996-03-24 0.570139 1996-03-25 15.807639 1996-03-26 14.565972 1996-03-27 15.801389 1996-03-28 13.538194 1996-03-29 18.146528 1996-03-30 1.436111 1996-03-31 0.227083 Name: occupancy, Length: 85, dtype: float64
The rmagic extension allows us to magically use R from within IPython. Well, it's not really magic. It relies on the RPy2 Python module.
After loading the rmagic
extension, you can use %R
for a single R line magic command or %%R
for cell magic to handle a sequence of R commands.
# Load the extension
%load_ext rmagic
%matplotlib inline
# Create some data in Python and scatter it
import pylab
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
pylab.scatter(X, Y)
<matplotlib.collections.PathCollection at 0xba87cac>
/home/pcba/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1236: UserWarning: findfont: Font family ['monospace'] not found. Falling back to Bitstream Vera Sans (prop.get_family(), self.defaultFamily[fontext]))
# "Push" these two numpy arrays into the R "space"
%Rpush X Y
%%R
linmodel <- lm(Y~X)
print(summary(linmodel))
Call: lm(formula = Y ~ X) Residuals: 1 2 3 4 5 -0.2 0.9 -1.0 0.1 0.2 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.2000 0.6164 5.191 0.0139 * X 0.9000 0.2517 3.576 0.0374 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7958 on 3 degrees of freedom Multiple R-squared: 0.81, Adjusted R-squared: 0.7467 F-statistic: 12.79 on 1 and 3 DF, p-value: 0.03739
So, pretty easy to run R commands from within IPython notebooks. Now let's exploit rmagic to do some ggplotting.
A popular plotting module for Python is matplotlib. The project homepage has many links to resources for learning it, with a very good place to start being the official documentation and the gallery of graphs. The are a few modes of using matplotlib. There is a pyplot mode which is particularly well suited for interactive plotting in a Python shell like IPython (much like one would work in Mathematica or MATLAB). You can also use matplotlib within Python scripts either with the pyplot commands or via an objected oriented API (similar to ggplot2 for plotting in R).
Here is a basic histogram for overall occupancy (there's a 'Total' cateogry in our data frame). For a more API based approach, see this example from the matplotlib page as well as the next version of our histogram below.
# normed=1 plots probs instead of counts, alpha in [0,1] is transparency level (RGBA colors)
plt.hist(meanoccdate_df['occupancy']['Total'], 20, normed=1, facecolor='green', alpha=0.75)
plt.xlabel('Occupancy')
plt.ylabel('Probability')
plt.title(r'Histogram of Short Stay Occupancy')
plt.grid(True)
plt.show()
Clearly there's a day of the week effect - the lower mode is due to the weekends.
Now, let's create the same plot, but using R's ggplot2
package. Our basic strategy is:
%Rpush
command to "push" the meanoccdate_df DataFrame
into the R workspace.%R
line magic to load the ggplot2
library%%R
cell magic to create and show the plotA big question is, what happens to the hiearchical index in the pandas DataFrame
when pushed to R?
%Rpush meanoccdate_df
%R str(meanoccdate_df)
'data.frame': 510 obs. of 2 variables: $ occupancy: num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ... $ dayofweek: Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ...
Bummer, the category (patient type) and date fields that made up the pandas MultiIndex
did not come through. No problem, we'll just create columns from the index before pushing the data frame to R.
meanoccdate_df['patient_type'] = meanoccdate_df.index.map(lambda x: x[0])
meanoccdate_df['date'] = meanoccdate_df.index.map(lambda x: x[1])
%Rpush meanoccdate_df
%R str(meanoccdate_df)
'data.frame': 510 obs. of 4 variables: $ occupancy : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ... $ dayofweek : Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ... $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ... $ date : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...
%R library(ggplot2)
array(['ggplot2', 'tools', 'stats', 'graphics', 'grDevices', 'utils', 'datasets', 'methods', 'base'], dtype='|S9')
%%R
g <- ggplot(data=meanoccdate_df[meanoccdate_df$patient_type == 'Total',]) + geom_histogram(aes(x=occupancy, y=..density..), fill="#FF9999", colour="black")
print(g)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Now let's do a facet plot by patient_type and day of week.
%%R
g2 <- ggplot(data=meanoccdate_df) + geom_histogram(aes(x=occupancy, y=..density..), binwidth=2, fill="#FF9999", colour="black")
print(g2 + facet_grid(patient_type ~ dayofweek))
Notice that the days of the week are not sorted correctly. Why? Recall the structure of our R data.frame
.
%R str(meanoccdate_df)
'data.frame': 510 obs. of 4 variables: $ occupancy : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ... $ dayofweek : Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ... $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ... $ date : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...
The dayofweek column is a Factor. We want it to be an Ordered Factor and we want to tell R the order to use.
%%R
# Create vector with DOWs ordered as you wish
DOW_order <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
# Change DOW from factor to ordered factor using vector you just made
meanoccdate_df$dayofweek <- factor(meanoccdate_df$dayofweek,levels=DOW_order,ordered=TRUE)
%R str(meanoccdate_df)
'data.frame': 510 obs. of 4 variables: $ occupancy : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ... $ dayofweek : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 2 3 4 5 6 7 1 2 3 ... $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ... $ date : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...
%%R
g2 <- ggplot(data=meanoccdate_df) + geom_histogram(aes(x=occupancy, y=..density..), binwidth=2, fill="#FF9999", colour="black")
print(g2 + facet_grid(patient_type ~ dayofweek))
And now we've got the DOW ordering we really wanted. This has been a quick tour of plotting occupancy data using ggplot2 from within an IPython notebook. Future installments of this series of occupancy analysis will focus on more occupancy visualizations using some combination of R and Python.