ggplot is a Python port of ggplot2, the popular R implementation by Hadley Wickham of many of the ideas proposed in Leland Wilkinson's The Grammar of Graphics. As such, it attempts to enforce good practice in the generation of charts from appropriately shaped datasets.
The "default" graphics library for use with pandas is arguably the matplotlib 2D Python plotting library, but whilst matplotlib offers a wide variety of powerful charting capabilities, it is often less concise and more convoluted than ggplot. It doesn't produce charts that are quite as pretty (or professional looking) as ggplot does out of the box either!
ggplot also offers a cleaner separation of data and graphical transformations, in accord with Wilkinson's original model. The full original ggplot2 implementation was also developed with the production of statistical graphics in mind. That is, the library provided a range of statistical transformations that could be applied to a dataset as part of the graphic generation process.
The full ggplot documentation can be found here: ggplot documentation.
The Python statsmodels library is one of the more widely used statistical computing libraries for Python, providing a range of powerful chart types as well as employing pandas-based data structures for representing datasets. But whilst statsmodels does support the generation of powerful statistical charts, it is rather lacking in support for simpler chart types.
So on these grounds of what we might term, at worst, principled expediency, we will tend to focus on the use of ggplot, although you are free to explore other charting libraries yourself. If you particularly want to make use of interactive JavaScript-style charts, however, Vincent could be a good choice. If you prefer using matplotlib, that's fine too. However, we do expect that you also gain an understanding of how to make use of graphics libraries based on the ideas of The Grammar of Graphics.
import pandas as pd
The data we will be using in this notebook was released under a Freedom of Information request to the Isle of Wight Council and describes the revenue taken by two ticket machines in a particular pay and display car park over a twelve month period. (You can see the original FOI request here: Pay and display ticket machine logs.)
The data is supplied in a set of Excel spreadsheets. We will open just one for now.
!ls data
CarParks.kml anscombesQuartet_longish.tsv anscombesQuartet.csv iw_parkingMeterData anscombesQuartet_hier.csv pay_and_display_ticket_machine_l.zip anscombesQuartet_long.tsv tmpfile.csv
! unzip data/pay_and_display_ticket_machine_l.zip -P
Archive: data/pay_and_display_ticket_machine_l.zip caution: filename not matched: -P
df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls")
df[:10]
Transaction Report | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | 2014-03-17 15:25:47.526000 | ||||
---|---|---|---|---|---|---|---|---|---|
NaN | NaN | NaN | NaN | 01/12/2012 00:00 - 31/03/2013 00:00 | NaN | NaN | NaN | NaN | NaN |
NaN | Machines : YARR01, YARR02 | NaN | NaN | NaN | NaN | NaN | |||
NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
NaN | Tariffs : ALL | NaN | NaN | NaN | NaN | NaN | |||
NaN | NaN | NaN | NaN | NaN | NaN | NaN | |||
Date | Machine | Description | NaN | NaN | Tariff | Description | NaN | Cash | NaN |
2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | NaN | NaN | 01F | LS with Cch £6.60 6>24hrs | NaN | 6.6 | NaN |
2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | NaN | NaN | 01F | LS with Cch £6.60 6>24hrs | NaN | 6.6 | NaN |
2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | NaN | NaN | 01F | LS with Cch £6.60 6>24hrs | NaN | 6.6 | NaN |
2012-12-01 08:27:01 | YARR01 | River Road 1 Yarmouth | NaN | NaN | 01F | LS with Cch £6.60 6>24hrs | NaN | 6.6 | NaN |
By inspection, we see that there are six rows before the header row. Let's try loading the data in by skipping those rows.
df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls", \
skiprows=6)
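As a minimal sketch of the same idea on data you can run without the spreadsheet, the following uses pd.read_csv() on an in-memory string as a stand-in for the Excel file. The sample text and the number of preamble lines are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a report that has preamble lines before the real header.
raw = io.StringIO(
    "Transaction Report,,\n"
    "Machines : YARR01,,\n"
    ",,\n"
    "Date,Machine,Cash\n"
    "2012-12-01 06:38:53,YARR01,6.6\n"
    "2012-12-01 07:26:12,YARR01,6.6\n"
)

# Skip the three preamble lines so that the fourth line is used as the header row.
sample = pd.read_csv(raw, skiprows=3)
print(sample.columns.tolist())
```

Counting the preamble rows by eye, as we did above, is usually the quickest way to find the right value for skiprows.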
There appear to be some columns that just contain NaN values - let's tidy the dataset a little by dropping those columns.
df.dropna(how='all',axis=1,inplace=True)
df[:10]
Date | Machine | Description | Tariff | Description.1 | Cash | |
---|---|---|---|---|---|---|
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
2 | 2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
3 | 2012-12-01 08:27:01 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
4 | 2012-12-01 08:34:11 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
5 | 2012-12-01 08:37:35 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
6 | 2012-12-01 08:38:24 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
7 | 2012-12-01 08:39:06 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
8 | 2012-12-01 08:45:51 | YARR01 | River Road 1 Yarmouth | 01B | LS with Cch £1.00 30m>1hr | 1.0 |
9 | 2012-12-01 08:47:50 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 |
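To see what the how='all' setting does, here is a small sketch using a made-up dataframe (the column names are hypothetical). Only the column in which every value is missing is dropped; a column with some missing values is kept.

```python
import pandas as pd

nan = float("nan")

# Hypothetical frame: "Empty" is entirely missing, "Cash" is only partly missing.
sample = pd.DataFrame({
    "Machine": ["YARR01", "YARR02"],
    "Empty": [nan, nan],
    "Cash": [6.6, nan],
})

# how="all" drops a column only if every value in it is missing;
# axis=1 says that we are dropping columns rather than rows.
tidy = sample.dropna(how="all", axis=1)
print(tidy.columns.tolist())
```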
#Check to see how the columns are typed
df.dtypes
Date              object
Machine           object
Description       object
Tariff            object
Description.1     object
Cash             float64
dtype: object
Let's do a little more tidying:
#Cast the date column as a date type, specifying how to parse the dates.
#Set errors="coerce" to cast any strings that aren't recognised to a NaT value.
#(Older pandas releases used coerce=True for this.)
df.Date=pd.to_datetime(df.Date, format="%Y-%m-%d %H:%M:%S", errors="coerce")
#The final row, whose "Date" cell was originally labelled Total, is actually a total row
df[-4:]
Date | Machine | Description | Tariff | Description.1 | Cash | |
---|---|---|---|---|---|---|
6201 | 2013-03-30 18:24:11 | YARR02 | River Road 2 Yarmouth | 01B | LS with Cch £1.00 30m>1hr | 1.00 |
6202 | 2013-03-30 20:47:47 | YARR02 | River Road 2 Yarmouth | 01A | LS with Cch £0.60 30 Mins | 0.60 |
6203 | 2013-03-30 21:58:09 | YARR02 | River Road 2 Yarmouth | 01C | LS with Cch £1.90 1->2hrs | 2.00 |
6204 | NaT | 6204 | NaN | 6204 | NaN | 18385.85 |
#Let's see if any other dates weren't recognised as such
df[df["Date"].isnull()]
Date | Machine | Description | Tariff | Description.1 | Cash | |
---|---|---|---|---|---|---|
6204 | NaT | 6204 | NaN | 6204 | NaN | 18385.85 |
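The coercion and null check can be tried out on a small made-up series. This sketch assumes a more recent pandas release, in which the argument is spelled errors="coerce"; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: two valid timestamps and a stray "Total" label.
raw_dates = pd.Series(["2012-12-01 06:38:53", "2012-12-01 07:26:12", "Total"])

# Values that do not match the format become NaT rather than raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# isnull() flags the entries that could not be parsed as dates.
print(parsed.isnull().tolist())
```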
#Let's also just check the total by summing the values (except the total) in the Cash column
df[['Cash']][:-1].sum()
Cash 18385.85 dtype: float64
#Let's just check the row count too
df[['Cash']][:-1].count()
Cash 6204 dtype: int64
#Drop the final total row
df.dropna(subset=["Date"],inplace=True)
#The ggplot library is currently under active development
#Grab the most recent version from the github repository
#!pip3 uninstall -y ggplot
#!pip3 install git+https://github.com/yhat/ggplot.git
To get started with ggplot we need to load it in.
from ggplot import *
We call ggplot by passing in a dataframe and specifying the aesthetic mappings. We need to make sure we define an appropriate aesthetic mapping for each dimension we wish to represent in the final display.
We use geom_point() to generate a scatterplot. geom_point() requires at least x and y value mappings. In the aes() definition, assign the x and y coordinate axes to the (quoted) column names you wish to plot from the specified dataset.
ggplot(df, aes(x="Date",y="Cash"))+geom_point()
<ggplot: (8760285807957)>
ggplot charts are constructed on a layered basis. The ggtitle() layer can be used to add a title to the chart.
ggplot(df, aes(x="Date",y="Cash")) + geom_point() \
+ ggtitle("Payments made over time")
<ggplot: (8760285779534)>
Another useful layer for styling the presentation of a chart is the labs() layer, which can be used to set axis labels.
Note that ggplot actually returns a chart object, which means that we can assign it to a variable and add further layers or modifications to that variable or chart object.
g = ggplot(df, aes(x="Date",y="Cash")) + geom_point()
g = g + ggtitle("Payments made over time")
g = g + labs("Transaction Date", "Transaction amount (£)")
g
<ggplot: (-9223363276568934428)>
Whilst a very simple technique, counting is often one of the most useful tools in our toolbox. For example, let's count how many tickets were issued by each machine for each tariff by aggregating over each group using the len function.
df[["Tariff","Machine"]].groupby(['Tariff',"Machine"]).agg(len).sort_index()
Tariff  Machine
01A     YARR01      133
        YARR02      192
01B     YARR01      627
        YARR02      595
01C     YARR01     1022
        YARR02     1014
01D     YARR01      572
        YARR02      488
01E     YARR01      302
        YARR02      222
01F     YARR01      564
        YARR02      463
01G     YARR01        1
        YARR02        6
01H     YARR02        3
dtype: int64
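An alternative way of counting the rows in each group, which you may find reads a little more directly, is the size() method of a grouped dataframe. The sketch below uses a small made-up dataframe rather than the parking data.

```python
import pandas as pd

# Hypothetical sample of ticket records.
sample = pd.DataFrame({
    "Tariff": ["01A", "01A", "01B", "01B", "01B"],
    "Machine": ["YARR01", "YARR02", "YARR01", "YARR01", "YARR02"],
})

# size() returns the number of rows in each (Tariff, Machine) group.
counts = sample.groupby(["Tariff", "Machine"]).size()
print(counts)
```

The result is a series indexed by the grouping columns, so a single count can be looked up with, for example, counts[("01B", "YARR01")].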
Bar charts can be used to show counts across categorical variables. Supply the categorical variable you wish to chart as the x value and use geom_bar().
Note that if the variable (foo) you wish to use for the categorical x-values has a numeric type, you can cast it to a categorical variable by calling it as follows: x='factor(foo)'.
p = ggplot(aes(x='Tariff'), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count")
<ggplot: (8760286198953)>
We can add in a grouping variable to produce a stacked bar chart. Here we can see the contribution to the total count in each tariff made from each machine.
p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count")
<ggplot: (8760290565552)>
If the range we want along the horizontal x-axis is a continuous one, we can make use of geom_histogram(). The binwidth will be calculated automatically, but we can also force it to a particular width using the binwidth parameter.
Here I set the binwidth to 0.1, that is, 10 pence, so we can look more closely for small overpayments.
p = ggplot(aes(x='Cash'), data=df)
p + geom_histogram(binwidth=0.1)
<ggplot: (8760290693787)>
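To get a feel for what a 10 pence bin width means, here is a rough sketch that assigns some made-up payments to bins by hand. It is not how the library itself computes bins (the exact bin boundaries it uses may differ); it simply illustrates why a narrow bin separates a £2.00 payment from the £1.90 tariff.

```python
from collections import Counter

# Hypothetical payments in pounds, including some small overpayments.
payments = [0.6, 0.6, 0.65, 1.0, 1.9, 2.0, 2.0, 6.6]
binwidth_pence = 10

# Work in whole pence so that floating point rounding does not
# push a value into the wrong bin.
bins = Counter(round(p * 100) // binwidth_pence for p in payments)

for bin_index in sorted(bins):
    lower = bin_index * binwidth_pence / 100
    print("from £%.2f: %d payment(s)" % (lower, bins[bin_index]))
```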
Sometimes, whilst we can plot a chart, it may not really be meaningful to do so.
For situations where you have a continuous distribution of values along a continuous numerical axis, it may make sense to produce a frequency distribution chart. For example, a chart showing the distribution of the height of people in a population. The geom_density() chart works out the relative frequency of each value and produces a smoothed curve that estimates the continuous (frequency) distribution. The vertical y-axis shows the estimated density, which reflects the proportion of the population taking values near each point. The total area under the curve should be 1.
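One common way of producing such a smoothed curve is a Gaussian kernel density estimate. The sketch below is a generic version of that technique, not necessarily what geom_density() does internally; the heights and the bandwidth value are made up. It also checks numerically that the area under the curve is close to 1.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Each observation contributes a small bell curve centred on its value.
        return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm

    return density

heights = [150, 160, 165, 170, 172, 180]  # hypothetical heights in cm
density = gaussian_kde(heights, bandwidth=5.0)

# Approximate the area under the curve between 100cm and 250cm
# by adding up thin rectangles of width 0.1cm.
step = 0.1
area = sum(density(100 + i * step) * step for i in range(1500))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, while a smaller one follows the individual observations more closely.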
The Cash payment received is, in a sense, a continuous variable (people could pay any amount, at least in steps of 5 pence, the smallest coin accepted by the parking meters) although the expectation is that only discrete amounts (as specified by the tariffs) are required as payment.
p = ggplot(aes(x='Cash'), data=df)
p + geom_density() + ggtitle("Distribution of Payment Amounts") + labs("Payment (£)", "Proportion")
<ggplot: (8760285754537)>
In the chart, we see peaks at about 50 pence, just below £2 and at about £3.30, a small bump at about £4.40 and a final burst at about £6.60.
We can lookup the tariff amounts from the Description.1 column:
df['Description.1'].unique()
array(['LS with Cch £6.60 6>24hrs', 'LS with Cch £1.00 30m>1hr', 'LS with Cch £4.50 4->6hrs', 'LS with Cch £1.90 1->2hrs', 'LS with Cch £3.40 2->4hrs', 'LS with Cch £0.60 30 Mins', 'LS with Cch £3.00 Cch>10h', 'LS with Cch £10 Cch10>14h'], dtype=object)
That is, we have distinct payment amounts at £10, £6.60, £4.50, £3.40, £3.00, £1.90, £1.00 and £0.60.
We can add additional layers to the chart to highlight these values using geom_vline(), which adds a vertical line at a particular x value.
p = ggplot(aes(x='Cash'), data=df)
p = p + geom_density() + ggtitle("Distribution of Payment Amounts") + labs("Payment (£)", "Proportion")
p + geom_vline(xintercept=[10, 6.6, 4.5, 3.4, 3.0, 1.9, 1.0, 0.6], colour='blue')
<ggplot: (8760285886149)>
We see high frequency spikes at all these amounts apart from at the £3 tariff (short stay coaches).
Note that we can set the extent of the x and y axes by adding + xlim(MIN_X, MAX_X) and + ylim(MIN_Y, MAX_Y) modification layers to the chart.
For charting continuous values, particularly ones that are plotted over time, a line chart often makes most sense.
When looking at transaction reports that include separate amounts for each transaction, a chart showing the running total or accumulated amount can often be useful.
We can add such a value as an additional column by sorting the data frame appropriately and then calculating the cumulative sum over the Cash column using the cumsum() method.
#sort_values() replaces the df.sort() method of older pandas releases
df.sort_values(['Date'],inplace=True)
df['Cash_cumul'] = df.Cash.cumsum()
df[:5]
Date | Machine | Description | Tariff | Description.1 | Cash | Cash_cumul | |
---|---|---|---|---|---|---|---|
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 13.2 |
3221 | 2012-12-01 07:30:14 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 7.0 | 20.2 |
3222 | 2012-12-01 07:40:09 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 26.8 |
3223 | 2012-12-01 07:58:18 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 33.4 |
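The sort-then-accumulate step can be sketched on a few made-up rows. This assumes a more recent pandas release, where the sorting method is called sort_values(). Note that the running total only makes sense once the rows are in date order.

```python
import pandas as pd

# Hypothetical transactions, deliberately listed out of date order.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-01 08:00", "2012-12-01 06:00", "2012-12-01 07:00"]),
    "Cash": [1.0, 6.6, 2.0],
})

# Sort into date order first, then accumulate the payments.
ordered = sample.sort_values("Date")
ordered["Cash_cumul"] = ordered["Cash"].cumsum()
print(ordered)
```

As in the table above, the original index values are kept after sorting, which is why the index is no longer in numerical order.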
To plot the value as a line chart, we use geom_line().
#As well as passing a dataframe to the ggplot function as the first argument, we can also pass it via the data= argument
g = ggplot(aes(x="Date",y="Cash_cumul"), data=df )+ geom_line()
g
<ggplot: (-9223363276568721632)>
Modify the chart generated directly above by adding an appropriate title and tidying up the axis titles.
#Add title
#Add suitable axis labels, and then display the chart
Judicious use of colour can often help us pack more information into a chart in a way that still allows us to read it. For example, suppose we want to look at the accumulated spend over time by Tariff to see which Tariff appears to be generating most revenue.
We can group on the tariff and calculate the accumulated revenue within each tariff band.
group=df[['Tariff','Cash']].groupby('Tariff')
#For each group of rows, apply the transformation to each row in the group
#The number of rows in the response will be the same as the number of rows in the original data frame
#Pass the method name as a string so that no separate cumsum function needs to be imported
df['Cash_cumul2']=group.transform('cumsum')['Cash']
df[:10]
Date | Machine | Description | Tariff | Description.1 | Cash | Cash_cumul | Cash_cumul2 | |
---|---|---|---|---|---|---|---|---|
0 | 2012-12-01 06:38:53 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 6.6 | 6.6 |
1 | 2012-12-01 07:26:12 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 13.2 | 13.2 |
3221 | 2012-12-01 07:30:14 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 7.0 | 20.2 | 20.2 |
3222 | 2012-12-01 07:40:09 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 26.8 | 26.8 |
3223 | 2012-12-01 07:58:18 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 33.4 | 33.4 |
3224 | 2012-12-01 08:06:07 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 40.0 | 40.0 |
2 | 2012-12-01 08:22:15 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 46.6 | 46.6 |
3225 | 2012-12-01 08:25:33 | YARR02 | River Road 2 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 53.2 | 53.2 |
3 | 2012-12-01 08:27:01 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 59.8 | 59.8 |
4 | 2012-12-01 08:34:11 | YARR01 | River Road 1 Yarmouth | 01F | LS with Cch £6.60 6>24hrs | 6.6 | 66.4 | 66.4 |
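In the first few rows shown above, every transaction happens to be on the same tariff, so the two cumulative columns agree. A small made-up example with interleaved tariffs shows more clearly that the running total is kept separately for each group. This sketch calls cumsum() directly on the grouped Cash column, which is one of several equivalent ways of writing it.

```python
import pandas as pd

# Hypothetical transactions, already in date order, across two tariffs.
sample = pd.DataFrame({
    "Tariff": ["01F", "01B", "01F", "01B", "01F"],
    "Cash": [6.5, 1.0, 6.5, 1.0, 7.0],
})

# The running total restarts for each tariff; the result has one value per original row.
sample["Cash_cumul2"] = sample.groupby("Tariff")["Cash"].cumsum()
print(sample)
```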
We can now plot the tariff-based cumulative totals as separate lines, splitting each line out using the colour aesthetic.
ggplot(df,aes(x="Date",y="Cash_cumul2",colour="Tariff"))+geom_line()
<ggplot: (8760290711777)>
On other occasions, we may wish to split out data from different groups into different charts. This is referred to as faceting. We can split a dataset across separate charts based on the value of a particular group attribute by using the facet_wrap() layer.
ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
+ ggtitle("Payments made over time") \
+ labs("Transaction Date", "Transaction amount (£)") \
+ facet_wrap("Tariff")
<ggplot: (8760286052826)>
By default, axis values are generated for each chart independently. However, we can also force them to use the same axes by setting the scales parameter to fixed, as opposed to free.
ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
+ ggtitle("Payments made over time") \
+ labs("Transaction Date", "Transaction amount (£)") \
+ facet_wrap("Tariff",scales = "fixed")
<ggplot: (8760285536211)>
How many ticket machines is the data collected from and how many transactions are recorded by each one?
How would you generate a faceted chart showing the accumulated transactions over time for each of the ticket machines identified in the Machine column?
# Identifying the number of distinct machines and number of transactions recorded by each one
# Accumulated total for each machine
#Chart faceted by ticket machine
When it comes to publishing charts in a particular publication, we often require that the chart is presented in a particular style. In the same way that we can use different CSS style files to alter the look of a particular HTML document, so we can alter the look of a chart generated using ggplot by applying different themes to the chart.
For example, if we need to inject a little humour or apparent "casualness" into a chart, at the expense of some accuracy in the chart, we can use the XKCD theme.
(For more information about the use of such a theme, and techniques for creating such "sketchy visualisations", see Wood, Jo, Petra Isenberg, Tobias Isenberg, Jason Dykes, Nadia Boukhelifa, and Aidan Slingsby. "Sketchy rendering for information visualization." IEEE Transactions on Visualization and Computer Graphics, 18(12), 2012: 2749-2758.)
p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p = p + geom_bar() + ggtitle("Number of Tickets per Tariff") + labs("Tariff Code", "Count")
p + theme_xkcd()
<ggplot: (8760285384200)>
Another very useful theme is the "clean" looking theme_bw().
p + theme_bw()
<ggplot: (8760286107582)>
In this notebook, we have introduced some of the basic chart types that are supported by ggplot, as well as some of the modifications you can make to the charts. You can find further information, as well as descriptions of additional chart types, in the ggplot documentation.
Feel free to extend this notebook as your own personal reference notebook by adding further sections about additional chart types.
If you are working through this notebook as part of an inline exercise, return to the course materials now. If you are working through this set of notebooks as a whole, move on to 4.5.3 Getting Started With Maps - folium.