Getting Started With ggplot¶

ggplot is a Python port of ggplot2, the popular R implementation by Hadley Wickham of many of the ideas proposed in Leland Wilkinson's The Grammar of Graphics. As such, it attempts to enforce good practice in the generation of charts from appropriately shaped datasets.

The "default" graphics library for use with pandas is arguably the matplotlib 2D Python plotting library, but whilst matplotlib offers a wide variety of powerful charting capabilities, it is often less concise and more convoluted than ggplot. It doesn't produce charts that are quite as pretty (or professional looking) as ggplot does out of the box either!

ggplot also offers a cleaner separation of data and graphical transformations, in accord with Wilkinson's original model. The full original ggplot2 implementation was also developed with the production of statistical graphics in mind. That is, the library provided a range of statistical transformations that could be applied to a dataset as part of the graphic generation process.

The full ggplot documentation can be found here: ggplot documentation.

The Python statsmodels library is one of the more widely used statistical computing libraries for Python, providing a range of powerful chart types as well as employing pandas-based data structures for representing datasets. But whilst statsmodels does support the generation of powerful statistical charts, it is rather lacking in support for simpler chart types.

So on these grounds of what we might term, at worst, principled expediency, we will tend to focus on the use of ggplot, although you are free to explore other charting libraries yourself. If you particularly want to make use of interactive JavaScript-style charts, however, Vincent could be a good choice. If you prefer using matplotlib, that's fine too. However, we do expect that you also gain an understanding of how to make use of graphics libraries based on the ideas of The Grammar of Graphics.

Finding Some Data...¶

In [37]:

import pandas as pd

The data we will be using in this notebook was released under a Freedom of Information request to the Isle of Wight Council and describes the revenue taken by two ticket machines in a particular pay and display car park over a twelve month period. (You can see the original FOI request here: Pay and display ticket machine logs.)

The data is supplied in a set of Excel spreadsheets. We will open just one for now.

In [38]:

!ls data

CarParks.kml		   anscombesQuartet_longish.tsv
anscombesQuartet.csv	   iw_parkingMeterData
anscombesQuartet_hier.csv  pay_and_display_ticket_machine_l.zip
anscombesQuartet_long.tsv  tmpfile.csv

In [39]:

! unzip data/pay_and_display_ticket_machine_l.zip -P

Archive:  data/pay_and_display_ticket_machine_l.zip
caution: filename not matched:  -P

In [40]:

df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls")
df[:10]

Out[40]:

				Transaction Report	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	2014-03-17 15:25:47.526000
NaN	NaN	NaN	NaN	01/12/2012 00:00 - 31/03/2013 00:00	NaN	NaN	NaN	NaN	NaN
			NaN	Machines : YARR01, YARR02	NaN	NaN	NaN	NaN	NaN
			NaN	NaN	NaN	NaN	NaN	NaN	NaN
			NaN	Tariffs : ALL	NaN	NaN	NaN	NaN	NaN
			NaN	NaN	NaN	NaN	NaN	NaN	NaN
Date	Machine	Description	NaN	NaN	Tariff	Description	NaN	Cash	NaN
2012-12-01 06:38:53	YARR01	River Road 1 Yarmouth	NaN	NaN	01F	LS with Cch £6.60 6>24hrs	NaN	6.6	NaN
2012-12-01 07:26:12	YARR01	River Road 1 Yarmouth	NaN	NaN	01F	LS with Cch £6.60 6>24hrs	NaN	6.6	NaN
2012-12-01 08:22:15	YARR01	River Road 1 Yarmouth	NaN	NaN	01F	LS with Cch £6.60 6>24hrs	NaN	6.6	NaN
2012-12-01 08:27:01	YARR01	River Road 1 Yarmouth	NaN	NaN	01F	LS with Cch £6.60 6>24hrs	NaN	6.6	NaN

By inspection, we see that there are six rows before the header row. Let's try loading the data in by skipping those rows.

In [41]:

df=pd.read_excel("data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls", \
                 skiprows=6)
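
The effect of skipping preamble rows can be tried without the spreadsheet to hand. The following is a minimal sketch that assumes the same idea carries over to pd.read_csv(); the report text is hypothetical and much shorter than the real file, so only three rows are skipped here.

```python
import io
import pandas as pd

# Hypothetical report text: three preamble lines, then the real header row.
raw = """Transaction Report
Machines : YARR01 and YARR02
Tariffs : ALL
Date,Machine,Cash
2012-12-01 06:38:53,YARR01,6.6
2012-12-01 07:26:12,YARR01,6.6
"""

# skiprows=3 discards the preamble so that the fourth line supplies the column names.
demo = pd.read_csv(io.StringIO(raw), skiprows=3)
print(list(demo.columns))
```

The number passed to skiprows has to match the number of lines that sit above the header row in the particular file being read.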

There appear to be some columns that just contain NaN values - let's tidy the dataset a little by dropping those columns.

In [42]:

df.dropna(how='all',axis=1,inplace=True)
df[:10]

Out[42]:

	Date	Machine	Description	Tariff	Description.1	Cash
0	2012-12-01 06:38:53	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
1	2012-12-01 07:26:12	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
2	2012-12-01 08:22:15	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
3	2012-12-01 08:27:01	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
4	2012-12-01 08:34:11	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
5	2012-12-01 08:37:35	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
6	2012-12-01 08:38:24	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
7	2012-12-01 08:39:06	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
8	2012-12-01 08:45:51	YARR01	River Road 1 Yarmouth	01B	LS with Cch £1.00 30m>1hr	1.0
9	2012-12-01 08:47:50	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6
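
To see what how='all' does compared with the default, here is a small sketch on a hypothetical three-column frame: only the column that is empty in every row is removed, while a column with some missing values is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from the parking meter report.
toy = pd.DataFrame({
    "Date": ["2012-12-01", "2012-12-02"],
    "Unnamed: 1": [np.nan, np.nan],  # entirely empty, so it is dropped
    "Cash": [6.6, np.nan],           # only partly empty, so it is kept
})

# axis=1 works on columns; how="all" requires every value in the column to be missing.
tidy = toy.dropna(how="all", axis=1)
print(list(tidy.columns))
```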

In [43]:

#Check to see how the columns are typed
df.dtypes

Out[43]:

Date              object
Machine           object
Description       object
Tariff            object
Description.1     object
Cash             float64
dtype: object

Let's do a little more tidying:

In [44]:

#Cast the date column as a date type, specifying how to parse the dates.
#Set errors="coerce" to cast any strings that aren't recognised to a NaT value.
#(Older pandas releases spelled this option coerce=True.)
df.Date=pd.to_datetime(df.Date,  format="%Y-%m-%d %H:%M:%S", errors="coerce")
#The final row, whose original "Date" value was the label Total, is actually a totals row
df[-4:]

Out[44]:

	Date	Machine	Description	Tariff	Description.1	Cash
6201	2013-03-30 18:24:11	YARR02	River Road 2 Yarmouth	01B	LS with Cch £1.00 30m>1hr	1.00
6202	2013-03-30 20:47:47	YARR02	River Road 2 Yarmouth	01A	LS with Cch £0.60 30 Mins	0.60
6203	2013-03-30 21:58:09	YARR02	River Road 2 Yarmouth	01C	LS with Cch £1.90 1->2hrs	2.00
6204	NaT	6204	NaN	6204	NaN	18385.85
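
The way an unparseable value turns into NaT can be seen in isolation. This is a sketch on two hypothetical values, assuming a recent pandas release in which the coercion is requested with errors="coerce".

```python
import pandas as pd

# Hypothetical column: one well-formed timestamp and one label that is not a date.
raw_dates = pd.Series(["2012-12-01 06:38:53", "Total"])

# Values that do not match the format become NaT rather than raising an error.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")
print(parsed.isnull().tolist())
```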

In [45]:

#Let's see if any other dates weren't recognised as such
df[df["Date"].isnull()]

Out[45]:

	Date	Machine	Description	Tariff	Description.1	Cash
6204	NaT	6204	NaN	6204	NaN	18385.85

In [46]:

#Let's also just check the total by summing the values (except the total) in the Cash column
df[['Cash']][:-1].sum()

Out[46]:

Cash    18385.85
dtype: float64

In [47]:

#Let's just check the row count too
df[['Cash']][:-1].count()

Out[47]:

Cash    6204
dtype: int64

In [48]:

#Drop the final total row
df.dropna(subset=["Date"],inplace=True)

Using ggplot¶

In [49]:

#The ggplot library is currently under active development
#Grab the most recent version from the github repository
#!pip3 uninstall -y ggplot
#!pip3 install git+https://github.com/yhat/ggplot.git

To get started with ggplot we need to load it in.

In [50]:

from ggplot import *

We call ggplot by passing in a dataframe and specifying the aesthetic mappings. We need to make sure we define an appropriate aesthetic mapping for each dimension we wish to represent in the final display.

geom_point()¶

We use geom_point() to generate a scatterplot. geom_point() requires at least x and y value mappings. In the aes() definition, assign the x and y coordinate axes to the (quoted) column names you wish to plot from the specified dataset.

In [51]:

ggplot(df, aes(x="Date",y="Cash"))+geom_point()

Out[51]:

<ggplot: (8760285807957)>

ggtitle()¶

ggplot charts are constructed on a layered basis. The ggtitle() layer can be used to add a title to the chart.

In [52]:

ggplot(df, aes(x="Date",y="Cash")) + geom_point() \
                                   + ggtitle("Payments made over time")

Out[52]:

<ggplot: (8760285779534)>

labs()¶

Another useful layer for styling the presentation of a chart is the labs() layer that can be used to set axis labels.

Note that ggplot actually returns a chart object, which means that we can assign it to a variable and add further layers or modifications to that variable or chart object.

In [53]:

g = ggplot(df, aes(x="Date",y="Cash")) + geom_point()
g = g + ggtitle("Payments made over time")
g = g + labs("Transaction Date", "Transaction amount (£)")
g

Out[53]:

<ggplot: (-9223363276568934428)>

Bar Charts and Histograms¶

Whilst a very simple technique, counting is often one of the most useful tools in our toolbox. For example, let's count how many tickets were issued by each machine for each tariff by aggregating over each group using the len function.

In [54]:

df[["Tariff","Machine"]].groupby(['Tariff',"Machine"]).agg(len).sort_index()

Out[54]:

Tariff  Machine
01A     YARR01      133
        YARR02      192
01B     YARR01      627
        YARR02      595
01C     YARR01     1022
        YARR02     1014
01D     YARR01      572
        YARR02      488
01E     YARR01      302
        YARR02      222
01F     YARR01      564
        YARR02      463
01G     YARR01        1
        YARR02        6
01H     YARR02        3
dtype: int64
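
An equivalent way of counting rows per group, shown here as a sketch on hypothetical data, is the size() method of a pandas groupby object. It is an alternative to aggregating with len, not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical ticket records.
toy = pd.DataFrame({
    "Tariff":  ["01A", "01A", "01B", "01B", "01B"],
    "Machine": ["YARR01", "YARR02", "YARR01", "YARR01", "YARR02"],
})

# size() returns the number of rows in each (Tariff, Machine) group.
counts = toy.groupby(["Tariff", "Machine"]).size()
print(counts)
```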

Bar charts can be used to provide charts showing counts across categorical variables. Supply the categorical variable you wish to chart as the x value and use geom_bar().

Note that if the variable (foo) you wish to use for the categorical x-values has a numeric type, you can cast it to a categorical variable by calling it as follows: x='factor(foo)'.

In [55]:

p = ggplot(aes(x='Tariff'), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff")  + labs("Tariff Code", "Count")

Out[55]:

<ggplot: (8760286198953)>

We can add in a grouping variable to produce a stacked bar chart. Here we can see the contribution to the total count in each tariff made from each machine.

In [56]:

p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p + geom_bar() + ggtitle("Number of Tickets per Tariff")  + labs("Tariff Code", "Count")

Out[56]:

<ggplot: (8760290565552)>

If the range we want along the horizontal x-axis is a continuous one, we can make use of geom_histogram(). The binwidth will be automatically calculated, but we can also force it to a particular width using the binwidth parameter.

Here I set the binwidth to 0.1, that is, 10 pence, so we can more closely look for small overpayments.

In [57]:

p = ggplot(aes(x='Cash'), data=df)
p + geom_histogram(binwidth=0.1) 

Out[57]:

<ggplot: (8760290693787)>
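
To make the idea of a 10 pence bin concrete, the counts behind such a histogram can be worked out by hand. The sketch below uses hypothetical payments and converts to whole pence first, which is a choice made here to sidestep floating point rounding at bin edges; it is not necessarily how geom_histogram() computes its bins.

```python
import pandas as pd

# Hypothetical payments in pounds.
cash = pd.Series([0.6, 0.65, 1.0, 1.9, 2.0, 2.0])

# Work in whole pence, then round each payment down to the start of its 10 pence bin.
pence = (cash * 100).round().astype(int)
bin_start = (pence // 10) * 10

# Number of payments falling in each bin, ordered by bin.
counts = bin_start.value_counts().sort_index()
print(counts)
```

Here the 60 pence and 65 pence payments share a bin, which is how small overpayments just above a tariff amount would show up.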

Frequency Distributions - geom_density()¶

Sometimes, whilst we can plot a chart, it may not really be meaningful to do so.

For situations where you have a continuous distribution of values along a continuous numerical axis, it may make sense to produce a frequency distribution chart, for example, a chart showing the distribution of the height of people in a population. The geom_density() chart works out the relative frequency of each value and produces a smoothed curve that estimates the continuous (frequency) distribution. The vertical y-axis gives the estimated density (loosely, the proportion of the population taking values near that point), and the total area under the curve should be 1.

The Cash payment received is, in a sense, a continuous variable (people could pay any amount, at least in steps of 5 pence, the smallest coin accepted by the parking meters) although the expectation is that only discrete amounts (as specified by the tariffs) are required as payment.
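
The claim that the area under a density estimate is 1 can be checked numerically. The sketch below does not reproduce the smoothed curve drawn by geom_density(); it swaps in a simpler histogram-based density estimate from NumPy, applied to hypothetical height data, to show the normalisation.

```python
import numpy as np

# Hypothetical heights in cm, drawn from a normal distribution with a fixed seed.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=1000)

# density=True scales the bar heights so that height times width sums to 1 over all bins.
density, edges = np.histogram(heights, bins=20, density=True)
area = float(np.sum(density * np.diff(edges)))
print(round(area, 6))
```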

In [58]:

p = ggplot(aes(x='Cash'), data=df)
p + geom_density() + ggtitle("Distribution of Payment Amounts")  + labs("Payment (£)", "Proportion")

Out[58]:

<ggplot: (8760285754537)>

From the chart, we see peaks at about 50 pence, just below £2 and at about £3.30, a small bump at about £4.40 and a final burst at about £6.60.

We can look up the tariff amounts from the Description.1 column:

In [59]:

df['Description.1'].unique()

Out[59]:

array(['LS with Cch £6.60 6>24hrs', 'LS with Cch £1.00 30m>1hr',
       'LS with Cch £4.50 4->6hrs', 'LS with Cch £1.90 1->2hrs',
       'LS with Cch £3.40 2->4hrs', 'LS with Cch £0.60 30 Mins',
       'LS with Cch £3.00 Cch>10h', 'LS with Cch £10 Cch10>14h'], dtype=object)

That is, we have distinct payment amounts at £10, £6.60, £4.50, £3.40, £3.00, £1.90, £1.00 and £0.60.
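
Rather than reading the amounts off by eye, they can be pulled out of the description strings. This is a sketch using a regular expression of my own choosing on the eight descriptions listed above; it assumes every description contains exactly one amount prefixed by a £ sign.

```python
import pandas as pd

# The eight distinct tariff descriptions reported above.
descriptions = pd.Series([
    'LS with Cch £6.60 6>24hrs', 'LS with Cch £1.00 30m>1hr',
    'LS with Cch £4.50 4->6hrs', 'LS with Cch £1.90 1->2hrs',
    'LS with Cch £3.40 2->4hrs', 'LS with Cch £0.60 30 Mins',
    'LS with Cch £3.00 Cch>10h', 'LS with Cch £10 Cch10>14h',
])

# Capture the digits (with an optional decimal part) that follow the pound sign.
amounts = descriptions.str.extract(r"£(\d+(?:\.\d+)?)", expand=False).astype(float)
print(sorted(amounts.tolist()))
```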

We can add additional layers to the chart to highlight these values using geom_vline(), which adds a vertical line at a particular x value.

In [60]:

p = ggplot(aes(x='Cash'), data=df)
p = p + geom_density() + ggtitle("Distribution of Payment Amounts")  + labs("Payment (£)", "Proportion")
p + geom_vline(xintercept=[10, 6.6, 4.5, 3.4, 3.0, 1.9, 1.0, 0.6],colour='blue')

Out[60]:

<ggplot: (8760285886149)>

We see high frequency spikes at all these amounts apart from at the £3 tariff (short stay coaches).

Note that we can set the extent of the x and y axes by adding + xlim(MIN_X, MAX_X) and + ylim(MIN_Y, MAX_Y) modification layers to the chart.

Line Charts - geom_line()¶

For charting continuous values, particularly ones that are plotted over time, a line chart often makes most sense.

When looking at transaction reports including separate amounts for each transaction, a chart showing the running total or accumulated amount can often be useful.

We can add such a value as an additional column by sorting the data frame appropriately and then calculating the cumulative sum over the Cash column using the cumsum() method.

In [61]:

#Sort by date; sort_values() replaces the df.sort() method of older pandas releases
df.sort_values(['Date'],inplace=True)
df['Cash_cumul'] = df.Cash.cumsum()
df[:5]

Out[61]:

	Date	Machine	Description	Tariff	Description.1	Cash	Cash_cumul
0	2012-12-01 06:38:53	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	6.6
1	2012-12-01 07:26:12	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	13.2
3221	2012-12-01 07:30:14	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	7.0	20.2
3222	2012-12-01 07:40:09	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	26.8
3223	2012-12-01 07:58:18	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	33.4
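
The order of the rows matters for a running total, which is why the sort comes first. A minimal sketch on three hypothetical transactions, assuming a pandas release that provides sort_values():

```python
import pandas as pd

# Hypothetical transactions, deliberately listed out of date order.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-01 07:30", "2012-12-01 06:38", "2012-12-01 07:26"]),
    "Cash": [7.0, 6.6, 6.6],
})

# Sort chronologically, then accumulate the payments in that order.
toy = toy.sort_values("Date")
toy["Cash_cumul"] = toy["Cash"].cumsum()
print(toy)
```

As in the output above, the original index labels are kept after sorting, so they no longer run in sequence.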

To plot the value as a line chart, we use geom_line().

In [62]:

#As well as passing a dataframe to the ggplot function as the first argument, we can also pass it via the data= argument
g = ggplot(aes(x="Date",y="Cash_cumul"), data=df )+ geom_line()
g

Out[62]:

<ggplot: (-9223363276568721632)>

Exercise¶

Modify the chart generated directly above by adding an appropriate title and tidying up the axis titles.

In [63]:

#Add title

#Add suitable axis labels, and then display the chart

Grouping by Colour¶

Judicious use of colour can often help us pack more information into a chart in a way that still allows us to read it. For example, suppose we want to look at the accumulated spend over time by Tariff to see which Tariff appears to be generating most revenue.

We can group on the tariff and calculate the accumulated revenue within each tariff band.

In [64]:

group=df[['Tariff','Cash']].groupby('Tariff')
#For each group of rows, apply the transformation to each row in the group
#The number of rows in the response will be the same as the number of rows in the original data frame
#Pass the transformation by name so that no separate import of cumsum is needed
df['Cash_cumul2']=group.transform('cumsum')['Cash']
df[:10]

Out[64]:

	Date	Machine	Description	Tariff	Description.1	Cash	Cash_cumul	Cash_cumul2
0	2012-12-01 06:38:53	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	6.6	6.6
1	2012-12-01 07:26:12	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	13.2	13.2
3221	2012-12-01 07:30:14	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	7.0	20.2	20.2
3222	2012-12-01 07:40:09	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	26.8	26.8
3223	2012-12-01 07:58:18	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	33.4	33.4
3224	2012-12-01 08:06:07	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	40.0	40.0
2	2012-12-01 08:22:15	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	46.6	46.6
3225	2012-12-01 08:25:33	YARR02	River Road 2 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	53.2	53.2
3	2012-12-01 08:27:01	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	59.8	59.8
4	2012-12-01 08:34:11	YARR01	River Road 1 Yarmouth	01F	LS with Cch £6.60 6>24hrs	6.6	66.4	66.4
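
Because the first rows of the real data all belong to the same tariff, the two cumulative columns look identical in the output above. A sketch on hypothetical rows that interleave two tariffs makes the per-group behaviour easier to see; it calls cumsum() directly on the grouped column, which is one of several equivalent ways to write it.

```python
import pandas as pd

# Hypothetical payments alternating between two tariffs.
toy = pd.DataFrame({
    "Tariff": ["01F", "01B", "01F", "01B"],
    "Cash":   [6.6, 1.0, 6.6, 1.0],
})

# The running total restarts for each tariff, but the rows keep their original order.
toy["Cash_cumul2"] = toy.groupby("Tariff")["Cash"].cumsum()
print(toy)
```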

We can now plot the tariff-based cumulative totals as separate lines, splitting each line out using the colour aesthetic.

In [65]:

ggplot(df,aes(x="Date",y="Cash_cumul2",colour="Tariff"))+geom_line()

Out[65]:

<ggplot: (8760290711777)>

Faceted Charts¶

On other occasions, we may wish to split out data from different groups into different charts. This is referred to as faceting. We can split a dataset across separate charts based on the value of a particular group attribute by using the facet_wrap() layer.

In [66]:

ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
                                   + ggtitle("Payments made over time") \
                                   + labs("Transaction Date", "Transaction amount (£)") \
                                   + facet_wrap("Tariff")

Out[66]:

<ggplot: (8760286052826)>

By default, axis values are generated for each chart independently. However, we can also force them to use the same axes by setting the scales parameter to fixed, as opposed to free.

In [67]:

ggplot(df, aes(x="Date",y="Cash_cumul2")) + geom_line() \
                                   + ggtitle("Payments made over time") \
                                   + labs("Transaction Date", "Transaction amount (£)") \
                                   + facet_wrap("Tariff",scales = "fixed")

Out[67]:

<ggplot: (8760285536211)>

Exercise¶

How many ticket machines is the data collected from and how many transactions are recorded by each one?

How would you generate a faceted chart showing the accumulated transactions over time for each of the ticket machines identified in the Machine column?

In [68]:

# Identifying the number of distinct machines and number of transactions recorded by each one

In [69]:

# Accumulated total for each machine

In [70]:

#Chart faceted by ticket machine

Themes¶

When it comes to publishing charts in a particular publication, we often require that the chart is presented in a particular style. In the same way that we can use different CSS style files to alter the look of a particular HTML document, so we can alter the look of a chart generated using ggplot by applying different themes to the chart.

For example, if we need to inject a little humour or apparent "casualness" into a chart, at the expense of some accuracy in the chart, we can use the XKCD theme.

(For more information about the use of such a theme, and techniques for creating such "sketchy visualisations", see Wood, Jo, Petra Isenberg, Tobias Isenberg, Jason Dykes, Nadia Boukhelifa, and Aidan Slingsby. "Sketchy rendering for information visualization." IEEE Transactions on Visualization and Computer Graphics, 18(12), 2012: 2749-2758.)

In [71]:

p = ggplot(aes(x='Tariff',fill="Machine"), data=df)
p = p + geom_bar() + ggtitle("Number of Tickets per Tariff")  + labs("Tariff Code", "Count") 

p + theme_xkcd()

Out[71]:

<ggplot: (8760285384200)>

Another very useful theme is the "clean" looking theme_bw().

In [72]:

p + theme_bw()

Out[72]:

<ggplot: (8760286107582)>

What Next?¶

In this notebook, we have introduced some of the basic chart types that are supported by ggplot, as well as some of the modifications you can make to the charts. You can find further information, as well as descriptions of additional chart types, from the ggplot documentation.

Feel free to extend this notebook as your own personal reference notebook by adding further sections about additional chart types.

If you are working through this notebook as part of an inline exercise, return to the course materials now. If you are working through this set of notebooks as a whole, move on to 4.5.3 Getting Started With Maps - folium.