In a 1973 paper, *Graphs in Statistical Analysis*, published in *The American Statistician*, Vol. 27, No. 1. (Feb., 1973), pp. 17-21, statistician Francis Anscombe provided the briefiest of abstracts: *"Graphs are essential to good statistical analysis"*.

His paper opened with a brief meditation on *the usefulness of graphs*:

*Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:*

- numerical calculations are exact, but graphs are rough;
- for any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
- performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.

Graphs can have various purposes, such as: (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.

Good statistical analysis is not a purely routine matter, and generally calls for more than one pass through the computer. The analysis should be sensitive both to peculiar features in the given numbers and also to whatever background information is available about the variables. The latter is particularly helpful in suggesting alternative ways of setting up the analysis. Thought and ingenuity devoted to devising good graphs are likely to pay off. Many ideas can be gleaned from the literature...</em>

*x* and *y* value each) intended to demonstrate the usefulness of looking at graphs.

In [ ]:

```
import pandas as pd
```

In [ ]:

```
aq=pd.read_csv('data/anscombesQuartet_hier.csv',header=[0,1],index_col=[0])
aq
```

*x* and *y* across the groups, the variances were all but indistinguishable in any meaningful sense of the term.

In [ ]:

```
aq.mean()
```

In [ ]:

```
aq.var()
```

Other statistical properties, such as regression lines, were also the same.

So from these summary statistics, we conclude that the datasets are to all intents and purposes the same.

But what if we look at them?

The most natural way to look at this data is to use a *scatterplot*, with the *x*-values places along a continuous horizontal axis and the *y*-values against the vertical axis. Points are plotted as marks using their *x* and *y* values within each group as the Cartesian co-ordinates for each point.

Using *ggplot*, we can construct the plot directly from the dataframe, if it's correctly shaped...

The easiest way to plot this quartet id to generate a dataframe that has a column containing the group number and then columns for corresponding *x* and *y* pairs. We can then generate what is known as a *faceted plot* over the groups, generating one chart per group.

In order to do this, we need to *reshape* the dataframe. One way of doing this would be to use OpenRefine, and a combination of *transpose* operations. (If you would like to try this, I have provided the data for Anscombe's quartet in another form that is perhaps easier to read in to OpenRefine: Anscombe's quartet - simple CSV; see a rough draft walkthrough of how to reshape Anscombe's Quartet using OpenRefine here Another approach would be to use *pandas*, as we shall see here.

It's easy enough to melt the original dataframe into a long form that we can then reshape back to slightly wider form, but we also need to create a new index column that will allow us to align the data within each group without giving a duplicate index clash.

One solution is to generate a sequence of index values from 0..10 for each set of *x* and *y* values. This will be used along with the *group* value to create index values in the unmelted dataframe.

In [ ]:

```
tmp=pd.melt(aq)
tmp['index']=list(range(int(len(tmp)/8)))*8
tmp[8:15]
```

If we set a hierarchical index on the *group*, *index* and *var* columns, we can then `unstack()`

the final *var* column.

We then need to tidy up the column names to remove the upper hierarchical level.

In [ ]:

```
df=tmp.set_index(['group','index', 'var']).unstack()
df.columns = [col[1].strip() for col in df.columns.values]
df[7:15]
```

Finally, we can reset the index to give us the simple dataframe representation we require.

In [ ]:

```
df.reset_index(inplace=True)
df[7:15]
```

We are now in a position to plot the data.

In [ ]:

```
from ggplot import *
```

In [ ]:

```
ggplot(aes(x='x', y='y'), data=df)+facet_wrap('group',scales='fixed')+stat_smooth(method='lm',se=False)
```

We can also look to see how the regression lines compare with 95% confidence limits.

In [ ]:

```
ggplot(aes(x='x', y='y'), data=df)+facet_wrap('group',scales='fixed')+stat_smooth(method='lm')
```

Now we're starting to see some differences...

Finally, let's look at the actually data points themselves:

In [ ]:

```
ggplot(aes(x='x', y='y'), data=df)+geom_point()+facet_wrap('group')
```

*x* and *y* values across each group may be the same, and a quick look at the data tables hard to picture with any degree of certainity, but when visualised as a whole, each group of data clearly tells a different story.

Working back from the *ggplot* commands, we see how striaghtforward it can be to generate what is quite a complex plot from a relatively simple command. However, in order to be able to "write" this chart, or set of charts, we need to get the data into the right sort of shape. And that may be quite an involved process.

In may situations, *preparing* the data (which may include *cleaning* it) may take much more time than the actual analysis or visualisation. But that is the price we pay for being able to use such powerful analysis and visualisation tools.

Anscombe concluded his paper as follows:

Computational techniques have moved on somewhat since 1973, of course, and times have indeed changed. Graphical displays are everywhere, and libraries such as *ggplot* that are rooted in a sound *grammatical* basis mean that we are now in a position to "write" very powerful expressions that can generate statistical graphics for us, directly from a cleaned and prepared dataset, using just a few well chosen phrases. But getting the data into the right shape may stull require significant amounts of *trouble, cunning and a fighting spirit*.

May the visualisations begin...

**DO NOT REMOVE THIS NOTICE/CELL FROM THIS IPYTHON NOTEBOOK**

**This notebook was prepared for use in a course on data analysis and data management due to be released by The Open University in October 2015 and is made available AS IS, and IN DRAFT FORM ONLY. It may be used for educational purposes only.**

*Comments to: tony.hirst@open.ac.uk*