IPython and Pandas Blocks Demo

For a new OU course on all matters data that we're working on at the moment, we're exploring the use of IPython notebooks and the pandas library.

A few weeks ago, I came across another Python library called ipythonblocks [demo] that's been developed as a teaching tool around Python programming. The library provides an HTML grid containing squares whose colours are controlled by the programmer.

When I first saw caught sight of this library, I thought it looked ideal as a tool for helping visualise the state of pandas dataframes and various transformations we might apply to them. So here are a few proof of concept doodles... Comments much appreciated via the partner blogpost to this gist/notebook - Visualising Pandas DataFrames With IPythonBlocks – Proof of Concept.

Getting Started

I'm using Anaconda as my Python/ipynb/matplotlib distro, which is a real pain to install things into. Easiest way to get the ipythonblocks library running is to download the .py file from github and install it locally...

In [265]:
from ipythonblocks import BlockGrid
In [266]:
#Here's a simple grid
grid = BlockGrid(10, 10, fill=(123, 234, 123))
grid
Out[266]:

Introducing pandas DateFrames

In [267]:
import pandas as pd
In [268]:
df=pd.DataFrame({'group':['a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
                 'day':['Mon','Tues','Fri','Thurs','Sat','Sun','Weds','Fri','Sun','Thurs','Sat','Weds','Mon','Tues'],
                 'amount':[1,2,4,2,1,1,2,4,5,3,4,2,1,3],
                 'value':[2.1,np.nan,3,4.4,2.5,1,np.nan,np.nan,8.2,3.2,4.1,np.nan,1.2,1.4]
                 })
In [269]:
df
Out[269]:
amount day group value
0 1 Mon a 2.1
1 2 Tues a NaN
2 4 Fri a 3.0
3 2 Thurs a 4.4
4 1 Sat a 2.5
5 1 Sun a 1.0
6 2 Weds a NaN
7 4 Fri b NaN
8 5 Sun b 8.2
9 3 Thurs b 3.2
10 4 Sat b 4.1
11 2 Weds b NaN
12 1 Mon b 1.2
13 3 Tues b 1.4

14 rows × 4 columns

We can get the shape of a dataframe easily enough:

In [270]:
df.shape
Out[270]:
(14, 4)

This gives us (rows,columns). So let's visualise that:

In [271]:
def pBlockGrid(df):
    (y,x)=df.shape
    return BlockGrid(x,y)
In [272]:
pBlockGrid(df)
Out[272]:

We can add in some colour:

In [273]:
def pBlockGrid2(df,fill=(123, 234, 123),*args,**kwargs):
    (y,x)=df.shape
    b=BlockGrid(x,y,fill=fill,**kwargs)
    return b

There's also an opportunity to tweak the blocksize...

In [274]:
pb=pBlockGrid2(df)
pb.block_size = 4
pb
Out[274]:

We can check the type of each column in the pandas dataframe:

In [275]:
df.dtypes
Out[275]:
amount      int64
day        object
group      object
value     float64
dtype: object

Let's start to tweak the function so we can display the columns by data type:

In [276]:
def pBlockGrid3(df,fill=(123, 234, 123),blocksize=10,*args,**kwargs):
    b=pBlockGrid2(df,fill=fill)
    type_colours={'int64':(255, 0, 0),'float64':(0, 0, 255),'object':(128,128,0)}
    for i,j in enumerate(df.dtypes):
        b[:, i] = type_colours[j.name]
    b.block_size = blocksize
    return b
In [277]:
pBlockGrid3(df)
Out[277]:

Some of the cells are NaN or unknown values - let's detect them and colour them black:

In [278]:
def pBlock_colourNA(df,b,color_NA=(0,0,0),*args,**kwargs):
    (y,x)=df.shape
    for i in range(x):
        for j in range(y):
            if pd.isnull(df.iloc[j,i]):
                b[j,i]=color_NA
    return b

def pBlockGrid4(df,fill=(123, 234, 123),*args,**kwargs):
    b=pBlockGrid3(df,fill=fill)
    b=pBlock_colourNA(df,b,*args,**kwargs)
    return b
In [279]:
pBlockGrid4(df,blocksize=5)
Out[279]:

If we transpose the dataframe, we should be able to see the change in shape:

In [280]:
df.transpose()
Out[280]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
amount 1 2 4 2 1 1 2 4 5 3 4 2 1 3
day Mon Tues Fri Thurs Sat Sun Weds Fri Sun Thurs Sat Weds Mon Tues
group a a a a a a a b b b b b b b
value 2.1 NaN 3 4.4 2.5 1 NaN NaN 8.2 3.2 4.1 NaN 1.2 1.4

4 rows × 14 columns

In [281]:
pBlockGrid4(df.transpose(),blocksize=5)
Out[281]:

Joining Dataframes: .concat()

Let's create a couple of dataframes that we can try appending together:

In [282]:
df1=pd.DataFrame({"colA":['A','B','C','D'],"colB":np.arange(4)})
df1
Out[282]:
colA colB
0 A 0
1 B 1
2 C 2
3 D 3

4 rows × 2 columns

In [283]:
df2=pd.DataFrame({"colA":['D','E','F'],"colB":np.arange(14,17)})
df2
Out[283]:
colA colB
0 D 14
1 E 15
2 F 16

3 rows × 2 columns

This is probably a bit clunky as far as "proper" Python hackers go, but this is still proof of concept...

What happens if we create a function that does a pandas .concat() and returns the resulting dataframe along with its block representation:

In [284]:
def pBlockGrid5(df1,df2,join='outer',fill=(123, 234, 123),*args,**kwargs):
    df=pd.concat([df1,df2],join=join)
    b=pBlockGrid2(df,fill=fill)
    for block in b:
        if block.row < len(df1): block.set_colors(255,0,0)
        else: block.set_colors(0,0,255)
    b=pBlock_colourNA(df,b,*args,**kwargs)
    return (df,b)

The idea behind the colouring is to show which dataframe provide what rows. The function should probably be generalised to accept a list of dataframes to merge, and then colour them all differently (which will require setting up a colour ramp).

In [285]:
dfd,bb=pBlockGrid5(df1,df2)
bb
Out[285]:

I'm wondering now: could the pandas DataFrame() object and associated operators (such as merge(), concat(), etc be extended to support ipythonblocks style visualisations?!

Speaking of merge, let's see if we can get something going there so we can inner, outer, left and right join dataframes - and visualise what happens as a result...

In [286]:
df3=pd.DataFrame({"colA":['F','G'],"colB":np.arange(16,18),"colC":np.arange(26,28)})
df3
Out[286]:
colA colB colC
0 F 16 26
1 G 17 27

2 rows × 3 columns

In [287]:
dfd,bb=pBlockGrid5(df1,df3)
bb
Out[287]:

This visualisation tells us that the top four rows include data from the first data frame, and the rightmost column for those rows are filled with NAs. The bottom two rows are complete and come from the second data frame; maybe the second dataframe introduced an additional column?

How about we just look at missing data?

In [288]:
b=pBlockGrid2(dfd,fill=(123, 234, 123))
pBlock_colourNA(dfd,b)
Out[288]:

Joining Dataframes: .merge()

The merge joins are akin to SQL joins - inner, outer, left and right.

I'm going to try to colour things so we can see which dataframe (left or right) the columns came in from...

In [236]:
def pBlockGrid6(df1,df2,fill=(123, 234, 123),*args,**kwargs):
    df=pd.merge(df1,df2,**kwargs)
    b=pBlockGrid2(df,fill=fill)
    for block in b:
        label=df.columns.tolist()[block.col]
        if 'on' in kwargs and label in kwargs['on']: block.set_colors(0,255,0)
        #This following bit assumes that the left N cols in the merged table are from the left table
        elif block.col < len(df1.columns): block.set_colors(255,0,0)
        else: block.set_colors(0,0,255)
    b=pBlock_colourNA(df,b)
    return (df,b)
In [237]:
dfd,bb=pBlockGrid6(df2,df3,how='inner',on=['colA','colB'])
print(dfd)
bb
  colA  colB  colC
0    F    16    26

[1 rows x 3 columns]
Out[237]:

So we read green as common join columns, blue as right dataframe only cells, and red as left dataframe only data cells.

In [238]:
dfd,bb=pBlockGrid6(df2,df3,how='left',on=['colA'])
print(dfd)
bb
  colA  colB_x  colB_y  colC
0    D      14     NaN   NaN
1    E      15     NaN   NaN
2    F      16      16    26

[3 rows x 4 columns]
Out[238]:
In [239]:
dfd,bb=pBlockGrid6(df2,df3,how='right',on=['colA'])
print(dfd)
bb
  colA  colB_x  colB_y  colC
0    F      16      16    26
1    G     NaN      17    27

[2 rows x 4 columns]
Out[239]:

Getting the logic for colouring the outer join clear defeats me for now...:-( I'm not totally convinced it'll always work? Maybe it would help my thinking if we reorder the columns so that columns contributed by the right table are on the right of the joined dataframe, then we should be able to colour columns we know have come from the right table?

Anyway, let's give it a go...

In [289]:
dfd,bb=pBlockGrid6(df2,df3,how='outer',on=['colB'])
print(dfd)
bb
  colA_x  colB colA_y  colC
0      D    14    NaN   NaN
1      E    15    NaN   NaN
2      F    16      F    26
3    NaN    17      G    27

[4 rows x 4 columns]
Out[289]:

Seems to work.. erm, maybe... But I suspect things will start breaking if we allow on_left and on-right join column declarations...:-(

But the principle is there...

And once again I wonder - could this sort of visualisation be built into pandas as a useful teaching tool? How easy would it be to create a patch library to extend pandas with this sort of functionality?

For example, here are some thoughts on a possible pandas interface:

  • DataFrame().blocks() to show the blocks
  • .cat(blocks=True) and .merge(blocks=True) to return (df, blocks)
  • DataFrame().blocks(blockProperties={}) and eg .merge(blocks=True, blockProperties={})
  • blockProperties: showNA=True|False, color_base=(), color_NA=(), color_left=(), color_right=(), color_gradient=[] (eg for a .cat() on many dataframes), colorView=structure|datatypes|missing (the colorView reveals the datatypes of the columns, the structure origins of cells returned from a .merge() or .cat(), or a view of missing data (reveal NA/NaN etc over a base color), colorTypes={} (to set the colors for different datatype)