In [1]:
import pandas as pd
In [2]:
def getstore_and_print_table(fname):
    # open the store and print the underlying PyTables table node
    import pprint
    store = pd.HDFStore(fname)
    pprint.pprint(store.get_storer('df').group.table)
    return store
In [3]:
from numpy.random import randn
df = pd.DataFrame(randn(int(1e6), 2), columns=list('AB'))
In [4]:
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',table=True, index=True)
1 loops, best of 3: 8.36 s per loop

Note how much faster it is without creating an index!

In [5]:
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',table=True, index=False)
1 loops, best of 3: 1.28 s per loop

In [6]:
store = getstore_and_print_table('test.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

The store object still lists indexers; I don't really understand how or why. I guess I really should read a PyTables intro, sorry Jeff! :-)

In [7]:
store
Out[7]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->1000000,ncols->2,indexers->[index],dc->[A,B])

The following %timeit result is the same whether data_columns=list('AB') or data_columns=['B'] was used.

Also, the fact that the selection works at all must be due to the data_columns, even though I created the table with index=False.
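A minimal, self-contained sketch of this point (using a toy-sized frame and the modern `format='table'` spelling of the deprecated `table=True`): data_columns alone make where-clauses possible, even when the table was written with index=False.

```python
import numpy as np
import pandas as pd

# data_columns enable queries even without a pre-built index
df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

with pd.HDFStore('demo.h5') as store:
    sub = store.select('df', where='B > 0.5 & B < 1.6')
```

Without an index this is a full scan of the B column, so it works but is slower than on an indexed (and sorted) table.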

In [8]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 1.02 s per loop

In [9]:
%timeit store.select('df',['A<0.5','A>0.0'])
1 loops, best of 3: 641 ms per loop

Here I tested whether creating an optlevel-9 index makes any difference; I found it negligible. The only important thing is that this index HAS to be 'full', otherwise ptrepack --sortby will bail.
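For reference, the optlevel-9 variant mentioned above would look like the sketch below (`optlevel` and `kind` are real parameters of `HDFStore.create_table_index`; file name and sizes are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('opt_demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

with pd.HDFStore('opt_demo.h5') as store:
    # 'full' (CSI) is what ptrepack --sortby requires; optlevel only
    # tunes how hard PyTables optimizes the index (negligible here)
    store.create_table_index('df', columns=['B'], optlevel=9, kind='full')
    is_full = store.get_storer('df').group.table.colindexes['B'].is_csi
```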

In [10]:
%timeit store.create_table_index('df',columns=['B'],kind='full')
1 loops, best of 3: 613 µs per loop

In [11]:
store.get_storer('df').group.table
Out[11]:
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

No improvement just by creating an index. IIUC, that's because the data_columns, which had been created miraculously even though I saved with index=False, had already created an index automatically, just not a 'full' one as required by ptrepack --sortby:

In [12]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 1.05 s per loop

In [13]:
%timeit store.select('df',['A<0.5','A>0.0'])
1 loops, best of 3: 643 ms per loop

In [14]:
store.close()
In [15]:
!ptdump -v test.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

ptrepack works with a standard level-6 index as well; the only requirement I found is that it absolutely has to be a 'full' index, as created above.

In [28]:
%timeit !ptrepack --chunkshape=auto --sortby=B -o test.h5 test_sorted_noprop.h5
1 loops, best of 3: 4.29 s per loop

See how the lack of --propindexes removes the autoIndex=True and colindexes parts again, just as they were before creating the index in the storer().group.table print-out above:

In [17]:
!ptdump -v test_sorted_noprop.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

The output of get_storer('df').group.table is exactly the same as the ptdump:

In [18]:
store = getstore_and_print_table('test_sorted_noprop.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

In [19]:
store
Out[19]:
<class 'pandas.io.pytables.HDFStore'>
File path: test_sorted_noprop.h5
/df            frame_table  (typ->appendable,nrows->1000000,ncols->2,indexers->[index],dc->[A,B])

But the timing of the selection has improved by a factor of 5!

In [20]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 188 ms per loop

And even without --propindexes the selection on column 'A' remains feasible (which I wasn't sure about before):

In [21]:
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError:", e)
1 loops, best of 3: 626 ms per loop

In [22]:
store.close()

Repacking with --propindexes takes a factor of 7/4.3 ≈ 1.6 longer, with a slightly faster data selection below:

In [27]:
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes -o test.h5 test_sorted.h5
1 loops, best of 3: 7.08 s per loop

In [29]:
!ptdump -v test_sorted.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [30]:
store = getstore_and_print_table('test_sorted.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [31]:
%timeit store.select('df',['B > 0.5','B < 1.6'])
10 loops, best of 3: 172 ms per loop

In [32]:
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError:", e)
1 loops, best of 3: 627 ms per loop

So one has to decide whether the almost 2× longer ptrepack is worth the slightly faster selection.

In [33]:
store.close()

Compression (see the next cell; done here at level 5, I also tested 9) doesn't make the timing much worse, but certainly not better either. Interestingly, it only took marginally longer than the ptrepack with --propindexes, which means that option dominates the ptrepacking time. The file sizes for my examples were:

  • test.h5 31 M
  • test_sorted_noprop.h5 24 M
  • test_sorted.h5 29 M
  • test_sorted_compressed.h5 22 M
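Compression can also be applied directly at write time from pandas instead of during ptrepack, via the `complib`/`complevel` parameters of `to_hdf` (a sketch with a placeholder file name, assuming the blosc filter that ships with PyTables is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
# compress at write time instead of during ptrepack
df.to_hdf('compressed_demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False,
          complib='blosc', complevel=5)

roundtrip = pd.read_hdf('compressed_demo.h5', 'df')
```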

So my conclusion is that collecting data without an index saves a lot of time, and even quite some space, without even using compression!
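The whole workflow above can be condensed into a short sketch (file names are placeholders, sizes shrunk to run quickly):

```python
import numpy as np
import pandas as pd

# 1. fast write: table format with data_columns but *no* index
df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('collected.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

# 2. once collection is done, build the 'full' (CSI) index that
#    ptrepack --sortby requires
with pd.HDFStore('collected.h5') as store:
    store.create_table_index('df', columns=['B'], kind='full')

# 3. then, in a shell, physically sort (and optionally compress):
#    ptrepack --chunkshape=auto --sortby=B -o collected.h5 sorted.h5

# 4. queries already work here; on the ptrepacked sorted.h5 they
#    run several times faster
with pd.HDFStore('collected.h5') as store:
    sub = store.select('df', where='B > 0.5 & B < 1.6')
```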

In [34]:
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes --complib=blosc --complevel=5 -o test.h5 test_sorted_compressed.h5
1 loops, best of 3: 7.21 s per loop

In [35]:
!ptdump -v test_sorted_compressed.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,), shuffle, blosc(5)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [36]:
store = getstore_and_print_table('test_sorted_compressed.h5')
/df/table (Table(1000000,), shuffle, blosc(5)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [37]:
%timeit store.select('df',['B > 0.5','B < 1.6'])
1 loops, best of 3: 191 ms per loop

In [38]:
store.close()