In [1]:
import pandas as pd
In [2]:
def getstore_and_print_table(fname):
    # open the store and print the underlying PyTables table node
    import pprint
    store = pd.HDFStore(fname)
    pprint.pprint(store.get_storer('df').group.table)
    return store
In [3]:
from numpy.random import randn
df = pd.DataFrame(randn(int(1e6), 2), columns=list('AB'))
In [4]:
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',table=True, index=True)
1 loops, best of 3: 8.36 s per loop

Note how much faster it is without creating an index!

In [5]:
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',table=True, index=False)
1 loops, best of 3: 1.28 s per loop

In [6]:
store = getstore_and_print_table('test.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

The store object still lists indexers; I don't really understand how or why. I guess I really should read a PyTables intro, sorry Jeff! :-)

In [7]:
store
Out[7]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable,nrows->1000000,ncols->2,indexers->[index],dc->[A,B])

The following %timeit result is the same whether data_columns=list('AB') or data_columns=['B'] was used.

Also, the fact that the selection works at all must be due to the data_columns, even though I created the table with index=False.
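A minimal, self-contained sketch of this point (using a toy-sized frame and the modern `format='table'` spelling of the deprecated `table=True`): data_columns alone make where-clauses possible, even when the table was written with index=False.

```python
import numpy as np
import pandas as pd

# data_columns enable queries even without a pre-built index
df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

with pd.HDFStore('demo.h5') as store:
    sub = store.select('df', where='B > 0.5 & B < 1.6')
```

Without an index this is a full scan of the B column, so it works but is slower than on an indexed (and sorted) table.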

In [8]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 1.02 s per loop

In [9]:
%timeit store.select('df',['A<0.5','A>0.0'])
1 loops, best of 3: 641 ms per loop

Here I tested whether creating an optlevel-9 index makes any difference; I found it negligible. The only important thing is that this index HAS to be 'full', otherwise ptrepack --sortby will bail.
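For reference, the optlevel-9 variant mentioned above would look like the sketch below (`optlevel` and `kind` are real parameters of `HDFStore.create_table_index`; file name and sizes are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('opt_demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

with pd.HDFStore('opt_demo.h5') as store:
    # 'full' (CSI) is what ptrepack --sortby requires; optlevel only
    # tunes how hard PyTables optimizes the index (negligible here)
    store.create_table_index('df', columns=['B'], optlevel=9, kind='full')
    is_full = store.get_storer('df').group.table.colindexes['B'].is_csi
```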

In [10]:
%timeit store.create_table_index('df',columns=['B'],kind='full')
1 loops, best of 3: 613 µs per loop

In [11]:
store.get_storer('df').group.table
Out[11]:
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

No improvement just by creating an index. IIUC, that's because the data_columns, which had been created miraculously even though I saved with index=False, had already created an index automatically, just not a 'full' one as required by ptrepack --sortby:

In [12]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 1.05 s per loop

In [13]:
%timeit store.select('df',['A<0.5','A>0.0'])
1 loops, best of 3: 643 ms per loop

In [14]:
store.close()
In [15]:
!ptdump -v test.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

ptrepack works with a standard level-6 index as well; the only requirement I found is that it absolutely has to be a 'full' index, as created above.

In [28]:
%timeit !ptrepack --chunkshape=auto --sortby=B -o test.h5 test_sorted_noprop.h5
1 loops, best of 3: 4.29 s per loop

See how the lack of --propindexes removes the autoIndex=True and colindexes parts again, just as they were before creating the index in the storer().group.table print-out above:

In [17]:
!ptdump -v test_sorted_noprop.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

The output of get_storer('df').group.table is exactly the same as the ptdump:

In [18]:
store = getstore_and_print_table('test_sorted_noprop.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)

In [19]:
store
Out[19]:
<class 'pandas.io.pytables.HDFStore'>
File path: test_sorted_noprop.h5
/df            frame_table  (typ->appendable,nrows->1000000,ncols->2,indexers->[index],dc->[A,B])

But the timing of the selection has improved by a factor of 5!

In [20]:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
1 loops, best of 3: 188 ms per loop

And even without --propindexes the selection on column 'A' remains feasible (which I wasn't sure about before):

In [21]:
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError:", e)
1 loops, best of 3: 626 ms per loop

In [22]:
store.close()

Repacking with --propindexes takes a factor of 7/4.3 ≈ 1.6 longer, with a slightly faster data selection below:

In [27]:
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes -o test.h5 test_sorted.h5
1 loops, best of 3: 7.08 s per loop

In [29]:
!ptdump -v test_sorted.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [30]:
store = getstore_and_print_table('test_sorted.h5')
/df/table (Table(1000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [31]:
%timeit store.select('df',['B > 0.5','B < 1.6'])
10 loops, best of 3: 172 ms per loop

In [32]:
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError:", e)
1 loops, best of 3: 627 ms per loop

So one has to decide whether the almost 2× longer ptrepack is worth the slightly faster selection.

In [33]:
store.close()

Compression (see the next cell; done here at level 5, I also tested 9) doesn't make the timing much worse, but certainly not better either. Interestingly, it only took marginally longer than the ptrepack with --propindexes, which means that option dominates the ptrepacking time. The file sizes for my examples were:

  • test.h5 31 M
  • test_sorted_noprop.h5 24 M
  • test_sorted.h5 29 M
  • test_sorted_compressed.h5 22 M
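Compression can also be applied directly at write time from pandas instead of during ptrepack, via the `complib`/`complevel` parameters of `to_hdf` (a sketch with a placeholder file name, assuming the blosc filter that ships with PyTables is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
# compress at write time instead of during ptrepack
df.to_hdf('compressed_demo.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False,
          complib='blosc', complevel=5)

roundtrip = pd.read_hdf('compressed_demo.h5', 'df')
```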

So my conclusion is that collecting data without an index saves a lot of time, and even quite some space, without even using compression!
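The whole workflow above can be condensed into a short sketch (file names are placeholders, sizes shrunk to run quickly):

```python
import numpy as np
import pandas as pd

# 1. fast write: table format with data_columns but *no* index
df = pd.DataFrame(np.random.randn(10_000, 2), columns=list('AB'))
df.to_hdf('collected.h5', key='df', format='table',
          data_columns=['A', 'B'], mode='w', index=False)

# 2. once collection is done, build the 'full' (CSI) index that
#    ptrepack --sortby requires
with pd.HDFStore('collected.h5') as store:
    store.create_table_index('df', columns=['B'], kind='full')

# 3. then, in a shell, physically sort (and optionally compress):
#    ptrepack --chunkshape=auto --sortby=B -o collected.h5 sorted.h5

# 4. queries already work here; on the ptrepacked sorted.h5 they
#    run several times faster
with pd.HDFStore('collected.h5') as store:
    sub = store.select('df', where='B > 0.5 & B < 1.6')
```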

In [34]:
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes --complib=blosc --complevel=5 -o test.h5 test_sorted_compressed.h5
1 loops, best of 3: 7.21 s per loop

In [35]:
!ptdump -v test_sorted_compressed.h5
/ (RootGroup) ''
/df (Group) ''
/df/table (Table(1000000,), shuffle, blosc(5)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [36]:
store = getstore_and_print_table('test_sorted_compressed.h5')
/df/table (Table(1000000,), shuffle, blosc(5)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": Float64Col(shape=(), dflt=0.0, pos=1),
  "B": Float64Col(shape=(), dflt=0.0, pos=2)}
  byteorder := 'little'
  chunkshape := (5461,)
  autoIndex := True
  colindexes := {
    "B": Index(6, full, shuffle, zlib(1)).is_CSI=True}

In [37]:
%timeit store.select('df',['B > 0.5','B < 1.6'])
1 loops, best of 3: 191 ms per loop

In [38]:
store.close()