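import numpy as np
import pandas as pd

def getstore_and_print_table(fname):
    """Open an HDFStore and pretty-print the PyTables table stored under 'df'."""
    import pprint
    store = pd.HDFStore(fname)
    pprint.pprint(store.get_storer('df').group.table)
    return store

# One million rows, two columns of standard-normal floats
df = pd.DataFrame(np.random.randn(int(1e6), 2), columns=list('AB'))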
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',format='table', index=True)
%%timeit
df.to_hdf('test.h5','df',data_columns=['A','B'],mode='w',format='table', index=False)
store = getstore_and_print_table('test.h5')
store
Also, the fact that the selection works at all must have something to do with the data_columns, even though I created the table with index=False.
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
%timeit store.select('df',['A<0.5','A>0.0'])
%timeit store.create_table_index('df',columns=['B'],kind='full')
store.get_storer('df').group.table
No improvement just from creating an index. IIUC, that's because the data_columns, which were miraculously created even though I saved with index=False, had already created an index automatically, just not a 'full' one as required by ptrepack --sortby:
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
%timeit store.select('df',['A<0.5','A>0.0'])
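Rather than inferring the index state from timings, it may help to ask PyTables directly which columns carry an index and of what kind. The following is a minimal sketch, assuming PyTables is installed and the table was written in pandas' table format; the helper name `index_kinds` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def index_kinds(path, key="df"):
    """Map each top-level column of the stored table to its PyTables index
    kind (e.g. 'medium' or 'full'), or None if the column has no index.

    Hypothetical helper for illustration; not a pandas or PyTables function.
    """
    with pd.HDFStore(path, mode="r") as store:
        table = store.get_storer(key).table  # underlying tables.Table
        kinds = {}
        for name in table.colnames:
            column = table.cols._f_col(name)
            # Column.index is None for an unindexed column
            kinds[name] = column.index.kind if column.is_indexed else None
        return kinds
```

Calling `index_kinds('test.h5')` before and after `store.create_table_index('df', columns=['B'], kind='full')` (with the store closed in between) should show whether column B really ended up with a 'full' index.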
store.close()
!ptdump -v test.h5
%timeit !ptrepack --chunkshape=auto --sortby=B -o test.h5 test_sorted_noprop.h5
!ptdump -v test_sorted_noprop.h5
store = getstore_and_print_table('test_sorted_noprop.h5')
store
%timeit store.select('df',['B > 0.5', 'B < 1.6'])
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError: %s" % e)
store.close()
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes -o test.h5 test_sorted.h5
!ptdump -v test_sorted.h5
store = getstore_and_print_table('test_sorted.h5')
%timeit store.select('df',['B > 0.5','B < 1.6'])
try:
    %timeit store.select('df',['A<0.5','A>0.0'])
except ValueError as e:
    print("ValueError: %s" % e)
store.close()
Compression (see next; done here at level 5, I also tested 9) doesn't make the timing much worse, but certainly not better either. Interestingly, it took only marginally longer than the ptrepack with --propindexes, which means that option is dominating the ptrepacking time. The file sizes for my examples were:
So my conclusion is that doing things without an index at data-collection time saves a lot of time, and even quite some space, without even using compression!
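To compare the file sizes of the variants produced in this notebook, a small helper along these lines could be used. It is only a sketch: `file_sizes_mb` is a hypothetical name, and files that do not exist are simply skipped.

```python
import os

def file_sizes_mb(paths):
    """Return {path: size in megabytes (10**6 bytes)} for every path that exists.

    Hypothetical helper for illustration only.
    """
    return {p: os.path.getsize(p) / 1e6 for p in paths if os.path.exists(p)}

# File names as used in this notebook
candidates = ['test.h5', 'test_sorted_noprop.h5',
              'test_sorted.h5', 'test_sorted_compressed.h5']
for name, size in sorted(file_sizes_mb(candidates).items()):
    print("%-28s %8.1f MB" % (name, size))
```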
%timeit !ptrepack --chunkshape=auto --sortby=B --propindexes --complib=blosc --complevel=5 -o test.h5 test_sorted_compressed.h5
!ptdump -v test_sorted_compressed.h5
store = getstore_and_print_table('test_sorted_compressed.h5')
%timeit store.select('df',['B > 0.5','B < 1.6'])
store.close()
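The three ptrepack runs above differ only in their flags, so outside IPython they could be scripted through `subprocess`. The sketch below only uses the flags that already appear in this notebook; the function names `ptrepack_command` and `run_ptrepack` are hypothetical, and it assumes the `ptrepack` executable is on the PATH.

```python
import subprocess

def ptrepack_command(src, dst, sortby, propindexes=False,
                     complib=None, complevel=None):
    """Build the ptrepack argument list used in this notebook.

    Hypothetical helper; it only assembles flags that appear in the cells above.
    """
    cmd = ["ptrepack", "--chunkshape=auto", "--sortby=%s" % sortby]
    if propindexes:
        cmd.append("--propindexes")  # carry the indexes over to the new file
    if complib is not None:
        cmd.append("--complib=%s" % complib)
    if complevel is not None:
        cmd.append("--complevel=%d" % complevel)
    cmd += ["-o", src, dst]  # -o: overwrite the destination if it exists
    return cmd

def run_ptrepack(*args, **kwargs):
    """Run ptrepack and raise CalledProcessError on a non-zero exit status."""
    subprocess.check_call(ptrepack_command(*args, **kwargs))
```

For example, `run_ptrepack('test.h5', 'test_sorted_compressed.h5', 'B', propindexes=True, complib='blosc', complevel=5)` corresponds to the compressed run above.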