Yesterday I showed how Blaze uses Pandas to query collections of large CSV files. (notebook here). If you're able to drive Pandas intelligently then we can query a 16GB dataset in around 3 minutes. This was about as fast as Postgres after the CSV files had been loaded in (a process that takes about 20 minutes if done right.)
The fact that Pandas+CSV is competitive with Postgres is startling because CSV is a very low-tech and expensive storage solution. If we replace CSV with a faster solution then Pandas is able to really shine.
We compare the performance of our query between a stack of CSV files and a BColz array.
from blaze import *
d = Data('trip_data_*.csv')
expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
count=d.passenger_count.count())
%time _ = compute(expr)
CPU times: user 2min 12s, sys: 14 s, total: 2min 26s Wall time: 3min 10s
d = Data('trip_data.bcolz')
expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
count=d.passenger_count.count())
%time _ = compute(expr)
CPU times: user 16.7 s, sys: 3.91 s, total: 20.6 s Wall time: 20.7 s
Disclaimer: This may have benefitted from OS-level caching. These times range from 20s to 40s.
import multiprocessing
pool = multiprocessing.Pool(4)
%time _ = compute(expr, map=pool.map)
CPU times: user 169 ms, sys: 13.2 ms, total: 182 ms Wall time: 7.26 s
The time difference above strongly motivates the use of efficient binary storage systems. Unfortunately migrating data between formats is usually hard enough that everyday users don't bother.
The into
project eases this process. In principle into
lets us write something as naive as the following:
into('trip.bcolz', 'trip_data_*.csv')
Into works for a variety of formats, not just csv and bcolz.
We can often improve performance by specifying more information. In our case the text columns in our dataset happen to have fixed lengths (they are just hash values.) By specifying these lengths explicilty we help BColz store and retrieve the data more effectively. We specify this additional information with a datashape
specifying the column types exactly.
In just a moment we'll talk about how to automatically detect datashapes.
ds = dshape("""var * {
medallion: string[32, 'ascii'],
hack_license: string[32, 'ascii'],
vendor_id: string[3, 'ascii'],
rate_code: int32,
store_and_fwd_flag: string[1, 'ascii'],
pickup_datetime: datetime,
dropoff_datetime: datetime,
passenger_count: int32,
trip_time_in_secs: int32,
trip_distance: float64,
pickup_longitude: float64,
pickup_latitude: float64,
dropoff_longitude: float64,
dropoff_latitude: float64
}""")
# Drop old version
drop('trip_data.bcolz')
# Migrate data
%time into('trip_data.bcolz', 'trip_data_*.csv', dshape=ds)
CPU times: user 16min 16s, sys: 1min 54s, total: 18min 10s Wall time: 19min 15s
ctable((173179759,), [('medallion', 'S32'), ('hack_license', 'S32'), ('vendor_id', 'S3'), ('rate_code', '<i4'), ('store_and_fwd_flag', 'S1'), ('pickup_datetime', '<M8[us]'), ('dropoff_datetime', '<M8[us]'), ('passenger_count', '<i4'), ('trip_time_in_secs', '<i4'), ('trip_distance', '<f8'), ('pickup_longitude', '<f8'), ('pickup_latitude', '<f8'), ('dropoff_longitude', '<f8'), ('dropoff_latitude', '<f8')]) nbytes: 21.93 GB; cbytes: 16.43 GB; ratio: 1.34 cparams := cparams(clevel=5, shuffle=True, cname='blosclz') rootdir := 'trip_data.bcolz' [ ('89D227B655E5C82AECF13C3F540D4CF4', 'BA96DE419E711691B9445D6A6307C170', 'CMT', 1, 'N', datetime.datetime(2013, 1, 1, 15, 11, 48), datetime.datetime(2013, 1, 1, 15, 18, 10), 4, 382, 1.0, -73.978165, 40.757977000000004, -73.989838, 40.751171) ('0BD7C8F5BA12B88E0B67BED28BEA73D8', '9FD8F69F0804BDB5549F40E9DA1BE472', 'CMT', 1, 'N', datetime.datetime(2013, 1, 6, 0, 18, 35), datetime.datetime(2013, 1, 6, 0, 22, 54), 1, 259, 1.5, -74.006683, 40.731781, -73.994499, 40.750659999999996) ('0BD7C8F5BA12B88E0B67BED28BEA73D8', '9FD8F69F0804BDB5549F40E9DA1BE472', 'CMT', 1, 'N', datetime.datetime(2013, 1, 5, 18, 49, 41), datetime.datetime(2013, 1, 5, 18, 54, 23), 1, 282, 1.1, -74.004707, 40.737770000000005, -74.009834, 40.726002) ..., ('B9751FE71B59FA04240FE5979963185E', 'DD35F09BF4591EA47D56A56D014B12FD', 'CMT', 1, 'N', datetime.datetime(2013, 9, 6, 16, 17, 12), datetime.datetime(2013, 9, 6, 16, 26, 30), 1, 558, 1.4, -73.954323, 40.781231, -73.967674, 40.762943) ('7C7EA825806A49F10411C1F6656E4B6B', '4B12AC33E74C663162B0850FCE5413E7', 'CMT', 1, 'N', datetime.datetime(2013, 9, 7, 8, 40, 49), datetime.datetime(2013, 9, 7, 9, 0, 3), 1, 1154, 10.8, -74.00057199999999, 40.740673, -73.861877, 40.768436) ('BCC1B5B7D9596FEC872C9DAD725310B3', '246735D51D474F0B3A93F431CE5A3127', 'CMT', 1, 'N', datetime.datetime(2013, 9, 6, 9, 47, 50), datetime.datetime(2013, 9, 6, 10, 2, 24), 1, 874, 2.1, -74.014252, 40.717148, -73.993431, 40.733189)]
Don't worry, we didn't write that by hand. The discover
function gives you the datashape of most anything, including our original CSV files
d = Data('trip_data_*.csv')
discover(d)
dshape("""var * { medallion: string, hack_license: string, vendor_id: string, rate_code: int64, store_and_fwd_flag: ?string, pickup_datetime: datetime, dropoff_datetime: datetime, passenger_count: int64, trip_time_in_secs: int64, trip_distance: float64, pickup_longitude: float64, pickup_latitude: float64, dropoff_longitude: float64, dropoff_latitude: float64 }""")
We just copy-pasted this, rewriting string
with string[32, 'ascii']
in a few places.
Last time we showed that your laptop can easily process 10+GB files using Pandas+ElbowGrease or Pandas+Blaze .
Here we show that using EfficientStorage+Pandas+Blaze your single laptop can process 10+GB files in seconds. Data in this range is quite convenient if you or your tools manage things well.