Often, we want to create a copy of a dictionary and update it. For example, we have a default set of parameters. We want to update them, but without disturbing the original parameter list.
There are two ways of doing this:

1. Make a copy of the dictionary and update it.
2. Add the .items() lists of both dictionaries, and make a new dict out of it.

Here is the benchmark:
                              update    append
1. make a copy and update     1.58µs    1.66µs
2. add .items() and dictify   2.81µs    3.36µs
base = {x: x for x in range(20)}
same = {x: x for x in range(20)}        # same keys as base
incr = {x: x for x in range(20, 40)}    # new keys to append

# 1. Make a copy and update it
%timeit y=dict(base); y.update(same)
%timeit y=dict(base); y.update(incr)

# 2. Add the .items() lists and dictify (Python 2: .items() returns a list)
base = base.items()
same = same.items()
incr = incr.items()
%timeit dict(base + same)
%timeit dict(base + incr)
1000000 loops, best of 3: 1.58 µs per loop
1000000 loops, best of 3: 1.66 µs per loop
100000 loops, best of 3: 2.81 µs per loop
100000 loops, best of 3: 3.36 µs per loop
Array joins are faster than successive appending
              1,000     10,000    100,000   1,000,000
appending     226µs     2.87ms    15.8ms    484ms
array joins   116µs     1.11ms    11.2ms    146ms
def string_append(s, count):
    # Build the string by repeated concatenation
    result = ''
    for x in range(count):
        result += s
    return result

def array_join(s, count):
    # Collect the pieces in a list and join them once at the end
    result = []
    for x in range(count):
        result.append(s)
    return ''.join(result)

for count in [1000, 10000, 100000, 1000000]:
    print count, 'concatenations'
    %timeit string_append('abc', count)
    %timeit array_join('abc', count)
1000 concatenations
1000 loops, best of 3: 226 µs per loop
10000 loops, best of 3: 116 µs per loop
10000 concatenations
1000 loops, best of 3: 2.87 ms per loop
1000 loops, best of 3: 1.11 ms per loop
100000 concatenations
100 loops, best of 3: 15.8 ms per loop
100 loops, best of 3: 11.2 ms per loop
1000000 concatenations
1 loops, best of 3: 484 ms per loop
10 loops, best of 3: 146 ms per loop
Date parsing in Python is quite slow, especially for large arrays. Here's a benchmark of various approaches.
to_datetime: 7740 ms
dateutil: 6970 ms
strptime: 1660 ms
manual: 253 ms
lookup: 9 ms
Manually parsing the dates by string slicing is significantly faster. If there aren't many unique dates, lookups are MUCH faster.
import time
import datetime
import dateutil.parser
import pandas as pd
s = pd.Series(['01-31-2012']*100000)
# Use Pandas' built-in to_datetime
%timeit pd.to_datetime(s)
1 loops, best of 3: 7.74 s per loop
# Use dateutil.parser
%timeit s.apply(dateutil.parser.parse)
1 loops, best of 3: 6.97 s per loop
# Parse using datetime.strptime
%timeit s.apply(lambda v: datetime.datetime.strptime(v, '%m-%d-%Y'))
1 loops, best of 3: 1.66 s per loop
# Manually parse the date
%timeit s.apply(lambda v: datetime.datetime(int(v[6:10]), int(v[0:2]), int(v[3:5])))
1 loops, best of 3: 253 ms per loop
def lookup(s):
"""
This is an extremely fast approach to datetime parsing.
For large data, the same dates are often repeated. Rather than
re-parse these, we store all unique dates, parse them, and
use a lookup to convert all dates.
"""
return s.map({date:pd.to_datetime(date) for date in s.unique()})
%timeit lookup(s)
100 loops, best of 3: 9.23 ms per loop
Mean is much (~100 times) faster to calculate than median.
import time
import numpy
data = numpy.random.rand(50000000)
%timeit numpy.mean(data)
10 loops, best of 3: 61.7 ms per loop
%timeit numpy.median(data)
1 loops, best of 3: 6.13 s per loop
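The gap is algorithmic: the mean is a single pass over the data, while the median has to select the middle element, which needs a partial sort. A minimal sketch of the extra work on the same data array (an illustration, not NumPy's exact implementation):
# Selecting the middle element via a partial sort -- roughly what
# numpy.median does internally. For an even-length array numpy.median
# averages the two middle elements; this sketch just picks one.
mid = data.size // 2
partial = numpy.partition(data, mid)     # the costly step
median_estimate = partial[mid]

mean_value = data.sum() / data.size      # one pass and a division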
HDF5 is the fastest way of reading tabular data.
csv.DictReader:       2.78 s
pickle:               2.41 s
json:                 2.39 s
json-array:           799 ms
csv.reader:           478 ms
pd.read_csv:          355 ms
pd.read_pickle:       319 ms
pd.read_hdf (table):  169 ms
pd.read_hdf (stored): 123 ms
# First, create a set of data files
words = 'ad adipisicing aliqua aliquip amet anim aute cillum commodo consectetur consequat culpa cupidatat deserunt do dolor dolore duis ea eiusmod elit enim esse est et eu ex excepteur exercitation fugiat id in incididunt ipsum irure labore laboris laborum lorem magna minim mollit nisi non nostrud nulla occaecat officia pariatur proident qui quis reprehenderit sed sint sit sunt tempor ullamco ut velit veniam voluptate'.split()
# Create the data in memory
data = []
for row in range(0, 1000000):
data.append({
'A': words[row % len(words)],
'B': chr(64 + (row % 62)),
'C': row,
'D': row + 1,
'E': row + 2,
'F': row + 3,
})
# Save CSV
import csv
keys = sorted(data[0].keys())
out = csv.DictWriter(open('sample.data.csv', 'w'),
fieldnames=keys,
lineterminator='\n')
out.writerow(dict(zip(keys, keys)))
out.writerows(data)
# Save JSON
import json
json.dump(data, open('sample.data.json', 'w'), separators= (',', ':'))
# Save JSON-array
import json
json.dump([data[0].keys()] + [row.values() for row in data],
open('sample.data-array.json', 'w'),
separators= (',', ':'))
# Save pickle
import cPickle as pickle
pickle.dump(data, open('sample.data.pickle', 'wb'), pickle.HIGHEST_PROTOCOL)
# Save pandas pickle
import pandas as pd
df = pd.DataFrame(data, columns=data[0].keys())
df.to_pickle('sample.data.pandas')
# Save HDF5
df.to_hdf('sample.data.h5', 'stored')
df.to_hdf('sample.data.h5', 'table', table=True)
import time
import csv
import json
import cPickle as pickle
import pandas as pd
%timeit list(csv.DictReader(open('sample.data.csv')))
%timeit pickle.load(open('sample.data.pickle', 'rb'))
%timeit json.load(open('sample.data.json'))
%timeit json.load(open('sample.data-array.json'))
%timeit list(csv.reader(open('sample.data.csv')))
%timeit pd.read_csv('sample.data.csv')
%timeit pd.read_pickle('sample.data.pandas')
%timeit pd.read_hdf('sample.data.h5', 'table')
%timeit pd.read_hdf('sample.data.h5', 'stored')
1 loops, best of 3: 2.78 s per loop
1 loops, best of 3: 2.41 s per loop
1 loops, best of 3: 2.39 s per loop
1 loops, best of 3: 799 ms per loop
1 loops, best of 3: 478 ms per loop
1 loops, best of 3: 355 ms per loop
1 loops, best of 3: 319 ms per loop
10 loops, best of 3: 169 ms per loop
10 loops, best of 3: 123 ms per loop
This is the time taken to generate a bar chart, in µs. The output could either be xml (etree) or text.
              xml    text
template       68      35
lxml           73      87
cElementTree   23     247
For string output, tornado templates are extremely fast. Even for etree, it's faster than using lxml directly. However, if you only want etree output and not string, cElementTree is faster.
To me, the template approach with lxml.fromstring appears optimal.
from IPython.display import HTML
from tornado import template
using_template = template.Template('''
<svg width="100" height="50">
{% for i, x in enumerate(series) %}
<rect x="{{ 10 * i }}" width="10" y="{{ 50 - 10 * x }}" height="{{ 10 * x }}" fill="#88f" stroke="#fff"/>
{% end %}
</svg>
''', autoescape=None).generate
HTML(using_template(series=[1,2,3,4,3,2,1]))
from lxml import etree
def using_lxml(series):
root = etree.Element('svg', width="100", height="50")
for i, x in enumerate(series):
rect = etree.SubElement(root, 'rect',
x = '%d' % (10 * i),
width = '10',
y = '%d' % (50 - 10 * x),
height = '%d' % (10 * x),
fill = '#88f',
stroke = '#fff')
return root
HTML(etree.tostring(using_lxml(series=[1,2,3,4,3,2,1])))
import xml.etree.cElementTree as cElementTree
def using_cElementTree(series):
root = cElementTree.Element('svg', width="100", height="50")
for i, x in enumerate(series):
rect = cElementTree.SubElement(root, 'rect',
x = '%d' % (10 * i),
width = '10',
y = '%d' % (50 - 10 * x),
height = '%d' % (10 * x),
fill = '#88f',
stroke = '#fff')
return root
HTML(cElementTree.tostring(using_cElementTree(series=[1,2,3,4,3,2,1])))
# Create etree output
%timeit etree.fromstring(using_template(series=[1,2,3,4,3,2,1]))
%timeit using_lxml(series=[1,2,3,4,3,2,1])
%timeit using_cElementTree(series=[1,2,3,4,3,2,1])
# Create string output
%timeit using_template(series=[1,2,3,4,3,2,1])
%timeit etree.tostring(using_lxml(series=[1,2,3,4,3,2,1]))
%timeit cElementTree.tostring(using_cElementTree(series=[1,2,3,4,3,2,1]))
10000 loops, best of 3: 67.5 µs per loop
10000 loops, best of 3: 72.8 µs per loop
10000 loops, best of 3: 22.6 µs per loop
10000 loops, best of 3: 34.6 µs per loop
10000 loops, best of 3: 87.2 µs per loop
1000 loops, best of 3: 247 µs per loop
Here, we're trying to find where a value fits in a list of numbers. For example, in the list [1, 3, 7, 9], the number 4 would be just after the 2nd element 3.
The summary is: use numpy.searchsorted() -- it's blazingly fast.
37,000 µs For loop
5,790 µs Numpy filtering
2,270 µs Numpy filtering on sorted values
1,850 µs Numpy index search on sorted values
1 µs numpy.searchsorted()
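For instance, with the list from the example above, numpy.searchsorted() returns that insertion index directly:
import numpy
# 4 would be inserted at index 2 to keep the list sorted,
# i.e. just after the 2nd element (3)
numpy.searchsorted([1, 3, 7, 9], 4)
# 2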
Having read this post on Ruby being slow, I thought I'd check the same with Python. I got it running fairly fast, but there was one piece that was taking a fair bit of time: counting numbers in a range. Here's the slow version:
values = range(1000000)
def count(values, a, b):
count = 0
for x in values:
if a <= x <= b:
count += 1
return count
%timeit count(values, 250000, 750000)
10 loops, best of 3: 37 ms per loop
Of course, running a loop in Python for numbers is never a good idea. Let's move this to NumPy.
values = numpy.random.rand(1000000)
%timeit ((.25 <= values) & (values <= .75)).sum()
100 loops, best of 3: 5.79 ms per loop
That's not bad, but it could get a lot better. First, let's sort the values and try it.
values.sort()
%timeit ((.25 <= values) & (values <= .75)).sum()
100 loops, best of 3: 2.27 ms per loop
Just like that, it's faster. But we can do much better. Given that it's already sorted, what if we just found the index?
%timeit (values <= .75).argmin() - (.25 <= values).argmax()
1000 loops, best of 3: 1.85 ms per loop
A bit faster. It's wasteful of memory, though -- having to create 2 new arrays just to find the position of these two numbers. What if we searched for these?
%timeit numpy.searchsorted(values, .75) - numpy.searchsorted(values, .25)
1000000 loops, best of 3: 1.45 µs per loop
That's 1.45 microseconds. It's 25 thousand times faster than the original code, and four thousand times faster than the original NumPy code.
If there's one thing I keep re-learning, it's that there's always a faster way of doing it, and if you really want to, you'll probably find it.
The next power of 10 for 4 is 10^1. For 40, it's 10^2. For 400, it's 10^3. For 0.04, it's 10^-1. And so on.
Most methods of calculating it are fast enough.
data = pd.Series(10 ** (6 * np.random.rand(10000) - 3))
def iterative(v):
i = 1
if v > 1:
n = 0
while i < v:
i, n = i * 10, n + 1
else:
n = 1
while i > v:
i, n = i / 10., n - 1
return n
%timeit data.apply(iterative)
100 loops, best of 3: 16.3 ms per loop
%timeit numpy.ceil(numpy.log10(data))
10000 loops, best of 3: 124 µs per loop
%timeit data.apply(lambda v: numpy.ceil(numpy.log10(v)))
100 loops, best of 3: 19.2 ms per loop
Given a DataFrame like this:
A B C val
X X X 0
X X Y 1
X Y X 2
X Y Y 3
Y X X 4
Y X Y 5
Y Y X 6
Y Y Y 7
... create a DataFrame like this, with subtotals.
A B C val level
na na na 28 0
X na na 6 1
X X na 1 2
X X X 0 3
X X Y 1 3
X Y na 5 2
X Y X 2 3
X Y Y 3 3
Y na na 22 1
Y X na 9 2
Y X X 4 3
Y X Y 5 3
Y Y na 13 2
Y Y X 6 3
Y Y Y 7 3
data = pd.DataFrame({'A': list('XXXXYYYY'), 'B': list('XXYYXXYY'), 'C': list('XYXYXYXY'), 'val': range(8)})
groups = ['A', 'B', 'C']
def subtotal(data, groups, agg):
frames = []
for level in range(1, 1 + len(groups)):
frame = data.groupby(groups[:level], sort=False, as_index=False).agg(agg)
frame['level'] = level
frames.append(frame)
df = pd.concat(frames)
for group in groups:
df[group].fillna('', inplace=True)
return df.sort(groups).set_index(groups)
print subtotal(data, groups=groups, agg={'val': 'sum'})
       level  val
A B C
X          1    6
  X        2    1
    X      3    0
    Y      3    1
  Y        2    5
    X      3    2
    Y      3    3
Y          1   22
  X        2    9
    X      3    4
    Y      3    5
  Y        2   13
    X      3    6
    Y      3    7
This is faster than the existing layout.hierarchy function.
import layout
odi = pd.read_csv('d:/site/gramener.com/viz/autolyse/data/odi-batting.csv', dtype={'Runs':float})
groups = ['Weekday', 'Country', 'Player']
agg = {'Runs': 'sum'}
%timeit subtotal(odi, groups, agg)
%timeit list(layout.hierarchy(odi, groups, agg=agg, size=lambda df: df['Runs'].sum()))
10 loops, best of 3: 23.9 ms per loop
1 loops, best of 3: 294 ms per loop
I'm trying to see how fast numba is. autojit(fn) makes fn faster. numpy.sum and @autojit take about the same time. Python loops are much slower. Looks like @autojit is a decent replacement for numpy.vectorize.
from numba import autojit
def slow_sum(arr):
M, N = arr.shape
result = 0.0
for i in range(M):
for j in range(N):
result += arr[i,j]
return result
fast_sum = autojit(slow_sum)
%timeit numpy.sum(numpy.random.rand(1000,1000))
%timeit fast_sum(numpy.random.rand(1000,1000))
%timeit slow_sum(numpy.random.rand(1000,1000))
10 loops, best of 3: 20.5 ms per loop
1 loops, best of 3: 25 ms per loop
1 loops, best of 3: 426 ms per loop
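As a rough sketch of the numpy.vectorize comparison hinted at above (this example is not from the original benchmark; clip01 is a made-up elementwise function):
import numpy
from numba import autojit

def clip01(x):
    # A toy elementwise function: clamp a value into the [0, 1] range
    return 0.0 if x < 0 else (1.0 if x > 1 else x)

# numpy.vectorize wraps the scalar function in a Python-level loop
vectorized_clip = numpy.vectorize(clip01)

@autojit
def jit_clip(arr):
    # The same operation as an explicit loop, compiled by numba
    result = numpy.empty_like(arr)
    for i in range(arr.size):
        result[i] = 0.0 if arr[i] < 0 else (1.0 if arr[i] > 1 else arr[i])
    return result

data = numpy.random.randn(1000000)
%timeit vectorized_clip(data)
%timeit jit_clip(data)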
If you have an array like this: [3,4,5,6,7], and you want to shift it to the right, dropping the last value and filling the left with NaN: [nan,3,4,5,6], what's the fastest way?
Answer:
result = numpy.roll(array, 1)
result[0] = numpy.nan
data = numpy.random.rand(1001)
%timeit result = numpy.insert(data, 0, numpy.nan)[:-1]
%timeit result = numpy.roll(data, 1); result[:1] = numpy.nan
10000 loops, best of 3: 127 µs per loop
100000 loops, best of 3: 14.3 µs per loop
A Voronoi diagram takes a set of points and creates polygons enclosing the space closer to each point than to any other. It is the dual of the Delaunay triangulation, which matplotlib and scipy provide by default, and it can also be constructed from the triangulation directly using NumPy.
Here's the speed of generating it via various methods:
Method Time (10K) Time (100K)
matplotlib.delaunay.triangulate.Triangulation 16.5ms 222ms
voronoi() using the above 41.9ms 793ms
scipy.spatial.Delaunay 51.4ms 797ms
import numpy
scale = .9
small = (1 - scale)/2 + scale * numpy.random.rand(2, 10000)
large = (1 - scale)/2 + scale * numpy.random.rand(2, 100000)
import matplotlib.delaunay.triangulate as tri
%timeit tri.Triangulation(*small)
%timeit tri.Triangulation(*large)
D:\anaconda\2.7\lib\site-packages\matplotlib\cbook.py:137: MatplotlibDeprecationWarning: The matplotlib.delaunay module was deprecated in version 1.4. Use matplotlib.tri.Triangulation instead. warnings.warn(message, mplDeprecation, stacklevel=1)
100 loops, best of 3: 19.5 ms per loop
1 loop, best of 3: 239 ms per loop
def voronoi(X, Y):
''' Return line segments describing the voronoi diagram of X and Y '''
# Get the points X, Y into a matrix P.
P = numpy.zeros((X.size+4, 2))
P[:X.size, 0], P[:Y.size, 1] = X, Y
# Add four points at (pseudo) "infinity"
m = max(numpy.abs(X).max(), numpy.abs(Y).max()) * 1e5
P[X.size:, 0] = -m, -m, +m, +m
P[Y.size:, 1] = -m, +m, -m, +m
# Delaunay triangulate, and get the circumcenters
D = tri.Triangulation(P[:, 0], P[:, 1])
C = D.circumcenters
# D.triangle_neighbours = 3 neighbours.
# Each neighbourhood represents a line.
n = len(C)
tgt = D.triangle_neighbors
src = (numpy.zeros_like(tgt).T + numpy.arange(n)).T
# Remove all -1s
positives = tgt >= 0
n = positives.sum()
src = src[positives].reshape(n)
tgt = tgt[positives].reshape(n)
# TODO: Clip to get polygons
# --------------------------
# Get areas
# ---------
# http://www.mathopenref.com/coordpolygonarea.html
csrc = C[src]
ctgt = C[tgt]
areas = csrc[:,0] * ctgt[:,1] - csrc[:,1] * ctgt[:,0]
# print areas
# Now add up the areas by the indices given in src
# Get the circumcenters
return numpy.concatenate((C[tgt].reshape(n, 1, 2), C[src].reshape(n, 1, 2)), axis=1)
%timeit voronoi(small[0,:], small[1,:])
%timeit voronoi(large[0,:], large[1,:])
The slowest run took 5.87 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 39.6 ms per loop
1 loop, best of 3: 701 ms per loop
from scipy.spatial import Voronoi
%timeit Voronoi(small.T)
%timeit Voronoi(large.T)
10 loops, best of 3: 90 ms per loop
1 loop, best of 3: 1.29 s per loop
Which of SQLite, PostgreSQL and HDF5 (PyTables) has the faster insert performance? Which has the faster read performance? This is specifically for a key-value index.
import random
words = 'ad adipisicing aliqua aliquip amet anim aute cillum commodo consectetur consequat culpa cupidatat deserunt do dolor dolore duis ea eiusmod elit enim esse est et eu ex excepteur exercitation fugiat id in incididunt ipsum irure labore laboris laborum lorem magna minim mollit nisi non nostrud nulla occaecat officia pariatur proident qui quis reprehenderit sed sint sit sunt tempor ullamco ut velit veniam voluptate'.split()
def get_data(size, keylen=5, vallen=10):
'Return an array of random key, int, str combinations'
result = []
hi = len(words) - 1
keys = set()
for index in range(size):
while True:
key = ' '.join(words[random.randint(0, hi)] for i in range(keylen))
if key not in keys:
break
keys.add(key)
num = random.randint(0, 10000000)
val = ' '.join(words[random.randint(0, hi)] for i in range(vallen))
result.append([key, num, val])
return result
import time
import sqlite3
def insert_sqlite3(data, drop=True):
conn = sqlite3.connect('.test.sqlite3')
try:
if drop:
conn.execute('DROP TABLE IF EXISTS test')
conn.execute('CREATE TABLE IF NOT EXISTS test (k TEXT, n INTEGER, v TEXT, PRIMARY KEY(k))')
start = time.time()
conn.executemany('INSERT INTO test VALUES (?, ?, ?)', data)
conn.commit()
return time.time() - start
finally:
conn.close()
import time
import psycopg2
def insert_postgres(data, drop=True):
conn = psycopg2.connect('host=localhost dbname=test user=postgres')
try:
cur = conn.cursor()
if drop:
cur.execute('DROP TABLE IF EXISTS test')
cur.execute('CREATE TABLE IF NOT EXISTS test (k VARCHAR(70), n INTEGER, v VARCHAR(300), PRIMARY KEY(k))')
conn.commit()
start = time.time()
cur = conn.cursor()
cur.executemany('INSERT INTO test VALUES (%s, %s, %s)', data)
conn.commit()
return time.time() - start
finally:
cur.close()
conn.close()
import time
import tables
class Test(tables.IsDescription):
k = tables.StringCol(itemsize=70, pos=0)
    n = tables.Int32Col(pos=1)   # Int32: the random values go up to 10,000,000
v = tables.StringCol(itemsize=300, pos=2)
def insert_hdf5(data, drop=True):
handle = tables.open_file('.test.h5', mode='w')
try:
root = handle.root
table = handle.create_table(root, 'test', Test)
insert = table.row
start = time.time()
for row in data:
insert['k'], insert['n'], insert['v'] = row
            insert.append()
table.flush()
return time.time() - start
finally:
handle.close()
print(' size sqlite3 postgres hdf5')
for size in (1, 10, 100, 1000, 10000, 25000):
print(
'% 6d' % size,
'% 7.1fms' % (insert_sqlite3(get_data(size), drop=True) * 1000),
'% 7.1fms' % (insert_postgres(get_data(size), drop=True) * 1000),
'% 7.1fms' % (insert_hdf5(get_data(size), drop=True) * 1000)
)
  size  sqlite3  postgres    hdf5
     1  289.3ms     0.0ms   0.0ms
    10  192.9ms    44.3ms   0.0ms
   100  200.7ms    31.3ms   0.0ms
  1000  278.5ms   169.1ms   1.5ms
 10000  435.5ms  1265.0ms  10.5ms
 25000  648.7ms  3539.8ms  59.1ms
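The benchmark above only measures inserts. A minimal sketch of a read benchmark on the same key-value index, reusing the SQLite table and get_data() from above (this is an illustration, not part of the original benchmark):
import time
import random
import sqlite3

def read_sqlite3(keys):
    # Look up each key against the primary-key index and time the total
    conn = sqlite3.connect('.test.sqlite3')
    try:
        start = time.time()
        for key in keys:
            conn.execute('SELECT n, v FROM test WHERE k = ?', (key,)).fetchone()
        return time.time() - start
    finally:
        conn.close()

# Example: insert 25,000 rows, then read back a random sample of 1,000 keys
data = get_data(25000)
insert_sqlite3(data, drop=True)
keys = [row[0] for row in random.sample(data, 1000)]
print('% 7.1fms' % (read_sqlite3(keys) * 1000))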
When it comes to data, the performance of the CPU/Memory, disk and database are key. This script measures how fast your system performs on these parameters.
The whole script should run under a minute on most reasonably fast systems.
from __future__ import print_function
import io
import time
import sqlalchemy
import numpy as np
import pandas as pd
from pathlib import Path
class Timer:
def __init__(self, msg):
self.msg = msg
def __enter__(self):
self.start = time.clock()
def __exit__(self, *args):
self.end = time.clock()
print('{:0.3f}s {:s}'.format(self.end - self.start, self.msg))
This is a pure numerical computation on values in memory: it computes the eigenvalues of a random 1000 x 1000 matrix.
# Initialise the same data every time
np.random.seed(0)
data = np.random.random((1000, 1000))
# Time the computation
with Timer('computation'):
np.linalg.eig(data)
1.751s computation
Let's time sequential writes and reads on the disk.
The best way to do this is via disktt on Windows and dd on Linux.
Below is a crude approximation in Python. Note: this is heavily influenced by OS disk caching.
# Change this to any folder in the drive you want to test
folder = Path('D:/')
# Test the speed of the hard disk at this folder
# ... with this string data
data = b'0123456789' * 100000000    # about 1GB of data
# Run the test
path = folder / 'tempfile'
with path.open(mode='wb', buffering=0) as handle:
with Timer('sequential disk write'):
handle.write(data)
with path.open(mode='rb', buffering=0) as handle:
with Timer('sequential disk read'):
handle.read()
path.unlink()
12.596s sequential disk write
0.695s sequential disk read
This script tests the speed of MySQL. It assumes a MySQL instance running on localhost, with a database called test accessible to user root with no password. You can change the connection string based on your configuration.
# This is for a local MySQL database called test that you can connect to as root with no password
engine = sqlalchemy.create_engine('mysql+pymysql://root@localhost/dbtest')
# Test the connection
connection = engine.connect()
# Setup the data structures
data = pd.DataFrame(np.random.randint(0, 1000, (1000000, 3)))
metadata = sqlalchemy.MetaData(bind=engine)
metadata.reflect()
# Drop benchmark table
if 'benchmark' in metadata.tables:
metadata.tables['benchmark'].drop()
# Create benchmark table again as MyISAM
table = sqlalchemy.Table(
'benchmark', metadata,
sqlalchemy.Column('0', sqlalchemy.Integer),
sqlalchemy.Column('1', sqlalchemy.Integer),
sqlalchemy.Column('2', sqlalchemy.Integer),
extend_existing=True,
mysql_engine='MyISAM',
)
metadata.create_all()
with Timer('database write'):
data.to_sql('benchmark', con=engine, if_exists='append', index=False)
with Timer('database read'):
data = pd.read_sql('benchmark', con=engine)
39.400s database write
19.366s database read
Pandas processes categories much faster than text. For this sample:
                   text     categories
.groupby() time    591ms    74ms          Categories are ~8X faster in this case
Memory usage       512MB    86MB          Categories are much smaller (based on text length)
import pandas as pd
cat1 = ['Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta']
n = 10000000
text = pd.DataFrame({
'cat1': pd.Series(pd.np.random.randint(0, len(cat1), n)).map(dict(enumerate(cat1))),
'val': pd.np.random.rand(n)
})
cats = text.copy()
cats['cat1'] = cats['cat1'].astype('category')
%timeit text.groupby('cat1')['val'].sum()
%timeit cats.groupby('cat1')['val'].sum()
1 loop, best of 3: 591 ms per loop
10 loops, best of 3: 73.8 ms per loop
text.info(memory_usage='deep')
cats.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
cat1    object
val     float64
dtypes: float64(1), object(1)
memory usage: 512.3 MB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
cat1    category
val     float64
dtypes: category(1), float64(1)
memory usage: 85.8 MB
Observations on the performance of a SELECT category, COUNT(id) FROM table GROUP BY category query:

- An index on category does not speed it up (1.5s)
- Using VARCHAR instead of TEXT does not speed it up
- Use RESET QUERY CACHE to reset the cache
- To move a large table into an in-memory table:
SET GLOBAL tmp_table_size = 1024 * 1024 * 1024 * 2;
SET GLOBAL max_heap_table_size = 1024 * 1024 * 1024 * 2;
# Now disconnect and reconnect
CREATE TABLE mem LIKE inr_import;
ALTER TABLE mem ENGINE=MEMORY;
INSERT INTO mem SELECT * FROM inr_import;
PostgreSQL:

- An index on category does not speed it up
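As a minimal sketch of how such a query could be timed from Python with SQLAlchemy (the connection string, the inr_import table and the column names here are assumptions for illustration):
import time
import sqlalchemy

# Assumed connection string and table name -- adjust to your setup
engine = sqlalchemy.create_engine('mysql+pymysql://root@localhost/test')
query = 'SELECT category, COUNT(id) FROM inr_import GROUP BY category'

with engine.connect() as conn:
    start = time.time()
    conn.execute(sqlalchemy.text(query)).fetchall()
    print('{:0.3f}s {}'.format(time.time() - start, query))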