In [1]:

import pandas as pd
import numpy as np
import pandas_composition.lazy

In [50]:

df = pd.DataFrame(np.random.randn(10000, 5))
lf = df.lazy()

In [51]:

df

Out[51]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 5 columns):
0    10000  non-null values
1    10000  non-null values
2    10000  non-null values
3    10000  non-null values
4    10000  non-null values
dtypes: float64(5)

Deferred Operations

Basic math operators such as add, sub, div, mul, and pow are deferred. You can see how the expressions are built by trying some out. Note that _pobjN is used as a placeholder for non-scalar values.

In [52]:

lf

Out[52]:

LazyFrame: _pobj1

In [53]:

lf + 1

Out[53]:

LazyFrame: (_pobj1 + 1)

In [58]:

(lf + 1) / lf - 1

Out[58]:

LazyFrame: (((_pobj1 + 1) / _pobj2) - 1)
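To make the placeholder idea concrete, here is a minimal pure-Python sketch of how operator overloading can build such an expression string. It is not pandas_composition's implementation: the Deferred class, its wrap method, and the operands dictionary are hypothetical names chosen for illustration. Unlike the output above, this toy reuses the same placeholder when the same wrapped object appears twice.

```python
import itertools


class Deferred:
    """Toy deferred expression (hypothetical, not the library's class).

    It records an expression string plus the objects the string refers to.
    """

    _ids = itertools.count(1)  # shared counter used to name placeholders

    def __init__(self, expr, operands):
        self.expr = expr
        self.operands = operands  # placeholder name -> real object

    @classmethod
    def wrap(cls, obj):
        # Non-scalar values get a fresh _pobjN placeholder.
        name = "_pobj%d" % next(cls._ids)
        return cls(name, {name: obj})

    def _binary(self, op, other):
        operands = dict(self.operands)
        if isinstance(other, Deferred):
            operands.update(other.operands)
            rhs = other.expr
        else:
            rhs = repr(other)  # scalars are written into the string directly
        return Deferred("(%s %s %s)" % (self.expr, op, rhs), operands)

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __truediv__(self, other):
        return self._binary("/", other)

    def __repr__(self):
        return "Deferred: " + self.expr


x = Deferred.wrap([1.0, 2.0, 3.0])
print(x + 1)            # Deferred: (_pobj1 + 1)
print((x + 1) / x - 1)  # Deferred: (((_pobj1 + 1) / _pobj1) - 1)
```

Nothing is computed while the expression is being built; each operator only returns a new object with a longer string and a merged operand table.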

Ordering

LazyFrame uses regular Python evaluation, so it follows Python's order of operations.

In [33]:

expr = 2 ** (lf + 3) * 2
expr

Out[33]:

LazyFrame: (((_pobj1 + 3) ** 2) * 2)
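The following sketch shows why no extra work is needed to get operator precedence right: Python itself decides the order in which the operator methods are called, so the recorded expression mirrors it. The Node class is a hypothetical stand-in, not LazyFrame. It also includes reflected operators (such as `__rsub__`), which Python calls when the left operand is a plain scalar.

```python
class Node:
    """Toy expression node that records how Python combined the operands."""

    def __init__(self, text):
        self.text = text

    @staticmethod
    def _as_text(value):
        return value.text if isinstance(value, Node) else repr(value)

    @classmethod
    def _make(cls, left, op, right):
        return cls("(%s %s %s)" % (cls._as_text(left), op, cls._as_text(right)))

    # Normal operators: self is the left operand.
    def __add__(self, other):
        return self._make(self, "+", other)

    def __sub__(self, other):
        return self._make(self, "-", other)

    def __mul__(self, other):
        return self._make(self, "*", other)

    # Reflected operators: self is the right operand, so it goes second.
    def __radd__(self, other):
        return self._make(other, "+", self)

    def __rsub__(self, other):
        return self._make(other, "-", self)

    def __rmul__(self, other):
        return self._make(other, "*", self)


x = Node("x")
print((1 - x * 2 + 3).text)    # ((1 - (x * 2)) + 3)
print((2 * (x + 3) - x).text)  # ((2 * (x + 3)) - x)
```

Multiplication binds tighter than subtraction, so `x * 2` is recorded first even though it appears after `1 -` in the source text.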

eval

eval evaluates the deferred expression by running it through numexpr.

It takes an inplace parameter, which defaults to False.

In [49]:

expr.eval()

Out[49]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 5 entries, 0 to 4
dtypes: float64(5)
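As a rough illustration of the evaluation step, the sketch below resolves an expression string against a mapping from placeholder names to values. It uses Python's built-in eval as a stand-in so that it runs without extra dependencies. numexpr's evaluate function similarly accepts an expression string and a dictionary of operands, but how LazyFrame actually calls it is not shown in this notebook, so the evaluate helper here is purely hypothetical.

```python
def evaluate(expr, operands):
    """Evaluate an expression string against a placeholder -> value mapping.

    Stand-in for a numexpr call; works on plain Python numbers only.
    """
    # Builtins are disabled so that only the supplied placeholders are visible.
    namespace = {"__builtins__": {}}
    namespace.update(operands)
    return eval(expr, namespace)


result = evaluate("(((_pobj1 + 3) ** 2) * 2)", {"_pobj1": 1.5})
print(result)  # 40.5
```

Because the whole expression is available as one string, an engine can compute it in a single pass instead of operator by operator.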

If something requires the LazyFrame to be evaluated and cannot be deferred, LazyFrame will evaluate itself automatically. Normally this is triggered by an attribute access that we don't know how to defer. Examples are:

__array__
columns
values

In [48]:

expr = lf * 2
assert not expr.evaled 
res = expr > 0
assert expr.evaled
res.head()

Out[48]:

	0	1	2	3	4
0	False	False	True	True	False
1	False	True	True	True	True
2	False	True	False	False	False
3	False	True	True	True	True
4	True	False	False	False	False
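One common way to get this kind of automatic evaluation in Python is a `__getattr__` fallback: any attribute the wrapper does not define itself forces the computation and is then looked up on the result. The sketch below assumes that approach. LazyValue and its compute argument are hypothetical, and only the evaled flag is borrowed from the cell above.

```python
class LazyValue:
    """Toy wrapper that postpones a computation until the result is needed."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the result
        self._result = None
        self.evaled = False

    def eval(self):
        # Run the computation at most once and cache the result.
        if not self.evaled:
            self._result = self._compute()
            self.evaled = True
        return self._result

    def __getattr__(self, name):
        # Called only for attributes not found normally, i.e. anything this
        # wrapper does not know how to defer.
        if name.startswith("_"):
            raise AttributeError(name)  # avoid recursion on private names
        return getattr(self.eval(), name)


lazy = LazyValue(lambda: [3, 1, 2] * 2)
assert not lazy.evaled
print(lazy.count(1))  # 2
assert lazy.evaled
```

The first access to `count` triggers the evaluation; later accesses reuse the cached result.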

Performance

Because of the lazy evaluation and the numexpr backend, certain types of operations, such as the chained addition timed below, run faster when deferred.

In [43]:

df = pd.DataFrame(np.random.randn(10000000, 5))
lf = df.lazy()

In [44]:

%timeit df + df + df + df + df

1 loops, best of 3: 969 ms per loop

In [45]:

%timeit (lf + lf + lf + lf + lf).eval()

1 loops, best of 3: 268 ms per loop

In [46]:

correct = df + df + df + df + df
test = (lf + lf + lf + lf + lf).eval()

In [47]:

pd.util.testing.assert_almost_equal(correct, test)

Out[47]:

True