import pandas as pd
import numpy as np
import pandas_composition.lazy
df = pd.DataFrame(np.random.randn(10000, 5))
lf = df.lazy()
df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 5 columns):
0    10000  non-null values
1    10000  non-null values
2    10000  non-null values
3    10000  non-null values
4    10000  non-null values
dtypes: float64(5)
Basic math operators such as add, sub, div, mul, and pow are deferred. You can see how the expressions are built by trying some out. Note that _pobjN is used as a placeholder for non-scalar values.
lf
lf + 1
(lf + 1) / lf - 1
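To make the deferral idea concrete, here is a minimal sketch of how operator overloading can build an expression string instead of computing a result. The Deferred class, its wrap constructor, and the global counter are hypothetical names for illustration only; this is not LazyFrame's actual implementation, and only the _pobjN placeholder naming is borrowed from the text above.

```python
import itertools

_counter = itertools.count()  # hypothetical source of placeholder numbers


class Deferred:
    """Toy stand-in for a lazy frame: records operations instead of running them."""

    def __init__(self, expr, operands):
        self.expr = expr          # expression string built so far
        self.operands = operands  # placeholder name -> non-scalar value

    @classmethod
    def wrap(cls, value):
        # Each wrapped non-scalar value gets its own _pobjN placeholder.
        name = "_pobj%d" % next(_counter)
        return cls(name, {name: value})

    def _binary(self, other, op, reflected=False):
        operands = dict(self.operands)
        if isinstance(other, Deferred):
            operands.update(other.operands)
            other_expr = other.expr
        else:
            other_expr = repr(other)  # scalars are inlined into the string
        left, right = (other_expr, self.expr) if reflected else (self.expr, other_expr)
        return Deferred("(%s %s %s)" % (left, op, right), operands)

    def __repr__(self):
        return self.expr


def _make(op, reflected=False):
    def method(self, other):
        return self._binary(other, op, reflected)
    return method


# Install the forward and reflected versions of each deferred operator.
for _name, _op in [("add", "+"), ("sub", "-"), ("mul", "*"),
                   ("truediv", "/"), ("pow", "**")]:
    setattr(Deferred, "__%s__" % _name, _make(_op))
    setattr(Deferred, "__r%s__" % _name, _make(_op, reflected=True))

lf = Deferred.wrap([1.0, 2.0, 3.0])
expr = (lf + 1) / lf - 1
print(expr)  # (((_pobj0 + 1) / _pobj0) - 1)
```

Nothing is computed here: the result is only a string plus a dictionary of the values it refers to, which is what makes a later single-pass evaluation possible.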
LazyFrame uses regular Python evaluation, so it follows Python's order of operations.
expr = 2 ** (lf + 3) * 2
expr
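The grouping that Python applies to this expression can be checked with the standard ast module. This sketch only inspects how Python parses the source text; it does not touch LazyFrame, and lf is just a name inside the parsed string.

```python
import ast

# Parse the same expression text and look at how Python nests the operators.
tree = ast.parse("2 ** (lf + 3) * 2", mode="eval").body

outer = type(tree.op).__name__                 # applied last
inner = type(tree.left.op).__name__            # applied before the outer one
innermost = type(tree.left.right.op).__name__  # the parenthesised part

print(outer, inner, innermost)  # Mult Pow Add
```

The power binds tighter than the multiplication, so the expression is grouped as (2 ** (lf + 3)) * 2, and the overloaded operators are called in that order.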
eval evaluates the deferred expression by running it through numexpr. It takes an inplace parameter, which defaults to False.
expr.eval()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 5 entries, 0 to 4
dtypes: float64(5)
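As a rough illustration of the evaluation step, the sketch below hands an expression string and a dictionary of placeholder values to numexpr.evaluate. The evaluate helper and the plain-Python fallback are assumptions added here so the example also runs without numexpr installed; how LazyFrame calls numexpr internally is not shown in this document.

```python
import numpy as np

try:
    import numexpr as ne
except ImportError:  # numexpr is treated as optional in this sketch
    ne = None


def evaluate(expr, operands):
    """Evaluate an expression string over named NumPy arrays (hypothetical helper)."""
    if ne is not None:
        # numexpr compiles the string and computes it in one pass over the data.
        return ne.evaluate(expr, local_dict=operands)
    # Fallback: let Python and NumPy evaluate the same expression.
    return eval(expr, {"__builtins__": {}}, operands)


values = np.arange(6, dtype="float64").reshape(3, 2)
result = evaluate("2 ** (_pobj0 + 3) * 2", {"_pobj0": values})

# Reference result computed eagerly with NumPy.
expected = 2 ** (values + 3) * 2
print(np.allclose(result, expected))
```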
If something requires the LazyFrame to be evaluated and cannot be deferred, LazyFrame will evaluate itself automatically. This is normally triggered by an attribute access that it does not know how to defer. Examples are:
__array__
columns
values
expr = lf * 2
assert not expr.evaled
res = expr > 0
assert expr.evaled
res.head()
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | False | False | True | True | False |
| 1 | False | True | True | True | True |
| 2 | False | True | False | False | False |
| 3 | False | True | True | True | True |
| 4 | True | False | False | False | False |
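One common way to get this "evaluate on demand" behaviour in Python is to forward unknown attribute lookups through __getattr__. The LazyValue class below is a hypothetical, simplified sketch of that pattern and not the library's code; only the evaled flag and the eval method name mirror what is shown above.

```python
from types import SimpleNamespace


class LazyValue:
    """Hypothetical wrapper that defers a computation until it is really needed."""

    def __init__(self, thunk):
        self._thunk = thunk   # zero-argument callable producing the real object
        self._result = None
        self.evaled = False

    def eval(self):
        # Run the deferred computation once and cache the result.
        if not self.evaled:
            self._result = self._thunk()
            self.evaled = True
        return self._result

    def __getattr__(self, name):
        # Called only when normal lookup fails, i.e. for attributes the
        # wrapper does not know how to defer. Private names are not forwarded.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.eval(), name)


data = SimpleNamespace(columns=["a", "b"], values=[[1, 2], [3, 4]])


def double():
    doubled = [[v * 2 for v in row] for row in data.values]
    return SimpleNamespace(columns=data.columns, values=doubled)


expr = LazyValue(double)
assert not expr.evaled
cols = expr.columns   # unknown to the wrapper, so it evaluates first
assert expr.evaled
```

In this sketch special methods such as __array__ would need to be defined explicitly on the wrapper, because the guard above does not forward names that start with an underscore.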
Because of the lazy evaluation and numexpr, certain types of operations run faster when deferred:
df = pd.DataFrame(np.random.randn(10000000, 5))
lf = df.lazy()
%timeit df + df + df + df + df
1 loops, best of 3: 969 ms per loop
%timeit (lf + lf + lf + lf + lf).eval()
1 loops, best of 3: 268 ms per loop
correct = df + df + df + df + df
test = (lf + lf + lf + lf + lf).eval()
pd.util.testing.assert_almost_equal(correct, test)
True
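Outside IPython the %timeit magic is not available, so a comparable measurement can be sketched with the standard timeit module. This version works on a plain NumPy array and calls numexpr directly rather than going through LazyFrame; the chained and fused function names are made up for the example, and when numexpr is not installed the fused path simply falls back to the chained one. No timings are claimed here, since they depend on the machine and array size.

```python
import timeit

import numpy as np

try:
    import numexpr as ne
except ImportError:  # numexpr is treated as optional in this sketch
    ne = None

a = np.random.randn(100000, 5)


def chained():
    # Each "+" allocates a full-size temporary array before the next one runs.
    return a + a + a + a + a


def fused():
    # numexpr evaluates the whole expression in one pass when it is available.
    if ne is None:
        return chained()
    return ne.evaluate("a + a + a + a + a", local_dict={"a": a})


# Both paths should produce the same numbers.
assert np.allclose(chained(), fused())

t_chained = min(timeit.repeat(chained, number=10, repeat=3))
t_fused = min(timeit.repeat(fused, number=10, repeat=3))
print("chained: %.4f s, fused: %.4f s" % (t_chained, t_fused))
```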