For all the hype about big data, much value resides in the world's medium and small data. Especially when we consider the length of the feedback loop and total analyst time invested, insights from small and medium data are quite attractive and economical. Personally, I find analyzing data that fits into memory quite convenient, and therefore, when I am confronted with a data set that does not fit in memory as-is, I am willing to spend a bit of time to try to manipulate it to fit into memory.
The first technique I usually turn to is to only store distinct rows of a data set, along with the count of the number of times that row appears in the data set. This technique is fairly simple to implement, especially when the data set is generated by a SQL query. If the initial query that generates the data set is
SELECT u, v, w FROM t;
we would modify it to become
SELECT u, v, w, COUNT(1) FROM t GROUP BY u, v, w;
We now generate a sample data set with both discrete and continuous features.
%matplotlib inline
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from patsy import dmatrices, dmatrix
import scipy as sp
import scipy.special  # makes sp.special (used below) available
import scipy.stats  # makes sp.stats (used below) available
import seaborn as sns
from statsmodels import api as sm
from statsmodels.base.model import GenericLikelihoodModel
np.random.seed(1545721) # from random.org
N = 100001
u_min, u_max = 0, 100
v_p = 0.6
n_ws = 50
ws = sp.stats.norm.rvs(0, 1, size=n_ws)
w_min, w_max = ws.min(), ws.max()
df = pd.DataFrame({
'u': np.random.randint(u_min, u_max, size=N),
'v': sp.stats.bernoulli.rvs(v_p, size=N),
'w': np.random.choice(ws, size=N, replace=True)
})
df.head()
| | u | v | w |
|---|---|---|---|
| 0 | 97 | 0 | 0.537397 |
| 1 | 79 | 1 | 1.536383 |
| 2 | 44 | 1 | 1.074817 |
| 3 | 51 | 0 | -0.491210 |
| 4 | 47 | 1 | 1.592646 |
We see that this data frame has just over 100,000 rows, but only about 10,000 distinct rows.
df.shape[0]
100001
df.drop_duplicates().shape[0]
9997
We now use pandas' groupby method to produce a data frame that contains the count of each unique combination of u, v, and w.
count_df = df.groupby(list(df.columns)).size()
count_df.name = 'count'
count_df = count_df.reset_index()
In order to make later examples interesting, we shuffle the rows of the reduced data frame, because pandas automatically sorts the values we grouped on in the reduced data frame.
shuffled_ixs = count_df.index.values
np.random.shuffle(shuffled_ixs)
count_df = count_df.iloc[shuffled_ixs].copy().reset_index(drop=True)
count_df.head()
| | u | v | w | count |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 |
| 2 | 35 | 0 | 0.358156 | 9 |
| 3 | 19 | 1 | -0.760298 | 17 |
| 4 | 40 | 1 | -0.688514 | 13 |
Again, we see that we are storing 90% fewer rows. Although this data set has been artificially generated, I have seen space savings of up to 98% when applying this technique to real-world data sets.
count_df.shape[0] / N
0.0999690003099969
This space savings allows me to analyze data sets that initially appear too large to fit in memory. For example, the computer I am writing this on has 16 GB of RAM. At a 90% space savings, I can comfortably analyze a data set that might otherwise occupy 80 GB in memory, since it summarizes to roughly 8 GB, while leaving a healthy amount of memory for other processes. To me, the convenience and tight feedback loop that come with fitting a data set entirely in memory are hard to overstate.
As nice as it is to fit a data set into memory, it's not very useful unless we can still analyze it. The rest of this post will show how we can perform standard operations on these summary data sets.
For convenience, we will separate the feature columns from the count columns.
summ_df = count_df[['u', 'v', 'w']]
n = count_df['count']
Suppose we have a group of numbers $x_1, x_2, \ldots, x_n$. Let the unique values among these numbers be denoted $z_1, z_2, \ldots, z_m$, and let $n_j$ be the number of times $z_j$ appears in the original group. The mean of the $x_i$s is therefore
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{1}{n} \sum_{j=1}^m n_j z_j,$$

since we may group identical $x_i$s into a single summand. Since $n = \sum_{j=1}^m n_j$, we can calculate the mean using the following function.
def mean(df, count):
    return df.mul(count, axis=0).sum() / count.sum()
mean(summ_df, n)
u    49.308067
v     0.598704
w     0.170815
dtype: float64
We see that the means calculated by our function agree with the means of the original data frame.
df.mean(axis=0)
u    49.308067
v     0.598704
w     0.170815
dtype: float64
np.allclose(mean(summ_df, n), df.mean(axis=0))
True
We can calculate the variance as
$$\sigma_x^2 = \frac{1}{n - 1} \sum_{i=1}^n \left(x_i - \bar{x}\right)^2 = \frac{1}{n - 1} \sum_{j=1}^m n_j \left(z_j - \bar{x}\right)^2,$$

using the same trick of combining identical terms in the original sum. Again, this calculation is easy to implement in Python.
def var(df, count):
    mu = mean(df, count)

    return np.power(df - mu, 2).mul(count, axis=0).sum() / (count.sum() - 1)
var(summ_df, n)
u    830.025064
v      0.240260
w      1.099191
dtype: float64
We see that the variances calculated by our function agree with the variances of the original data frame.
df.var()
u    830.025064
v      0.240260
w      1.099191
dtype: float64
np.allclose(var(summ_df, n), df.var(axis=0))
True
fig, (full_ax, summ_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
nbins = 20
blue, green = sns.color_palette()[:2]
full_ax.hist(df.w, bins=nbins, color=blue, alpha=0.5, lw=0);
full_ax.set_xlabel('$w$');
full_ax.set_ylabel('Count');
full_ax.set_title('Full data frame');
summ_ax.hist(summ_df.w, bins=nbins, weights=n, color=green, alpha=0.5, lw=0);
summ_ax.set_xlabel('$w$');
summ_ax.set_title('Summarized data frame');
We see that the histograms for w produced from the full and summarized data frames are identical.
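As a quick numerical cross-check (a sketch, not part of the original analysis), we can compare bin counts directly: a histogram of the summarized w weighted by n should reproduce the unweighted histogram of the full w.

full_counts, bin_edges = np.histogram(df.w, bins=nbins)
# weight each distinct value of w by its multiplicity in the original data
summ_counts, _ = np.histogram(summ_df.w, bins=bin_edges, weights=n)

np.allclose(full_counts, summ_counts)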
Calculating the mean and variance of our summarized data frames was not too difficult. Calculating quantiles from this data frame is slightly more involved, though still not terribly hard.
Our implementation will rely on sorting the data frame. Though this implementation is not optimal from a computational complexity point of view, it is in keeping with the spirit of pandas' implementation of quantiles. I have given some thought to how to implement linear-time selection on the summarized data frame, but have not yet worked out the details.
Before writing a function to calculate quantiles of a data frame with several columns, we will walk through the simpler case of computing the quartiles of a single series.
u = summ_df.u
u.head()
0     0
1    48
2    35
3    19
4    40
Name: u, dtype: int64
First we argsort the series.
sorted_ilocs = u.argsort()
We see that u.iloc[sorted_ilocs] will now be in ascending order.
sorted_u = u.iloc[sorted_ilocs]
(sorted_u.values[:-1] <= sorted_u.values[1:]).all()
True
More importantly, n.iloc[sorted_ilocs] will have the count of the smallest element of u first, the count of the second smallest element second, and so on.
sorted_n = n.iloc[sorted_ilocs]
sorted_cumsum = sorted_n.cumsum()
cdf = (sorted_cumsum / n.sum()).values
Now, the i-th location of sorted_cumsum will contain the number of elements of u less than or equal to the i-th smallest element, and therefore cdf is the empirical cumulative distribution function of u. The following plot shows that this interpretation is correct.
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.plot(sorted_u, cdf, c=blue, label='Empirical CDF');
plot_u = np.arange(100)
ax.plot(plot_u, sp.stats.randint.cdf(plot_u, u_min, u_max), '--', c=red, label='Population CDF');
ax.set_xlabel('$u$');
ax.legend(loc=2);
If, for example, we wish to find the median of u, we want to find the first location in cdf which is greater than or equal to 0.5.
median_iloc_in_sorted = (cdf < 0.5).argmin()
The location of the median in u is therefore sorted_ilocs.iloc[median_iloc_in_sorted], so the median of u is
u.iloc[sorted_ilocs.iloc[median_iloc_in_sorted]]
49
df.u.quantile(0.5)
49.0
We can generalize this method to calculate multiple quantiles simultaneously as follows.
q = np.array([0.25, 0.5, 0.75])
u.iloc[sorted_ilocs.iloc[np.less.outer(cdf, q).argmin(axis=0)]]
2299    24
9079    49
1211    74
Name: u, dtype: int64
df.u.quantile(q)
0.25    24
0.50    49
0.75    74
dtype: float64
The array np.less.outer(cdf, q) contains one column per requested quantile, each of which holds the result of comparing cdf to that element of q; argmin(axis=0) then finds, for each quantile, the first location where cdf reaches it. The following function generalizes this approach from series to data frames.
def quantile(df, count, q=0.5):
    q = np.ravel(q)

    sorted_ilocs = df.apply(pd.Series.argsort)
    sorted_counts = sorted_ilocs.apply(lambda s: count.iloc[s].values)
    cdf = sorted_counts.cumsum() / sorted_counts.sum()

    q_ilocs_in_sorted_ilocs = pd.DataFrame(np.less.outer(cdf.values, q).argmin(axis=0).T,
                                           columns=df.columns)
    q_ilocs = sorted_ilocs.apply(lambda s: s[q_ilocs_in_sorted_ilocs[s.name]].reset_index(drop=True))

    q_df = df.apply(lambda s: s.iloc[q_ilocs[s.name]].reset_index(drop=True))
    q_df.index = q

    return q_df
quantile(summ_df, n, q=q)
| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |
df.quantile(q=q)
| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |
np.allclose(quantile(summ_df, n, q=q), df.quantile(q=q))
True
Another important operation is [bootstrapping](https://en.wikipedia.org/wiki/Bootstrapping_(statistics%29). We will see two ways to perform bootstrapping on the summary data set.
n_boot = 10000
Key to both approaches to the bootstrap is knowing the proportion of the data set that each distinct combination of features comprises.
weights = n / n.sum()
The two approaches differ in the type of data frame they produce. The first we will discuss produces a non-summarized data frame with non-unique rows, while the second produces a summarized data frame. Each of these approaches to bootstrapping is useful in different situations.
To produce a non-summarized data frame, we generate a list of locations in summ_df based on weights using numpy.random.choice.
boot_ixs = np.random.choice(summ_df.shape[0], size=n_boot, replace=True,
p=weights)
boot_df = summ_df.iloc[boot_ixs]
boot_df.head()
| | u | v | w |
|---|---|---|---|
| 1171 | 47 | 1 | -1.392235 |
| 9681 | 3 | 1 | 0.018521 |
| 6664 | 13 | 1 | 1.941207 |
| 8343 | 13 | 0 | 0.655181 |
| 3595 | 95 | 1 | 0.972592 |
We can verify that our bootstrapped data frame has (approximately) the same distribution as the original data frame using Q-Q plots.
ps = np.linspace(0, 1, 100)
boot_qs = boot_df[['u', 'w']].quantile(q=ps)
qs = df[['u', 'w']].quantile(q=ps)
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.u, boot_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.w, boot_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
We see that both of the resampled distributions agree quite closely with the original distributions. We have only produced Q-Q plots for u and w because v is binary-valued.
While non-summarized bootstrap resampling may at first appear to counteract the benefits of summarizing the original data frame, it can be quite useful when training and evaluating online learning algorithms, where iterating through the locations of the bootstrapped data in the original summarized data frame is efficient.
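As a minimal sketch of that streaming use case (illustrative only; the statistic is arbitrary), we can walk the bootstrap sample by location and maintain a running mean of w without materializing another large data frame.

# stream the bootstrap sample one location at a time,
# updating an online estimate of the mean of w
running_mean = 0.

for i, ix in enumerate(boot_ixs, start=1):
    running_mean += (summ_df.w.iloc[ix] - running_mean) / i

running_mean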
To produce a summarized data frame, the counts of the resampled data frame are sampled from a multinomial distribution with event probabilities given by weights.
boot_counts = pd.Series(np.random.multinomial(n_boot, weights), name='count')
Again, we compare the distribution of our bootstrapped data frame to that of the original with Q-Q plots. Here our summarized quantile function is quite useful.
boot_count_qs = quantile(summ_df, boot_counts, q=ps)
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.u, boot_count_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
label='Perfect agreement');
ax.scatter(qs.w, boot_count_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
Again, we see that both of the resampled distributions agree quite closely with the original distributions.
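Because the summarized bootstrap only draws new counts, we can also repeat the multinomial draw many times and reuse the weighted helpers from earlier to approximate the bootstrap distribution of a statistic. The following sketch (the number of replicates is arbitrary) does this for the mean of w.

n_reps = 1000  # number of bootstrap replicates, chosen only for illustration
boot_means = np.empty(n_reps)

for i in range(n_reps):
    rep_counts = pd.Series(np.random.multinomial(n_boot, weights))
    boot_means[i] = mean(summ_df[['w']], rep_counts)['w']

# boot_means now approximates the bootstrap distribution of the mean of w
boot_means.std()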
Linear regression is among the most frequently used types of statistical inference, and it plays nicely with summarized data. Typically, we have a response variable y that we wish to model as a linear combination of u, v, and w,

$$y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$ is noise. We generate such a data set below (with $\sigma = 0.1$).
beta = np.array([-3., 0.1, -4., 2.])
noise_std = 0.1
X = dmatrix('u + v + w', data=df)
y = pd.Series(np.dot(X, beta), name='y') + sp.stats.norm.rvs(scale=noise_std, size=N)
y.head()
0    7.862559
1    3.830585
2   -0.388246
3    1.047091
4    0.992082
Name: y, dtype: float64
Each element of the series y corresponds to one row in the uncompressed data frame df. The OLS class from statsmodels comes quite close to recovering the true regression coefficients.
full_ols = sm.OLS(y, X).fit()
full_ols.params
const   -2.999658
x1       0.099986
x2      -3.998997
x3       2.000317
dtype: float64
To show how we can perform linear regression on the summarized data frame, we recall that the ordinary least squares estimator minimizes the residual sum of squares. The residual sum of squares is given by
$$\textrm{RSS} = \sum_{i=1}^n \left(y_i - x_i \beta^\top\right)^2.$$

Here $x_i = [1\ u_i\ v_i\ w_i]$ is the $i$-th row of the original data frame (with a constant added for the intercept) and $\beta = [\beta_0\ \beta_1\ \beta_2\ \beta_3]$ is the row vector of regression coefficients. It would be tempting to rewrite RSS by grouping the terms based on the row their features map to in the compressed data frame, but this approach would lead to incorrect results. Due to the stochastic noise term $\varepsilon_i$, identical values of u, v, and w can (and will almost certainly) map to different values of y. We can see this phenomenon by calculating the range of y grouped on u, v, and w.
reg_df = pd.concat((y, df), axis=1)
reg_df.groupby(['u', 'v', 'w']).y.apply(np.ptp).describe()
count    9997.000000
mean        0.297891
std         0.091815
min         0.000000
25%         0.237491
50%         0.296838
75%         0.358015
max         0.703418
Name: y, dtype: float64
If y were uniquely determined by u, v, and w, we would expect the mean and quartiles of these ranges to be zero, which they are not. Fortunately, we can account for this difficulty with a bit of care.
Let $S_j = \{i\ |\ x_i = z_j\}$, the set of row indices in the original data frame that correspond to the $j$-th row in the summary data frame. Define $\bar{y}_{(j)} = \frac{1}{n_j} \sum_{i \in S_j} y_i$, which is the mean of the response variables that correspond to $z_j$. Intuitively, since $\varepsilon_i$ has mean zero, $\bar{y}_{(j)}$ is our best unbiased estimate of $z_j \beta^\top$. We will now show that regressing $\sqrt{n_j}\, \bar{y}_{(j)}$ on $\sqrt{n_j}\, z_j$ gives the same results as the full regression. We use the standard trick of adding and subtracting the mean and get
$$\begin{aligned}
\textrm{RSS}
  &= \sum_{j=1}^m \sum_{i \in S_j} \left(y_i - z_j \beta^\top\right)^2 \\
  &= \sum_{j=1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right) + \left(\bar{y}_{(j)} - z_j \beta^\top\right)\right)^2 \\
  &= \sum_{j=1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right)^2 + 2 \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - z_j \beta^\top\right) + \left(\bar{y}_{(j)} - z_j \beta^\top\right)^2\right).
\end{aligned}$$

As is usual in these situations, the cross term vanishes, since
$$\begin{aligned}
\sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - z_j \beta^\top\right)
  &= \sum_{i \in S_j} \left(y_i \bar{y}_{(j)} - y_i z_j \beta^\top - \bar{y}_{(j)}^2 + \bar{y}_{(j)} z_j \beta^\top\right) \\
  &= \bar{y}_{(j)} \sum_{i \in S_j} y_i - z_j \beta^\top \sum_{i \in S_j} y_i - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} z_j \beta^\top \\
  &= n_j \bar{y}_{(j)}^2 - n_j \bar{y}_{(j)} z_j \beta^\top - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} z_j \beta^\top \\
  &= 0.
\end{aligned}$$

Therefore we may decompose the residual sum of squares as
$$\begin{aligned}
\textrm{RSS}
  &= \sum_{j=1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j=1}^m \sum_{i \in S_j} \left(\bar{y}_{(j)} - z_j \beta^\top\right)^2 \\
  &= \sum_{j=1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j=1}^m n_j \left(\bar{y}_{(j)} - z_j \beta^\top\right)^2.
\end{aligned}$$

The important property of this decomposition is that the first sum does not depend on $\beta$, so minimizing RSS with respect to $\beta$ is equivalent to minimizing the second sum. We see that this second sum can be written as
$$\sum_{j=1}^m n_j \left(\bar{y}_{(j)} - z_j \beta^\top\right)^2 = \sum_{j=1}^m \left(\sqrt{n_j}\, \bar{y}_{(j)} - \sqrt{n_j}\, z_j \beta^\top\right)^2,$$

which is exactly the residual sum of squares for regressing $\sqrt{n_j}\, \bar{y}_{(j)}$ on $\sqrt{n_j}\, z_j$.
summ_reg_df = reg_df.groupby(['u', 'v', 'w']).y.mean().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_reg_df['n'] = n
summ_reg_df.head()
| | u | v | w | y | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | -2.173984 | 14 |
| 1 | 48 | 1 | -0.993981 | -4.174895 | 7 |
| 2 | 35 | 0 | 0.358156 | 1.252848 | 9 |
| 3 | 19 | 1 | -0.760298 | -6.612355 | 17 |
| 4 | 40 | 1 | -0.688514 | -4.379063 | 13 |
The design matrices for this summarized model are easy to construct using patsy.
y_summ, X_summ = dmatrices("""
I(np.sqrt(n) * y) ~
np.sqrt(n) + I(np.sqrt(n) * u) + I(np.sqrt(n) * v) + I(np.sqrt(n) * w) - 1
""",
data=summ_reg_df)
Note that we must remove patsy's constant column for the intercept and replace it with np.sqrt(n).
summ_ols = sm.OLS(y_summ, X_summ).fit()
summ_ols.params
array([-2.99965783, 0.09998571, -3.99899718, 2.00031673])
We see that the summarized regression produces the same parameter estimates as the full regression.
np.allclose(full_ols.params, summ_ols.params)
True
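An equivalent formulation, sketched below, is weighted least squares: minimizing $\sum_{j=1}^m n_j \left(\bar{y}_{(j)} - z_j \beta^\top\right)^2$ is exactly what statsmodels' WLS does with weights $n_j$, so it should recover the same coefficients without the manual $\sqrt{n_j}$ scaling. (The variable names here are only for illustration.)

# weighted least squares on the unscaled summarized data;
# the weights n_j reproduce the second sum in the RSS decomposition
X_summ_unscaled = dmatrix('u + v + w', data=summ_reg_df)
summ_wls = sm.WLS(summ_reg_df.y, X_summ_unscaled, weights=summ_reg_df.n).fit()

np.allclose(full_ols.params, summ_wls.params)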
As a final example of adapting common methods to summarized data frames, we will show how to fit a logistic regression model on a summarized data set by maximum likelihood.
We will use the model
$$P(s = 1\ |\ x) = \frac{1}{1 + \exp\left(-x \gamma^\top\right)}.$$
As above, $x_i = [1\ u_i\ v_i\ w_i]$. The true value of $\gamma$ is
gamma = np.array([1., 0.01, -1., -2.])
We now generate samples from this model.
X = dmatrix('u + v + w', data=df)
p = pd.Series(sp.special.expit(np.dot(X, gamma)), name='p')
s = pd.Series(sp.stats.bernoulli.rvs(p), name='s')
logit_df = pd.concat((s, p, df), axis=1)
logit_df.head()
| | s | p | u | v | w |
|---|---|---|---|---|---|
| 0 | 1 | 0.709963 | 97 | 0 | 0.537397 |
| 1 | 0 | 0.092560 | 79 | 1 | 1.536383 |
| 2 | 0 | 0.153211 | 44 | 1 | 1.074817 |
| 3 | 1 | 0.923609 | 51 | 0 | -0.491210 |
| 4 | 0 | 0.062077 | 47 | 1 | 1.592646 |
We first fit the logistic regression model to the full data frame.
full_logit = sm.Logit(s, X).fit()
Optimization terminated successfully.
         Current function value: 0.414221
         Iterations 7
full_logit.params
const    0.965283
x1       0.009944
x2      -0.966797
x3      -1.990506
dtype: float64
We see that the estimates are quite close to the true parameters.
The technique used to adapt maximum likelihood estimation of logistic regression to the summarized data frame is quite elegant. The likelihood for the full data set follows from the fact that, given u, v, and w, s is Bernoulli distributed with
$$s_i\ |\ x_i \sim \textrm{Ber}\left(\frac{1}{1 + \exp\left(-x_i \gamma^\top\right)}\right).$$

To derive the likelihood for the summarized data set, we count the number of successes (where $s = 1$) for each unique combination of features $z_j$, and denote this quantity $k_j$.
summ_logit_df = logit_df.groupby(['u', 'v', 'w']).s.sum().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_logit_df = summ_logit_df.rename(columns={'s': 'k'})
summ_logit_df['n'] = n
summ_logit_df.head()
| | u | v | w | k | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 5 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 | 7 |
| 2 | 35 | 0 | 0.358156 | 8 | 9 |
| 3 | 19 | 1 | -0.760298 | 12 | 17 |
| 4 | 40 | 1 | -0.688514 | 9 | 13 |
Now, instead of each row representing a single Bernoulli trial (as in the full data frame), each row represents $n_j$ trials, so we have that $k_j$ is (conditionally) binomially distributed with

$$k_j\ |\ z_j \sim \textrm{Bin}\left(n_j, \frac{1}{1 + \exp\left(-z_j \gamma^\top\right)}\right).$$

summ_logit_X = dmatrix('u + v + w', data=summ_logit_df)
As I have shown in a previous post, we can use statsmodels' GenericLikelihoodModel class to fit custom probability models by maximum likelihood. The model is implemented as follows.
class SummaryLogit(GenericLikelihoodModel):
    def __init__(self, endog, exog, n, **qwargs):
        """
        endog is the number of successes
        exog are the features
        n are the number of trials
        """
        self.n = n
        super(SummaryLogit, self).__init__(endog, exog, **qwargs)

    def nloglikeobs(self, gamma):
        """
        gamma is the vector of regression coefficients

        returns the negative log likelihood of each of the observations for
        the coefficients in gamma
        """
        p = sp.special.expit(np.dot(self.exog, gamma))

        return -sp.stats.binom.logpmf(self.endog, self.n, p)

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **qwargs):
        # wraps GenericLikelihoodModel's fit method to set default start parameters
        if start_params is None:
            start_params = np.zeros(self.exog.shape[1])

        return super(SummaryLogit, self).fit(start_params=start_params,
                                             maxiter=maxiter, maxfun=maxfun,
                                             **qwargs)
summ_logit = SummaryLogit(summ_logit_df.k, summ_logit_X, summ_logit_df.n).fit()
Optimization terminated successfully.
         Current function value: 1.317583
         Iterations: 357
         Function evaluations: 599
Again, we get reasonable estimates of the regression coefficients, which are close to those obtained from the full data set.
summ_logit.params
np.allclose(summ_logit.params, full_logit.params, rtol=10**-4)
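As a final cross-check (a sketch, not part of the original analysis), statsmodels can also fit this grouped binomial model directly: sm.GLM accepts a two-column endog of (successes, failures) with a Binomial family and should produce essentially the same coefficient estimates.

# equivalent GLM fit on (successes, failures) counts
glm_endog = np.column_stack((summ_logit_df.k, summ_logit_df.n - summ_logit_df.k))
summ_glm = sm.GLM(glm_endog, summ_logit_X, family=sm.families.Binomial()).fit()

np.allclose(summ_glm.params, full_logit.params, rtol=10**-4)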
Hopefully this introduction to the technique of summarizing data sets has proved useful and will allow you to explore medium data more easily in the future. We have only scratched the surface on the types of statistical techniques that can be adapted to work on summarized data sets, but with a bit of ingenuity, many of the ideas in this post can apply to other models.