This IPython notebook illustrates the usage of the hierreg package, which fits linear models whose coefficients can vary according to ‘groups’ or levels to which observations belong, similar to random-effects and hierarchical Bayes models, but with the group deviations estimated by applying regularization to them.
In a typical regression model, coefficients are estimated so as to minimize $$ L(w) = \lVert y - Xw \rVert $$ where $y$ is the vector of observed values, $X$ is the matrix of covariates, and $w$ is the vector of coefficients to estimate.
This model assumes that observations can belong to arbitrary categorical groups (e.g. observations coming from the same geographical region, same year, etc.), and that the coefficients differ across these groups, but not by too much. It does so by minimizing
$$ L(w,v) = \lVert y - X\big(w+\sum_{g \in groups} v_{g}\, I_{[x \in g]}\big)\rVert \:+\: \lambda \lVert groupweights \cdot v \rVert $$ where $v_g$ is the vector of coefficient deviations for group $g$, $I_{[x \in g]}$ indicates whether an observation belongs to group $g$, $groupweights$ weights the deviations of each group (e.g. by group size), and $\lambda$ controls the strength of the regularization.
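To make the objective concrete, here is a minimal sketch of how such a problem could be set up directly in cvxpy (the library hierreg itself relies on, as seen in the solver options used later), using squared norms for simplicity and a single grouping column encoded as a one-hot matrix G - all names and shapes here are illustrative, not part of hierreg's API:
import numpy as np, cvxpy as cp
n, p, k = 100, 3, 5                     ## observations, features, groups (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))
G = np.eye(k)[rng.integers(0, k, n)]    ## one-hot group membership per row
y = rng.normal(size=n)
lam = 1.0
w = cp.Variable(p)                      ## shared coefficients
v = cp.Variable((k, p))                 ## per-group deviations
## each row's fitted value uses coefficients w + v[its group]
fitted = X @ w + cp.sum(cp.multiply(X, G @ v), axis=1)
cp.Problem(cp.Minimize(cp.sum_squares(y - fitted) + lam*cp.sum_squares(v))).solve()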
While this type of model doesn't offer the same inferential or statistical testing capabilities as random effects or hierarchical Bayes, it can end up making better predictions than either alternative, depending on the problem.
(Logistic regression is also supported; the model needs to be called with the argument problem='classification'.)
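For instance, a minimal sketch of instantiating a classifier (the fit/predict calls would then follow the same pattern as the regression examples below):
from hierreg import HierarchicalRegression
## same interface as the regression examples, but fits a logistic model
clf = HierarchicalRegression(problem='classification')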
1. Small example - Predicting student attainment
2. Larger example - Predicting sales at different stores
Here I'll try to build a model to predict student attainment based on standardized test scores, family information, and the schools that children attended - the data was downloaded from here: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc
Description from the webpage:
The data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set of which the following are used.
Taking a look at the data:
import pandas as pd, numpy as np, warnings
warnings.filterwarnings("ignore")
df=pd.read_csv('2lev-xc.txt',names=['VRQ','ATTAIN','PID','SEX','SC','SID'],sep=r'\s',engine='python')
df.head()
| | VRQ | ATTAIN | PID | SEX | SC | SID |
|---|---|---|---|---|---|---|
| 0 | 11.0 | 10.0 | 1.0 | 0 | 0.0 | 9.0 |
| 1 | 0.0 | 3.0 | 1.0 | 1 | 0.0 | 9.0 |
| 2 | -14.0 | 2.0 | 1.0 | 0 | 0.0 | 9.0 |
| 3 | -6.0 | 3.0 | 1.0 | 0 | 20.0 | 9.0 |
| 4 | -30.0 | 2.0 | 1.0 | 1 | 0.0 | 9.0 |
There are, overall, over a hundred primary schools and over a dozen secondary schools. For a dataset of this size, that is a very large number of parameters:
print(len(set(df.PID)))
print(len(set(df.SID)))
148
19
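For reference, one-hot encoding these two ID columns alone would already add 148 + 19 = 167 columns to a design matrix:
print(pd.get_dummies(df[['PID','SID']].astype(str)).shape[1])
167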
Now fitting a model with school-varying coefficients:
from hierreg import HierarchicalRegression
from sklearn.model_selection import train_test_split
y=df['ATTAIN']
X=df[['VRQ','SEX','SC']]
groups=df[['PID','SID']]
## simple train-test split - note that I'm using a random seed to get the same split again later
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
X, y, groups, test_size=0.33, random_state=42)
## for small problems like this, solver 'ECOS' works faster than the default 'SCS'
## it would still work fine without changing the default options though, only a bit slower
## note the large regularization parameter
hr=HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})
hr.fit(X_train, y_train, groups_train)
print(np.mean((y_test-hr.predict(X_test, groups_test))**2))
4.503111161611763
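The regularization parameter makes a big difference here, and a reasonable way to pick it is a plain validation loop over candidate values - a minimal sketch using only the fit/predict interface shown above (the candidate values are illustrative):
## hold out part of the training data for validation
X_tr, X_val, y_tr, y_val, g_tr, g_val = train_test_split(
    X_train, y_train, groups_train, test_size=0.25, random_state=42)
best_reg, best_mse = None, np.inf
for reg in [1e2, 1e3, 1e4, 1e5]:
    m = HierarchicalRegression(l2_reg=reg, cvxpy_opts={'solver':'ECOS'})
    m.fit(X_tr, y_tr, g_tr)
    mse = np.mean((y_val - m.predict(X_val, g_val))**2)
    if mse < best_mse:
        best_reg, best_mse = reg, mse
print(best_reg, best_mse)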
The choice of which variables to turn into groups with deviations in coefficients and which to use as plain categorical variables can be quite hard though - in this very simple example, a tuned regularized model with no group-varying coefficients ends up performing better:
from sklearn.linear_model import Ridge
y=df['ATTAIN']
X=df[['VRQ','SEX','SC','PID','SID']]
X['PID']=X.PID.astype(str)
X['SID']=X.SID.astype(str)
X=pd.get_dummies(X)
## same train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
lr=Ridge(100).fit(X_train, y_train)
print(np.mean((y_test-lr.predict(X_test))**2))
4.386018754899887
(With larger datasets and more variables, this might not necessarily be the case though, as illustrated in the next example.)
This larger example tries to predict sales at different Rossmann stores. The data comes from a Kaggle competition and is quite large compared to the previous dataset. This is not the typical time-series forecasting problem, as the rows here include the number of customers who visited each store on each day, along with information about the days and about competitors:
rossman=pd.read_csv('train.csv',engine='python')
rossman['StateHoliday']=rossman.StateHoliday.map(lambda x: str(x))
# excluding days with no sales
rossman=rossman.loc[rossman.Sales>0]
rossman.head()
| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 |
| 1 | 2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 |
| 2 | 3 | 5 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 |
| 3 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 |
| 4 | 5 | 5 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 |
rossman.shape
(844338, 9)
stores=pd.read_csv('store.csv')
stores.head()
| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 2 | a | a | 570.0 | 11.0 | 2007.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 4 | 5 | a | a | 29910.0 | 4.0 | 2015.0 | 0 | NaN | NaN | NaN |
Doing some very basic feature engineering - there is a lot more that could be done, such as using time lags of sales and customers (a sketch of these appears after the next table), and incorporating more information about competitors, but the idea here is just to illustrate the usage of this package.
rossman['Year']=rossman.Date.map(lambda x: x[:4])
rossman['Month']=rossman.Date.map(lambda x: x[5:7])
rossman['DayOfWeek']=rossman.DayOfWeek.astype(str)
max_comp_dist=stores.CompetitionDistance.max()
## impute missing competition distances with the largest observed distance
stores.loc[stores.CompetitionDistance.isnull(),'CompetitionDistance']=max_comp_dist
rossman=pd.merge(rossman,stores[['Store','StoreType','Assortment','CompetitionDistance']],on='Store')
rossman.head()
| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | Year | Month | StoreType | Assortment | CompetitionDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 1 | 1 | 4 | 2015-07-30 | 5020 | 546 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 2 | 1 | 3 | 2015-07-29 | 4782 | 523 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 3 | 1 | 2 | 2015-07-28 | 5011 | 560 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 4 | 1 | 1 | 2015-07-27 | 6102 | 612 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
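As an illustration of the time-lag features mentioned earlier (not used in the models below), per-store lags of sales and customers could be built with a groupby and shift; this is a minimal sketch, assuming lags of one trading day are wanted:
## work on a sorted copy so that shift(1) gives the previous trading day per store
lagged=rossman.sort_values(['Store','Date']).copy()
lagged['PrevSales']=lagged.groupby('Store')['Sales'].shift(1)
lagged['PrevCustomers']=lagged.groupby('Store')['Customers'].shift(1)
## the first row of each store has no lag and would need dropping or imputing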
As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables. Note that this model is linear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a long time. The models in the winning solutions for this competition took over a day to fit, according to their authors.
%%time
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, hstack
y=rossman['Sales']
X=rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]
Xcateg=rossman[['Store', 'DayOfWeek', 'StoreType','Assortment','Year','Month']]
Xcateg['Store']=Xcateg.Store.astype(str)
Xcateg=coo_matrix(pd.get_dummies(Xcateg).to_numpy())
X=hstack([pd.get_dummies(X).to_numpy(),Xcateg])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=100)
lr=Ridge()
lr.fit(csr_matrix(X_train),y_train)
preds_lr=lr.predict(X_test)
print(np.sqrt(np.mean((y_test-preds_lr)**2)))
2678.6297086
Wall time: 11min 23s
Now fitting a model with coefficients that vary by store - note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and uses around 4 GB of RAM. Nevertheless, this results in a huge improvement over the previous model:
%%time
from hierreg import HierarchicalRegression
y=rossman['Sales']
X=rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance','Year','Month']]
group=rossman[['Store','StoreType','Assortment']]
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
pd.get_dummies(X), y, group, test_size=0.33, random_state=100)
## for larger datasets, the casadi interface (which uses IPOPT) can provide better running times, but it only supports the default parameters
hlr=HierarchicalRegression(solver_interface='casadi')
hlr.fit(X_train,y_train,group_train)
preds_hlr1=hlr.predict(X_test,group_test)
print(np.sqrt(np.mean((y_test-preds_hlr1)**2)))
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************
673.711325158
Wall time: 7h 21min 50s