This IPython notebook illustrates the usage of the hierreg package, which fits linear models whose coefficients can vary according to ‘groups’ or levels to which observations belong, similar to random-effects and hierarchical Bayes models, but with the group deviations estimated by applying regularization to them.
In a typical regression model, coefficients are estimated so as to minimize $$ L(w) = \lVert y - Xw \rVert $$ where $y$ is the vector of observed values, $X$ is the matrix of covariates, and $w$ is the vector of coefficients to estimate.
This model assumes that observations can belong to arbitrary categorical groups (e.g. observations coming from the same geographical region, same year, etc.), and that the coefficients differ across these groups, but not by too much. It does so by minimizing
$$ L(w,v) = \lVert y - X\big(w+\sum_{g \in groups} v_{g}\, I_{[x \in g]}\big)\rVert \:+\: \lambda \lVert groupweights \cdot v \rVert $$ where $v_g$ is the vector of coefficient deviations for group $g$, $I_{[x \in g]}$ indicates whether an observation belongs to group $g$, $groupweights$ weights the deviations of each group (e.g. by group size), and $\lambda$ controls the strength of the regularization.
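To make the objective concrete, here is a minimal sketch of how such a problem could be set up directly in cvxpy (the library hierreg itself relies on, as seen in the solver options used later), using squared norms for simplicity and a single grouping column encoded as a one-hot matrix G - all names and shapes here are illustrative, not part of hierreg's API:
import numpy as np, cvxpy as cp
n, p, k = 100, 3, 5                     ## observations, features, groups (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(n, p))
G = np.eye(k)[rng.integers(0, k, n)]    ## one-hot group membership per row
y = rng.normal(size=n)
lam = 1.0
w = cp.Variable(p)                      ## shared coefficients
v = cp.Variable((k, p))                 ## per-group deviations
## each row's fitted value uses coefficients w + v[its group]
fitted = X @ w + cp.sum(cp.multiply(X, G @ v), axis=1)
cp.Problem(cp.Minimize(cp.sum_squares(y - fitted) + lam*cp.sum_squares(v))).solve()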
While this type of model doesn't offer the same inferential or statistical testing capabilities as random effects or hierarchical Bayes, it can end up making better predictions than either alternative, depending on the problem.
(Logistic regression is also supported; the model needs to be called with the argument problem='classification'.)
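For instance, a minimal sketch of instantiating a classifier (the fit/predict calls would then follow the same pattern as the regression examples below):
from hierreg import HierarchicalRegression
## same interface as the regression examples, but fits a logistic model
clf = HierarchicalRegression(problem='classification')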
1. Small example - Predicting student attainment
2. Larger example - Predicting sales at different stores
Here I'll try to build a model to predict student attainment based on standardized test scores, family information, and the schools that children attended - the data was downloaded from here: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc
Description from the webpage:
The data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set of which the following are used.
Taking a look at the data:
import pandas as pd, numpy as np, warnings
warnings.filterwarnings("ignore")
df=pd.read_csv('2lev-xc.txt',names=['VRQ','ATTAIN','PID','SEX','SC','SID'],sep=r'\s',engine='python')
df.head()
| | VRQ | ATTAIN | PID | SEX | SC | SID |
|---|---|---|---|---|---|---|
| 0 | 11.0 | 10.0 | 1.0 | 0 | 0.0 | 9.0 |
| 1 | 0.0 | 3.0 | 1.0 | 1 | 0.0 | 9.0 |
| 2 | -14.0 | 2.0 | 1.0 | 0 | 0.0 | 9.0 |
| 3 | -6.0 | 3.0 | 1.0 | 0 | 20.0 | 9.0 |
| 4 | -30.0 | 2.0 | 1.0 | 1 | 0.0 | 9.0 |
There are, overall, over a hundred primary schools and over a dozen secondary schools. For a dataset of this size, that is a very large number of parameters:
print(len(set(df.PID)))
print(len(set(df.SID)))
148
19
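For reference, one-hot encoding these two ID columns alone would already add 148 + 19 = 167 columns to a design matrix:
print(pd.get_dummies(df[['PID','SID']].astype(str)).shape[1])
167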
Now fitting a model with school-varying coefficients:
from hierreg import HierarchicalRegression
from sklearn.model_selection import train_test_split
y=df['ATTAIN']
X=df[['VRQ','SEX','SC']]
groups=df[['PID','SID']]
## simple train-test split - note that I'm using a random seed to get the same split again later
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
X, y, groups, test_size=0.33, random_state=42)
## for small problems like this, solver 'ECOS' works faster than the default 'SCS'
## it would still work fine without changing the default options though, only a bit slower
## note the large regularization parameter
hr=HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})
hr.fit(X_train, y_train, groups_train)
print(np.mean((y_test-hr.predict(X_test, groups_test))**2))
4.503111161611763
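The regularization parameter makes a big difference here, and a reasonable way to pick it is a plain validation loop over candidate values - a minimal sketch using only the fit/predict interface shown above (the candidate values are illustrative):
## hold out part of the training data for validation
X_tr, X_val, y_tr, y_val, g_tr, g_val = train_test_split(
    X_train, y_train, groups_train, test_size=0.25, random_state=42)
best_reg, best_mse = None, np.inf
for reg in [1e2, 1e3, 1e4, 1e5]:
    m = HierarchicalRegression(l2_reg=reg, cvxpy_opts={'solver':'ECOS'})
    m.fit(X_tr, y_tr, g_tr)
    mse = np.mean((y_val - m.predict(X_val, g_val))**2)
    if mse < best_mse:
        best_reg, best_mse = reg, mse
print(best_reg, best_mse)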
The choice of which variables to turn into groups with deviations in coefficients and which to use as plain categorical variables can be quite hard though - in this very simple example, a tuned regularized model with no group-varying coefficients ends up performing better:
from sklearn.linear_model import Ridge
y=df['ATTAIN']
X=df[['VRQ','SEX','SC','PID','SID']]
X['PID']=X.PID.astype(str)
X['SID']=X.SID.astype(str)
X=pd.get_dummies(X)
## same train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
lr=Ridge(100).fit(X_train, y_train)
print(np.mean((y_test-lr.predict(X_test))**2))
4.386018754899887
(With larger datasets and more variables, this might not necessarily be the case though, as illustrated in the next example.)
This larger example tries to predict sales at different Rossmann stores. The data comes from a Kaggle competition and is quite large compared to the previous dataset. This is not the typical time-series forecasting problem, as the rows here include the number of customers who visited each store on each day, along with information about the days and about competitors:
rossman=pd.read_csv('train.csv',engine='python')
rossman['StateHoliday']=rossman.StateHoliday.map(lambda x: str(x))
# excluding days with no sales
rossman=rossman.loc[rossman.Sales>0]
rossman.head()
| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 |
| 1 | 2 | 5 | 2015-07-31 | 6064 | 625 | 1 | 1 | 0 | 1 |
| 2 | 3 | 5 | 2015-07-31 | 8314 | 821 | 1 | 1 | 0 | 1 |
| 3 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 | 0 | 1 |
| 4 | 5 | 5 | 2015-07-31 | 4822 | 559 | 1 | 1 | 0 | 1 |
rossman.shape
(844338, 9)
stores=pd.read_csv('store.csv')
stores.head()
| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | c | a | 1270.0 | 9.0 | 2008.0 | 0 | NaN | NaN | NaN |
| 1 | 2 | a | a | 570.0 | 11.0 | 2007.0 | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |
| 2 | 3 | a | a | 14130.0 | 12.0 | 2006.0 | 1 | 14.0 | 2011.0 | Jan,Apr,Jul,Oct |
| 3 | 4 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 4 | 5 | a | a | 29910.0 | 4.0 | 2015.0 | 0 | NaN | NaN | NaN |
Doing some very basic feature engineering - there is a lot more that could be done, such as using time lags of sales and customers (a sketch of these appears after the next table), and incorporating more information about competitors, but the idea here is just to illustrate the usage of this package.
rossman['Year']=rossman.Date.map(lambda x: x[:4])
rossman['Month']=rossman.Date.map(lambda x: x[5:7])
rossman['DayOfWeek']=rossman.DayOfWeek.astype(str)
max_comp_dist=stores.CompetitionDistance.max()
## impute missing competition distances with the largest observed distance
stores.loc[stores.CompetitionDistance.isnull(),'CompetitionDistance']=max_comp_dist
rossman=pd.merge(rossman,stores[['Store','StoreType','Assortment','CompetitionDistance']],on='Store')
rossman.head()
| | Store | DayOfWeek | Date | Sales | Customers | Open | Promo | StateHoliday | SchoolHoliday | Year | Month | StoreType | Assortment | CompetitionDistance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 5 | 2015-07-31 | 5263 | 555 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 1 | 1 | 4 | 2015-07-30 | 5020 | 546 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 2 | 1 | 3 | 2015-07-29 | 4782 | 523 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 3 | 1 | 2 | 2015-07-28 | 5011 | 560 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
| 4 | 1 | 1 | 2015-07-27 | 6102 | 612 | 1 | 1 | 0 | 1 | 2015 | 07 | c | a | 1270.0 |
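As an illustration of the time-lag features mentioned earlier (not used in the models below), per-store lags of sales and customers could be built with a groupby and shift; this is a minimal sketch, assuming lags of one trading day are wanted:
## work on a sorted copy so that shift(1) gives the previous trading day per store
lagged=rossman.sort_values(['Store','Date']).copy()
lagged['PrevSales']=lagged.groupby('Store')['Sales'].shift(1)
lagged['PrevCustomers']=lagged.groupby('Store')['Customers'].shift(1)
## the first row of each store has no lag and would need dropping or imputing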
As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables. Note that this model is linear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a long time. The models in the winning solutions for this competition took over a day to fit, according to their authors.
%%time
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from scipy.sparse import coo_matrix, csr_matrix, hstack
y=rossman['Sales']
X=rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]
Xcateg=rossman[['Store', 'DayOfWeek', 'StoreType','Assortment','Year','Month']]
Xcateg['Store']=Xcateg.Store.astype(str)
Xcateg=coo_matrix(pd.get_dummies(Xcateg).to_numpy())
X=hstack([pd.get_dummies(X).to_numpy(),Xcateg])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=100)
lr=Ridge()
lr.fit(csr_matrix(X_train),y_train)
preds_lr=lr.predict(X_test)
print(np.sqrt(np.mean((y_test-preds_lr)**2)))
2678.6297086
Wall time: 11min 23s
Now fitting a model with coefficients that vary by store - note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and uses around 4 GB of RAM. Nevertheless, this results in a huge improvement over the previous model:
%%time
from hierreg import HierarchicalRegression
y=rossman['Sales']
X=rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance','Year','Month']]
group=rossman[['Store','StoreType','Assortment']]
X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(
pd.get_dummies(X), y, group, test_size=0.33, random_state=100)
## for larger datasets, the casadi interface (which uses IPOPT) can provide better running times, but it only supports the default parameters
hlr=HierarchicalRegression(solver_interface='casadi')
hlr.fit(X_train,y_train,group_train)
preds_hlr1=hlr.predict(X_test,group_test)
print(np.sqrt(np.mean((y_test-preds_hlr1)**2)))
******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
Ipopt is released as open source code under the Eclipse Public License (EPL).
For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************
673.711325158
Wall time: 7h 21min 50s