#!/usr/bin/env python
# coding: utf-8

# # Example usage of hierreg
# ***
# This IPython notebook tries to illustrate the usage of the [hierreg](https://github.com/david-cortes/hierreg) package, which fits linear models with coefficients that can vary according to 'groups' or levels to which the observations belong, similar to random-effects and hierarchical Bayes models, but with the group deviations estimated by applying regularization to them.
# 
# In a typical regression model, coefficients are estimated so as to minimize
# $$ L(w) = \lVert y - Xw \rVert $$
# Where
# * $X$ is the matrix of predictors/covariates
# * $y$ is the variable to predict
# * $w$ are the model parameters (chosen to minimize this metric)
# 
# This model assumes that observations can belong to arbitrary categorical groups (e.g. observations coming from the same geographical region, same year, etc.), and that the coefficients differ across these groups, but not by too much, which is achieved by minimizing
# 
# $$ L(w,v) = \lVert X \Big( w + \sum_{g \in groups} v_{g}\, I_{[x \in g]} \Big) - y \rVert + \lambda \, \lVert \mathrm{groupweights} \cdot v \rVert $$
# Where
# * $X$ is the matrix of predictors/covariates
# * $y$ is the variable to predict
# * $\lambda$ is a regularization parameter
# * $w$ are the main model parameters (analogous to fixed effects)
# * $v_{g}$ are the deviations in the coefficients for the group each observation belongs to
# * $I_{[x \in g]}$ is an indicator column-matrix with ones in the rows of $X$ that belong to a given group
# * $\mathrm{groupweights}$ is a weight vector giving each group's deviations a weight inversely proportional to the number of observations belonging to that group
# 
# While this type of model doesn't offer the same inferential or statistical-testing capabilities as random effects or hierarchical Bayes, it can end up making better predictions than either alternative, depending on the problem.
# 
# (Logistic regression is also supported; the model needs to be called with the argument `problem='classification'`.)
# 
# ***
# ### Package in action
# 
# [1. Small example - Predicting student attainment](#example1)
# 
# [2. Larger example - Predicting sales at different stores](#example2)
# ***

# ## 1. Small example - Predicting student attainment
# 
# Here I'll try to build a model to predict student attainment based on standardized test scores, family information and the schools that children attended - the data was downloaded from here:
# http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc
# 
# Description from the webpage:
# ##### _2LEV-XC_
# 
# _The data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set of which the following are used._
# 
# * _VRQ: A verbal reasoning score from tests pupils took when they entered secondary school_
# * _ATTAIN: Attainment score of pupils at age 16_
# * _PID: Primary school identifying code_
# * _SEX: Pupil's gender_
#     * _0 = boy_
#     * _1 = girl_
# * _SC: Pupil's social class scale (continuous scale score from low to high social class)_
# * _SID: Secondary school identifying code_
# 
# Taking a look at the data:

# In[1]:

import pandas as pd, numpy as np, warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('2lev-xc.txt', names=['VRQ','ATTAIN','PID','SEX','SC','SID'], sep=r'\s+', engine='python')
df.head()
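# Since the group deviations are weighted in inverse proportion to group size, it is also worth a quick look at how many pupils each school contributes. This is just a quick check using standard pandas calls; the exact numbers depend on the data file:

# In[ ]:

print(df.groupby('PID').size().describe())
print(df.groupby('SID').size().describe())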
# There are, overall, over a hundred primary schools and over a dozen secondary schools. For a dataset of this size, this is a very large number of parameters:

# In[2]:

print(len(set(df.PID)))
print(len(set(df.SID)))

# Now fitting a model with school-varying coefficients:

# In[3]:

from hierreg import HierarchicalRegression
from sklearn.model_selection import train_test_split

y = df['ATTAIN']
X = df[['VRQ','SEX','SC']]
groups = df[['PID','SID']]

## simple train-test split - note that I'm using a random seed to get the same split again later
X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(
    X, y, groups, test_size=0.33, random_state=42)

## for small problems like this, solver 'ECOS' works faster than the default 'SCS'
## it would still work fine without changing the default options, only a bit slower
## note the large regularization parameter
hr = HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})
hr.fit(X_train, y_train, groups_train)
print(np.mean((y_test - hr.predict(X_test, groups_test))**2))

# The choice of which variables to turn into groups with deviations in the coefficients, and which to use as plain categorical variables, can be quite hard though - in this very simple example, a tuned regularized model with no group-varying coefficients ends up performing better:

# In[4]:

from sklearn.linear_model import Ridge

y = df['ATTAIN']
X = df[['VRQ','SEX','SC','PID','SID']]
X['PID'] = X.PID.map(lambda x: str(x))
X['SID'] = X.SID.map(lambda x: str(x))
X = pd.get_dummies(X)

## same train-test split as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

lr = Ridge(100).fit(X_train, y_train)
print(np.mean((y_test - lr.predict(X_test))**2))

# (With larger datasets with more variables, this might not necessarily be the case, as illustrated in the next example.)

# ***

# ## 2. Larger example - Predicting sales at different stores
# 
# This larger example tries to predict sales at different Rossmann stores. The data comes from a [Kaggle competition](https://www.kaggle.com/c/rossmann-store-sales) and it's quite large compared to the previous dataset. This is not the typical time-series forecasting problem, as the rows here include the number of customers who visited each store on each day, along with information about the days and about competitors:

# In[5]:

rossman = pd.read_csv('train.csv', engine='python')
rossman['StateHoliday'] = rossman.StateHoliday.map(lambda x: str(x))

# excluding days with no sales
rossman = rossman.loc[rossman.Sales > 0]
rossman.head()

# In[6]:

rossman.shape

# In[7]:

stores = pd.read_csv('C:\\ipython\\mixed effects\\rossman sales\\store.csv')
stores.head()

# Doing some very basic feature engineering - there is a lot more that could be done, such as using time lags of sales and customers, or incorporating more information about competitors, but the idea here is just to illustrate the usage of this package.

# In[8]:

## extract year and month from the date string
rossman['Year'] = rossman.Date.map(lambda x: x[:4])
rossman['Month'] = rossman.Date.map(lambda x: x[5:7])
rossman['DayOfWeek'] = rossman.DayOfWeek.map(lambda x: str(x))

## fill missing competition distances with the maximum observed distance
max_comp_dist = stores.CompetitionDistance.max()
stores.loc[stores.CompetitionDistance.isnull(), 'CompetitionDistance'] = max_comp_dist

rossman = pd.merge(rossman, stores[['Store','StoreType','Assortment','CompetitionDistance']], on='Store')
rossman.head()
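# The feature engineering above is deliberately minimal. As a sketch of the kind of time-lag feature mentioned above (not used in the models that follow), a per-store lag of sales can be built with plain pandas:

# In[ ]:

## sales on each store's previous recorded day (NaN for a store's first row)
## the 'Date' strings are in YYYY-MM-DD format, so sorting them lexicographically is also chronological
sales_lag1 = rossman.sort_values(['Store','Date']).groupby('Store')['Sales'].shift(1)
sales_lag1.head()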
# As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables in the model. Note that this model is linear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a very long time. The models in the winning solutions for this competition took over a day to fit, according to their authors.

# In[9]:

get_ipython().run_cell_magic('time', '', "from sklearn.linear_model import Ridge\nfrom sklearn.model_selection import train_test_split\nfrom scipy.sparse import coo_matrix, csr_matrix, hstack\n\ny=rossman['Sales']\nX=rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]\nXcateg=rossman[['Store', 'DayOfWeek', 'StoreType','Assortment','Year','Month']]\nXcateg['Store']=Xcateg.Store.map(lambda x: str(x))\nXcateg=coo_matrix(pd.get_dummies(Xcateg).as_matrix())\nX=hstack([pd.get_dummies(X).as_matrix(),Xcateg])\nX_train, X_test, y_train, y_test = train_test_split(\n X, y, test_size=0.33, random_state=100)\n\nlr=Ridge()\nlr.fit(csr_matrix(X_train),y_train)\npreds_lr=lr.predict(X_test)\nprint(np.sqrt(np.mean((y_test-preds_lr)**2)))\n")

# Now fitting a model with coefficients that vary by store - note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and uses around 4GB of RAM. Nevertheless, it results in a huge improvement over the previous model:

# In[10]:

get_ipython().run_cell_magic('time', '', "from hierreg import HierarchicalRegression\n\ny=rossman['Sales']\nX=rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance','Year','Month']]\ngroup=rossman[['Store','StoreType','Assortment']]\nX_train, X_test, y_train, y_test, group_train, group_test = train_test_split(\n pd.get_dummies(X), y, group, test_size=0.33, random_state=100)\n\n## for larger datasets, casadi --> IPOPT can provide better running times, but only supports the default parameters\nhlr=HierarchicalRegression(solver_interface='casadi')\nhlr.fit(X_train,y_train,group_train)\npreds_hlr1=hlr.predict(X_test,group_test)\nprint(np.sqrt(np.mean((y_test-preds_hlr1)**2)))\n")
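# ***
# As mentioned at the beginning, the package also supports logistic regression when the model is created with `problem='classification'`. Below is a minimal sketch of what that would look like, reusing the smaller student dataset from the first example; the binary target (attainment above the median) is made up purely for illustration:

# In[ ]:

from hierreg import HierarchicalRegression

## hypothetical binary target, for illustration only
y_bin = (df['ATTAIN'] > df['ATTAIN'].median()).astype(int)
X_small = df[['VRQ','SEX','SC']]
groups_small = df[['PID','SID']]

hclf = HierarchicalRegression(problem='classification')
hclf.fit(X_small, y_bin, groups_small)
preds_clf = hclf.predict(X_small, groups_small)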