`pysal.spreg`

: "spatial regression like a Pro"`<[email protected]>`

In this session, we will walk through most of the functionality that `pysal`

has to offer when it comes to spatial regression. We will repeat the exercise we have previously seen in GeoDaSpace using PySAL entirely. In the process, we will discover some of the details that GeoDaSpace misses in order to have an intuitive interface and, by showing some extra tricks, will hopefully convince you of the extra power and flexibility that the command line offers at the cost of a slightly steeper learning curve. The three parts in which we will split the session are the following:

- Load up the data into Python
- Replication of the regression analysis in GeoDaSpace
- Some extra tricks you can access if you use
`pysal.spreg`

This is the very first step you need to go through in order to perform any data analysis with Python. It is the equivalent to load up a dbf of a csv file in GeoDaSpace, onlyt that instead of clicking on a GUI, we will run commands. Before anything, we have to bring `pysal`

to our sesion. To make it shorter to type, we will import it with the alias `ps`

. Because ultimately, `pysal`

takes NumPy arrays, we will also import numpy, shortening with the common `np`

:

In [1]:

```
import pysal as ps
import numpy as np
```

`csv`

file under the path `../data_workshop/phx.csv`

. Make sure that you have navigated to the right folder or pass the full path of the file (alternatively, you can use relative paths as well, as we are doing in this example). We will load them in memory by calling the command open in `pysal`

, which mirrors the general file I/O in Python:

In [2]:

```
db = ps.open('../workshop_data/phx.csv')
```

`db`

which contains all our data. We can explore it a little bit to get a feel of what it contains:

In [3]:

```
len(db) #How many observations
```

Out[3]:

In [4]:

```
db.header #Names of the columns
```

Out[4]:

In [5]:

```
db[0, :] #First row
```

Out[5]:

In [6]:

```
db[0:5, 0] #First six elements of the first column
```

Out[6]:

`pysal`

has a very handy utility for the task, which will return a list with the elements of the columns. We can also get some basic statistics using NumPy:

In [7]:

```
pdens = db.by_col('pop_dens')
pdens[0:5] #First six elements of the population density column
```

Out[7]:

In [8]:

```
min(pdens) #Minimum value
```

Out[8]:

In [9]:

```
max(pdens) #Maximum value
```

Out[9]:

In [10]:

```
np.mean(pdens) #Average value
```

Out[10]:

In [11]:

```
np.var(pdens) #Variance
```

Out[11]:

`spreg`

, we need to convert the data we will be using in our models into NumPy arrays, which are efficient data structures that allow very good performance for matrix operations. Also, these arrays need to be 2D so we will reshape them when necessary. Let us use the log of the `pct_error`

as the dependent variable (y) and the same list of variables as in the GeoDaSpace example for the regressors:

In [12]:

```
y = np.array(db.by_col('l_pct_err')).reshape((len(db), 1))
x_names = ['hsu', 'pop_dens', 'white_rt', 'black_rt', 'hisp_rt', \
'fem_nh_rt', 'renter_rt', 'vac_hsu_rt']
x = np.array([db.by_col(var) for var in x_names]).T
```

`knn`

weights with six neighbors
(note you might get a warning because it does not find IDs matching in the shapefile in the folder).
We also row-standardize the matrix so every row sums to one and the spatial lag can be interpreted
as the average value of the neighbors.

In [13]:

```
w = ps.open('../workshop_data/phx_knn06.gwt').read()
w.transform = 'R'
```

We are all set to start the regression analysis. As a benchmark, we will begin with a base model using the command `OLS`

. We will also pass in a weights object and set the flag so we obtain also spatial diagnostics about the residuals.

In [14]:

```
from pysal.spreg import OLS
model = OLS(y, x, w=w, name_x=x_names, spat_diag=True)
print model.summary
```

At first sight, it is not too bad: we are explaining over 30% of the variation in the dependent variable and the F test reveals some significance of the coefficients overall. However, before we consider the coefficients, it is important to check that the model we have just run is correct and that OLS is a good estimation procedure for this dataset. There are several parts in this output report that should rise a red flag when we look at them.

In first place, the `SPECIFICATION ROBUST TEST`

(White test) section has not been computed due to multicolinearity. If we look at the multicolinearity condition number we see it is almost 46. This is high, in fact beyond 30 which is the limit at which `pysal.spreg`

stops computing the White test. However, it is not terribly high (although the rule of thumb is 30, below 100 may still be feasible) so for the time being, we will let it be.

The assumption of normality of the residuals is one of the most important ones in linear regression when the size of your sample is limited. However, as `n`

grows, this is less relevant due to the central limit theorem (CLM). Since we have almost 1000 observations, we assume we can rely on the CLT.

Two other problems still remain, heteroskedasticity and spatial dependence, and it is on them that we will focus the most. Both diagnostics against heteroskedasticity clearly point to the presence of the problem. One traditional way to tackle this issue is by estimating a robust VC matrix following White's procedure. Let us do just that before we get our hands dirty with any spatial issue. Since we have already seen them, let us skip the computation of the diagnostics in this run:

In [15]:

```
from pysal.spreg import OLS
model = OLS(y, x, w=w, name_x=x_names, spat_diag=False, nonspat_diag=False, robust='white')
print model.summary
```

`fem_nh_rt`

) becomes insignificant at the 1% level.

At this point, we are ready to consider the issue of spatial spatial dependence and explicitly incorporate space into our model. In order to better focus on them, we will print the part of the regression output that concerns only to the spatial diagnostics. This is a somewhat hidden feature, but it can prove very useful in some contexts, for instance when you are considering several types of weights on the same model. Mind that this trick requires you to re-run the OLS every time so it is not very efficient in terms of speed up (for that, see below in the extra tricks section). However, it comes in very handy to inspect the results.

In [16]:

```
from pysal.spreg.user_output import summary_spat_diag
model = OLS(y, x, w=w, name_x=x_names, spat_diag=True)
spd = summary_spat_diag(model, None, None)
print spd
```

The first thing we can observe is that the non-robust versions of the LM tests all point to severe presence of spatial autocorrelation. The problem of having both tests (error and lag) rejecting the null is that they can be misleading at the presence of the other specification. In other words, if the true DGP contains say an error structure, the result of the LM-Lag is affected and viceversa. For that reason, we have to turn to the robust versions of the LM which, as they name reads, are robust to the presence of the other model. At this point, things become clearer: while the robust version of the LM-Lag is still significant, the error counterpart is not.

Before we move into the right direction, let us assume that we did not look at the robust versions and decided to try first with one of the simplest spatial specifications: the error model. This is very easy to run in PySAL thanks to the method `GM_Error`

, which implements the traditional Kelejan and Prucha 1998/99 papers:

In [17]:

```
from pysal.spreg import GM_Error
error_kp98 = GM_Error(y, x, w, name_x=x_names)
print error_kp98.summary
```

Note there is a change in the output print with respect to the OLS. Instead of R squared, we now have the *pseudo* R
squared. This is simply the correlation coefficient between the dependent variable (`y`

) and the predicted values of our
model. Although it is not strictly comparable to the traditional one, it is still a good guidance of *how right* our model
performs.

One inconvenient of this procedure is that it only obtains a point estimate of the spatial parameter
and hence does not allow to do inference on it. In a recent paper, Drukker et al. (2010) present an improvement
on the techinique that allows to obtain inference even on the spatial parameter. PySAL implements it in the
method `GM_Error_Hom`

, where the `Hom`

is named after the assumption of homoskedasticity made in the model.

In [18]:

```
from pysal.spreg import GM_Error_Hom
error_hom = GM_Error_Hom(y, x, w, name_x=x_names)
print error_hom.summary
```

`lambda`

.
Apparently in contrast with what we would expect from the LM tests on spatial autocorrelation, the coefficient appears
highly significant. We will provide an explanation for this below when we delve into more complicated models but, before
that, let us present a third estimation procedure implemented in PySAL for a spatial autoregressive error process. If we
remember from the OLS output print, there were significant signs of heteroskedasticity, and the error models we have fit
so far assume homoskedasticity. In a very recent reference, Arraiz et al. (2010) propose a procedure to estimate error models
that is consistent to the presence of heteroskedasticity. PySAL implements this in the `spreg`

module `error_sp_het`

and it is
as simple to use as the rest we have seen:

In [19]:

```
from pysal.spreg import GM_Error_Het
error_het = GM_Error_Het(y, x, w, name_x=x_names)
print error_het.summary
```

This procedure only differs from `GM_Error_Hom`

in the moments and VC matrix, so all the coefficients except for
the spatial parameter remain unchanged. The standard errors however slightly change.

Imagine that, after examining the LM tests, we had decided for a lag model instead of the error.
Running a spatial lag model is equally simple in PySAL thanks to the method GM_Lag. Note how, in this case,
`w`

has to be referenced. This is because we might be passing in additional instruments and hence it is not
clear that the third place is for the weights object.

In [20]:

```
from pysal.spreg import GM_Lag
lag_model = GM_Lag(y, x, w=w, name_x=x_names, spat_diag=True)
print lag_model.summary
```

As we can see, the parameter for the lag of the dependent variable is positive and significant, similar to the error case.
A new addition to the output print is the *spatial pseudo R squared*. In the lag model, we can obtain two different predicted
values: the *naive* ones and the complete ones. The former, used for the pseudo R squared, include only the first lag of the
dependent variable to predict the value of `y`

; the latter, used in the spatial pseudo R squared, employs the reduced form of
the model, thus incorporating all the feedback effects implicit in the model and offering a more correct estimate. In either case,
we can see how the lag model seems to do a better job with this dataset.

Note also that we asked for spatial diagnostics of the residuals. PySAL implements the Anselin-Kelejian (AK) test for residuals of an IV estimation. The test is a modification of the LM-Error and is used to check for remaining spatial autocorrelation in the residuals of a regresion with instruments. Consistent with the initial LM tests from above, the AK does not find any significant spatial correlation, pointing again to the lag model as the preferred one.

The estimation procedure that `GM_Lag`

employs is a particular case of the traditional Instrumental Variables (IV) approach,
in which the endogeneity of the spatial lag of the variable is dealt with by using instruments. Kelejian and Prucha (1998-99)
show that the optimal instruments for a lag model are the spatial lag of the exogenous variables, and `GM_Lag`

follows their
guidance. A less settled debate is how many orders of lags are optimal: just the first one (`WX`

)? two (`WX`

and `WWX`

)?
Very much like GeoDaSpace, PySAL uses only the first lag by default, but it allows you to modify it if you wish. Suppose we
preferred to include two lags, the call would then be:

In [21]:

```
lag_model2lags = GM_Lag(y, x, w=w, name_x=x_names, w_lags=2)
print lag_model2lags.summary
```

As mentioned before, this procedure is a particular case of IV estimation where the instruments are spatial.
To demonstrate how they are created, we will replicate what `GM_Lag`

builds internally and will pass it to the generic two
stages least squares method in `spreg`

. This will allow us
to discover some more functionality in PySAL and, at the same time, to check that we actually obtain the same numbers as with
the default.

We will try to match the default settings, so we will only need to compute the first lag of `X`

. PySAL has a generic function
(`lag_spatial`

) that will do the heavy lifting in a very efficient way for us, and it works both with single vectors or with matrices:

In [22]:

```
from pysal import lag_spatial
from pysal.spreg import TSLS
wy = lag_spatial(w, y) #Get the spatial lag of y
wx = lag_spatial(w, x) #Get the lag of x to use as instruments
iv_model = TSLS(y, x, yend=wy, q=wx, name_x=x_names)
print iv_model.summary
```

As you can see for yourself, the results of `iv_model`

exactly match those from `lag_model`

.

At this point, we have run a non-spatial model and found significant evidence of spatial autocorrelation;
we have then run lag and error models separately to try account for it. Although the robust LM tests pointed to a
lag model, we found significant the `lambda`

coefficient in the error model, so it would not be unrealistic to
think that a full model with both spatial lag and error would be a good fit. In this section we will show what
options are available in PySAL to construct more complicated models that include spatial effects in the dependent
variable as well as in the error term. In particular, we will use two main specifications: a SARAR model that models
autoregressive terms in both the lag and the error; and a combination of a lag model with the Spatial Heteroskedasticity
and Autocorrelation Consistent (SHAC) variance covariance matrix.

Our first option is to fit a model that introduces autoregressive parameters for the lag and the error. PySAL offers
combinations of the lag model we have seen with all the error models presented before. For the sake of simplicity, and
without loss of generality, we will stick to our last choice (`GM_Error_Het`

) to blend it with the lag, since it offers
the possibility of inference in the spatial parameter and accounts for heteroskedasticity, which we have present in the
dataset. These models are called `Combo`

in PySAL, to allude to the fact they combine a lag and an error.

In [23]:

```
from pysal.spreg import GM_Combo_Het
sarar_het = GM_Combo_Het(y, x, w=w, name_x=x_names)
print sarar_het.summary
```

Note how, once the spatial lag is introduced, the spatial parameter in the error term becomes insignificant.
We can now close the argument we started with the LM tests and seemed to counter-argue with the error model:
the best specification for this dataset appears to be a lag model; however, if a spatial error parameter is
introduced instead of a lag parameter, part of the spatial effects in the model are pushed to the error and
picked up by `lambda`

, which becomes significant for the lack of a better structure to model space (i.e. lag).
When such preferable structure is introduced and the spatial effects modelled properly, the error does not have
remaining spatial autocorrelation and the parameter becomes not significant.

The problem of the straight lag model is that it does not account for heteroskedasticity, which is a problem with our data. To solve the issue, we can use the White correction in the same way we did with OLS:

In [24]:

```
lag_white = GM_Lag(y, x, w=w, name_x=x_names, robust='white')
print lag_white.summary
```

Since White is a VC matrix correction, only the standard errors will be affected. Compared to the initial lag model,
standard errors are in general larger, which results in `fem_nh_rt`

becoming insignificant at the 1% level.

Now that we have our best model, we can compare the coefficients with the ones obtained previously to assess the effect
of the spatial effects present in the data on our estimations and conclusions. If we look at the estimates of `lag_white`

in comparison with either the OLS or error models, we find they tend to be smaller in the former. This means that the
effect of space makes models that do not account for it overestimate the importance of our regressors in explaining
the dependent variable. Just by looking at
the actual numbers, it might seem like the changes are not that large, but keep in mind the scale issues and the fact
that our dependent variable is expressed in logs. If you count both in, depending on the motivation of the model, the
changes may not seem that tiny.

As a final note for completeness, we will showcase one more model available in PySAL. As mentioned before, Arraiz et al. (2010)
present a nonparametric procedure to estimate the VC matrix that accounts for both heteroskedasticity *and* spatial
autocorrelation. Since it is a correction of the VC matrix and hence only affects the standard errors, we can use it with
either non-spatial models or with a spatial lag model. The call we need to make is not very different from that for White,
but before we can do that we need to create a kernel weights object, for which we will use 10 neighbors and, for this example,
a quadratic kernel function.

In [25]:

```
wk = ps.open('../workshop_data/phx_k_triangular.kwt', dataFormat='gwt').read()
lag_hac = GM_Lag(y, x, w=w, name_x=x_names, robust='hac', gwk=wk)
print lag_hac.summary
```

As it can be seen, the standard errors do not differ much from those obtained in `lag_white`

. This is because
the SHAC does as good of a job as the White at correcting for heteroskedasticity and, since residual spatial
autocorrelation is not an issue once the spatial lag is included, there is very little room left for the
SHAC to improve the White.

In order to complete the overview of `pysal.spreg`

, we have one more substantive feature to show. Each of the models
we have just used (non-spatial, spatial error, lag and combo) allow the user to pass not only exogenous regressors,
but also endogenous variables. This is implemented again using IV estimation, so any of the models we have just seen
may be turned into a two-stage least squares procedure in which you can allow some of you explanatory variables to be
endogenous and use other variables to instrument for them.

As a mere example, we will consider now that the variable population density (`pop_dens`

) is endogenous and that we
want to instrument for it using the inverse of land area. This means we have to re-factor our matrix `x`

and create two
new ones (which in this case will be of dimension n by 1): one for endogenous variables (`yend`

) and one for the
instruments (`q`

). We also can use this example to show a bit more how to manipulate data in a *pythonic* way. It takes
a bit of house-keeping, but it is not complicated:

In [26]:

```
#Pull out the X variables without including population density
x_names = ['hsu', 'white_rt', 'black_rt', 'hisp_rt', \
'fem_nh_rt', 'renter_rt', 'vac_hsu_rt']
x = np.array([db.by_col(var) for var in x_names]).T
#Create inverse of land area
area = np.array(db.by_col('ALAND10')).reshape((x.shape[0], 1))
inv_area = 1. / area
#Create the matrix for the instruments
q_names = ['inv_area']
q = inv_area
#Create the matrix for the endogenous variable
yend_names = ['pop_dens']
yend = np.array([db.by_col(var) for var in yend_names]).T
```

`pysal.spreg`

, showing you yet another way to access the code (note that some of the models include
now the name `Endog`

for endogeneity).

In [27]:

```
#Non-spatial model (w only required for spatial diagnostics
model = ps.spreg.TSLS(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
#KP98-99 Error model
model = ps.spreg.GM_Endog_Error(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
#Drukker et al. error model (Hom)
model = ps.spreg.GM_Endog_Error_Hom(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
#Arraiz. et al. error model (Het)
model = ps.spreg.GM_Endog_Error_Het(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
#Lag model
model = ps.spreg.GM_Lag(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
#Combo model
model = ps.spreg.GM_Combo_Het(y, x, w=w, yend=yend, q=q, \
name_x=x_names, name_yend=yend_names, name_q=q_names)
```

As we have just shown, pysal.spreg can do everything that GeoDaSpace does; in fact, all the core functionality from GeoDaSpace is actually powered by the same code. At this point, you might be wondering why then make the investment to learn it this way, when you can click through the dialogs in GeoDaSpace. In this section, we will discuss some of the features that you can enjoy only if you use the command line and will demonstrate it using the spatial diagnostics we saw before as an example.

GeoDaSpace aims at being a simple, easy to use interface that puts all the advanced spatial econometrics implemented in PySAL one click away. This is partly done at the cost of setting some pysal.spreg options to reasonable defaults. In cases when we do need to change those defaults because the problem at hand requires some customization of estimation procedures, the command line is the way to go. In this document, we will not detail one by one all these options; instead, we will review how they can be found out and accessed.

PySAL takes documentation very seriously, and the spreg module is no exception. The inline help provided is so good that, with some basic Python knowledge at hand, it allows to explore all the functionality from the command line, right in place. The documentation has a more or less fixed structure that repeats throughout the modules:

- Basic description of the functionality.
- Parameters to be passed to the method.
- Attributes / output returned by the method.
- Examples of how to use the code.
- References to works cited before.

Accessing this information will detail every parameter and tweak that the code allows and it will tell us how to use it, so it is the best way to discover whether the customization we are after is implemented in PySAL. There are several ways to access the documentation, but two are the most preferred ones:

- Inline help from the Python interpreter: simply type
`help(pysal.function)`

and the documentation for`pysal.function`

will be displayed. Let us exemplify it with OLS:

In [28]:

```
help(ps.spreg.OLS)
```

As you can see, the output is significant and it details all the parameters you can tweak (e.g. sig2n_k).

- Online help: pysal.org hosts not only the code but an HTML version of the documentation. Very handy if we want to quickly google some function.

This may seem obvious but is not: the command line gives us programatic access to the code. This means that every action we perform can be saved (e.g. in a script) and that we no longer have to rely on what we clicked or did not click to know where the output comes from. Also, programatic access to the core of GeoDaSpace opens new options that the GUI does not offer: reproducibility (an important aspect of an open science) is much easier; we can use part of our code in bigger projects or re-use it over and over without re-doing the work for ourselves; we can now run it from a remote server (with more power than our desktop) and then use it for larger datasets, for example.

Probably one of the most useful options that the command line has to offer is the possibility to run the models we have seen in loops so we can use them as part of simulations or to quickly and efficiently try several specifications (e.g. different weights, variable combinations, etc.). As a very simple example of how to implement these ideas, we will walk through the following case: imagine that we want to see how different weights make results change in the spatial diagnostics we have shown before. To do that, we only need to run the OLS once because it is aspatial, but we have to compute the LM tests once for each weights specification. This is a perfect case to use a for loop, let us see how we would implement it (in order to see the results, we will also setup a little printout for ourselves):

In [29]:

```
#Import the LM tests method
from pysal.spreg import LMtests
#Specify the files for the weights we want to try as a list
w_files = ['../workshop_data/phx_knn06.gwt', \
'../workshop_data/phx_knn10.gwt', \
'../workshop_data/phx_rook.gal', \
'../workshop_data/phx_queen.gal']
#Run the OLS
model = ps.spreg.OLS(y, x, spat_diag=False, nonspat_diag=False)
#Setup the loop over the weights files
for w_file in w_files:
print 'File: ', w_file
w = ps.open(w_file).read()
lms = LMtests(model, w)
print '\tLM error: %.4f\t(%.4f)'%lms.lme
print '\tLM lag: %.4f\t(%.4f)'%lms.lml
print '\tSARMA: %.4f\t(%.4f)'%lms.sarma
print '\tRobust LM error: %.4f\t(%.4f)'%lms.rlme
print '\tRobust LM lag: %.4f\t(%.4f)'%lms.rlml
```

Note that to perform this, we only had to run the OLS model once, gaining some speed up that becomes very important if the size of the dataset is very large.

This is only a very simple example of how scripting and programatic access can be a powerful element to help us in our analysis. With these tools, the possibilities of building fairly complicated analysis or systems become not only feasible, but reachable. Feel free to tinker and hack with pysal.spreg as much as you want, the limit is the sky!