%matplotlib inline
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
pd.options.display.max_rows = 10
df = (sns.load_dataset('titanic')
.dropna(subset=['survived', 'age']))
df.head()
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
mod = sm.Logit.from_formula('survived ~ age', df)
res = mod.fit()
me = res.get_margeff()
Optimization terminated successfully. Current function value: 0.672429 Iterations 4
res.summary()
| Dep. Variable: | survived | No. Observations: | 714 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 712 |
| Method: | MLE | Df Model: | 1 |
| Date: | Fri, 11 Sep 2015 | Pseudo R-squ.: | 0.004445 |
| Time: | 08:16:34 | Log-Likelihood: | -480.11 |
| converged: | True | LL-Null: | -482.26 |
| | | LLR p-value: | 0.03839 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.0567 | 0.174 | -0.327 | 0.744 | -0.397 | 0.283 |
| age | -0.0110 | 0.005 | -2.057 | 0.040 | -0.021 | -0.001 |
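The age coefficient is on the log-odds scale, so exponentiating gives an odds ratio (a quick aside, with the coefficient hardcoded from the table above):

```python
import numpy as np

# Coefficient on age, copied (rounded) from the summary table above
beta_age = -0.0110

# Each additional year of age multiplies the odds of survival by ~0.989,
# i.e. roughly a 1.1% drop in the odds per year
odds_ratio = np.exp(beta_age)
print(odds_ratio)
```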
me.summary()
| Dep. Variable: | survived |
|---|---|
| Method: | dydx |
| At: | overall |

| | dy/dx | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| age | -0.0026 | 0.001 | -2.081 | 0.037 | -0.005 |
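The `overall` marginal effect averages $\lambda(x^T\hat{\beta})\,\hat{\beta}_{age}$ across observations. As a rough sanity check (assumptions: a mean age of about 29.7, and the rounded coefficients from the tables above), the effect evaluated at the mean comes out close to the reported -0.0026:

```python
from scipy import stats

# Rounded estimates copied from the summary table above
b0, b_age = -0.0567, -0.0110
mean_age = 29.7  # approximate sample mean age (an assumption)

z = b0 + b_age * mean_age
# dΛ/dz is the logistic pdf, so the marginal effect of age at this point is:
me_at_mean = stats.logistic().pdf(z) * b_age
print(me_at_mean)  # close to the reported overall dy/dx of -0.0026
```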
Using Greene's notation here (around page 736). The model is $\hat{F} = \Lambda(x^T \hat{\beta})$, where $\Lambda$ is the logistic CDF.
We want a confidence interval for the predicted probability (the CEF). We have

$\operatorname{Asy.Var}[\hat{F}] = [\partial \hat{F} / \partial \hat{\beta}]^T \mathbf{V}\, [\partial \hat{F} / \partial \hat{\beta}]$
And the vector of derivatives is
$[\partial \hat{F} / \partial \hat{\beta}] = [d\hat{F}/dz]\,[\partial z / \partial \hat{\beta}] = \hat{f}\, x$

where $\hat{f} = \hat{\Lambda}(1 - \hat{\Lambda})$ (which is just the logistic pdf).
And so
$\operatorname{Asy.Var}[\hat{F}] = \hat{f}^2\, x^T \mathbf{V} x$
Note that the $x$ here is a single row of the design matrix, written (as vectors always are) as a column vector, so $x$ is $K \times 1$, where $K$ is the number of predictors (2 for us, counting the intercept).
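As a quick numerical check on the $\hat{f} = \hat{\Lambda}(1 - \hat{\Lambda})$ identity used above, the logistic pdf really does equal $\Lambda(1-\Lambda)$:

```python
import numpy as np
from scipy import stats

dist = stats.logistic()
z = np.linspace(-3, 3, 7)
# The derivative of the logistic CDF is Λ(z)(1 - Λ(z)), i.e. its pdf
assert np.allclose(dist.pdf(z), dist.cdf(z) * (1 - dist.cdf(z)))
```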
import numpy as np
from scipy import stats

Λ = lambda x: stats.logistic().cdf(x)
λ = lambda x: stats.logistic().pdf(x)

β_ = res.params.values.reshape(-1, 1)
V_ = res.cov_params().values

# The Asy. Var equation from above: λ(xᵀβ)² · xᵀ V x.
# Only works for a single observation (x is K×1).
def var_π(x, β, V):
    return λ(x.T.dot(β))**2 * x.T.dot(V).dot(x)
# For multiple observations; the quadratic forms are still a Python loop.
def var_πs(xx, β, V):
    α = λ(xx.dot(β))**2
    out = np.empty((len(xx), 1))
    for i, x in enumerate(xx):
        out[i] = x.T.dot(V).dot(x)
    return α * out
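The loop can be removed entirely with `np.einsum`, which computes all the row-wise quadratic forms $x_i^T \mathbf{V} x_i$ at once (a sketch; `var_pi_vec` and `λ_pdf` are my names, not from the text):

```python
import numpy as np
from scipy import stats

λ_pdf = stats.logistic().pdf

def var_pi_vec(xx, β, V):
    # Row-wise quadratic forms xᵢᵀ V xᵢ, computed in one shot:
    # sum over j, k of xx[i, j] * V[j, k] * xx[i, k]
    quad = np.einsum('ij,jk,ik->i', xx, V, xx).reshape(-1, 1)
    return λ_pdf(xx.dot(β))**2 * quad
```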
import matplotlib.pyplot as plt

# A grid of ages spanning the observed range, plus a constant column.
xx = sm.add_constant(np.linspace(df.age.min(), df.age.max(), 500).reshape(-1, 1))
πs = Λ(xx.dot(β_))
se = np.sqrt(var_πs(xx, β_, V_))

plt.plot(xx[:, 1], πs)
plt.fill_between(xx[:, 1], (πs - 1.96*se).ravel(), (πs + 1.96*se).ravel(),
                 alpha=.25, color='g')