This notebook demonstrates several methods for survival analysis that are available in Statsmodels, using the NHANES III study data.

In [1]:

```
import statsmodels
import numpy as np
import pandas as pd
import statsmodels.api as sm
import os
import matplotlib.pyplot as plt
%matplotlib inline
```

NHANES is a cross sectional study (subjects are assessed at one time point and are never re-assessed by NHANES). The NHANES study team periodically uses social security and other sources of death records to determine which study subjects are still alive at a given follow up point. Here we will use the mortality data from 2011. People who are died prior to 2011 should have a date of death in these mortality files. People who are alive in 2011 will be considered to be censored in the survival analyses.

The mortality data are available at the link below as a fixed width format (FWF) file. The file to retrieve is called "NHANES_III_MORT_2011_PUBLIC.dat".

http://www.cdc.gov/nchs/data-linkage/mortality-public.htm

We downloaded the mortality data and gzipped it. To read a FWF file we need a specification defining which values appear in each column of the data file. This is contained in the SAS script provided with the data:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/datalinkage/linked_mortality/SAS-Read-in-Program-All-Surveys.sas

Copy lines 119 to 141 from the SAS program and place it into a text file called "mortality_fields.txt". The following utility function "get_colspecs" will read this file and use it to produce a list of start/end points for the pandas `read_fwf`

function to use when reading the mortality data file.

In [2]:

```
def get_colspecs(fname):
fields = []
fid = open(fname)
columns = []
colspecs = []
for line in fid:
v = line.rstrip().split()
columns.append(v[0])
u = v[1].split("-")
if len(u) == 1:
u = [int(u[0]), int(u[0])]
else:
u = [int(u[0]), int(u[1])]
u[0] -= 1
colspecs.append(u)
return columns, colspecs
```

Now we are ready to read the mortality information:

In [3]:

```
columns, colspecs = get_colspecs("mortality_fields.txt")
fname = "NHANES_III_MORT_2011_PUBLIC.dat.gz"
mort = pd.read_fwf(fname, colspecs=colspecs, header=None, names=columns,
compression="gzip")
```

The data dictionary for the mortality data is here:

http://www.cdc.gov/nchs/data/datalinkage/public_use_data_dictionary_11_17_2015.pdf

The only variables we will use here are "PERMTH_INT", which is the time duration from the NHANES clinical assessment to the last follow up time, and "MORTSTAT", which indicates whether the follow up interval ends with death or censoring. We will also need the "SEQN" variable to erge the mortality and assessment data. Therefore we wil drop the other columns, and drop the cases with missing data on these key variables.

In [4]:

```
mort = mort[["MORTSTAT", "PERMTH_INT", "SEQN"]].dropna()
```

We will also need the data from the NHANES 3 assessment. The data and codebooks are available here:

http://www.cdc.gov/nchs/nhanes/nh3data.htm

In the analyses below we will use the adult data. As with the mortality data considered above, this is a fixed-width format file. The column specifications are given in the SAS file that can be downloaded from the above link.

In [5]:

```
columns, colspecs = get_colspecs("adult_fields.txt")
fname = "adult.dat.gz"
data = pd.read_fwf(fname, colspecs=colspecs, header=None, names=columns,
compression="gzip")
```

Now we merge the mortality and visit data.

In [6]:

```
df = pd.merge(data, mort, left_on="SEQN", right_on="SEQN")
print(mort.shape)
print(data.shape)
print(df.shape)
```

Next we drop the columns that we will not use below, and drop cases with missing data.

In [7]:

```
df = df[["PERMTH_INT", "MORTSTAT", "HSAGEIR", "HSSEX", "HAZA8AK1", "SEQN"]].dropna()
```

Just for orientation, a histogram of age values is shown below.

In [8]:

```
plt.hist(df.HSAGEIR, color='grey')
plt.xlabel("Age (years)", size=16)
plt.ylabel("Frequency", size=16)
```

Out[8]:

We can now create a survival function object. To do this we need to provide the duration values ("PERMTH_INT") and the mortality status variable ("MORTSTAT"). The "summary" method of the survival function returns a dataframe containing the Kaplan-Meier estimates of the survival function at its steps. For any survival analysis it is important to define time zero in a meaningful way. There are various ways to do this here, but one approach suitable here is to select people in a narrow age range, say 50-55. This means we are looking at the survival function for one "birth cohort" -- people who were roughly age 52.5 when the NHANES III assessmsents took place.

In [9]:

```
ii = (df.HSAGEIR >= 50) & (df.HSAGEIR < 55)
df = df.loc[ii]
```

In [10]:

```
sf1 = sm.duration.SurvfuncRight(df["PERMTH_INT"], df["MORTSTAT"], "")
sf1.summary().head()
```

Out[10]:

The Kaplan-Meier estimate is the most common estimator of the survival function. We plot that next. The "+" signs indicate censoring. Since NHANES III took place in the 1980's, all subjects have at least 12 years of follow-up as of 2011.

In [11]:

```
sf1.plot()
plt.grid(True)
plt.ylim(0.6, 1)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
```

Out[11]:

Next we plot the survival function together with point-wise standard errors. . Since the NHANES sample size is large, the standard errors are small. This means that we have estimated the survival function quite precisely.

In [12]:

```
plt.grid(True)
plt.fill_between(sf1.surv_times, sf1.surv_prob - 2*sf1.surv_prob_se, sf1.surv_prob + 2*sf1.surv_prob_se, color="lightgrey")
plt.plot(sf1.surv_times, sf1.surv_prob, '-', color='black')
plt.ylim(0.7, 1)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
```

Out[12]:

A more rigorous way to assess the uncertainty in the survival function estimate is to use simultaneous confidence bands rather than pointwise confidence bands. There are several methods in the literature for doing this. The standard methods do not give very meaningful results for survival times close to the extreme values of oberved durations (0 and 270 here).

In [13]:

```
lcb, ucb = sf1.simultaneous_cb(transform='arcsin')
plt.grid(True)
plt.plot(sf1.surv_times, sf1.surv_prob, '-', color='black')
plt.fill_between(sf1.surv_times, lcb, ucb, color='lightgrey')
```

Out[13]:

The inverse of the survival function is the survival quantile function. It tells us the survival time corresponding to each probability point (e.g. what is the 10th percentile of survival times). These estimates may be undefined for probabiity points that are "not reached". For example, in the overall NHANES sample the median survival time is not reached.

The first plot below shows the survival quantles up to the 20th percentile, which is the highest probability point that is reached in these data.

In [14]:

```
qp = np.linspace(0.01, 0.19, 20)
sq = [sf1.quantile(x) for x in qp]
plt.clf()
plt.grid(True)
plt.plot(qp, sq, '-')
plt.xlabel("Probability", size=16)
plt.ylabel("Survival quantile", size=16)
```

Out[14]:

We can put pointwise confidence bands around the survival quantile curves.

In [15]:

```
ci = [sf1.quantile_ci(x) for x in qp]
ci = np.asarray(ci)
plt.clf()
plt.grid(True)
plt.fill_between(qp, ci[:, 0], ci[:, 1], color='lightgrey')
plt.plot(qp, sq, '-', color='black')
plt.xlabel("Probability", size=16)
plt.ylabel("Survival quantile", size=16)
```

Out[15]:

To illustrate the methods for comparing groups, we will compare survival curves for women and men. The gender information is in the "HSSEX" variable. HSSEX=1 indicates males and HSSEX=2 indicates females.

In [16]:

```
ii = (df.HSSEX==1)
sfm = sm.duration.SurvfuncRight(df.loc[ii, "PERMTH_INT"], df.loc[ii, "MORTSTAT"], "Male")
ii = (df.HSSEX==2)
sff = sm.duration.SurvfuncRight(df.loc[ii, "PERMTH_INT"], df.loc[ii, "MORTSTAT"], "Female")
```

We can now plot the survival function estimates for males and females (in the 50-60 age range) on the same axes.

In [17]:

```
ax = plt.gca()
plt.grid(True)
sfm.plot(ax)
sff.plot(ax)
plt.xlabel("Time (months)", size=16)
plt.ylabel("Proportion alive", size=16)
plt.ylim(0.5, 1)
ha, lb = ax.get_legend_handles_labels()
plt.legend(ha, lb, loc="lower left")
```

Out[17]:

In [18]:

```
chisq, pval = sm.duration.survdiff(df.PERMTH_INT, df.MORTSTAT, df.HSSEX)
print(chisq, pval)
```

To move beyond group-wise comparisons, we can use regression analysis. There are several methods of regression analysis that are suitable for data with censoring. The most commonly used in practice is the proportional hazards regression model, or "Cox model". Below we run a very simple proportional hazards regression model using gender as a predictor of mortality.

In [19]:

```
model1 = sm.PHReg(df.PERMTH_INT, df.HSSEX, status=df.MORTSTAT)
result1 = model1.fit()
result1.summary()
```

Out[19]:

Next we will fit some models including blood pressure as a predictor of mortality. The blood pressure variable has a special code 888 indicating missing values. Below we remove all records with these values from the data set, and check the distribution of the remaining values.

In [20]:

```
ii = (df.HAZA8AK1 != 888)
df = df.loc[ii]
plt.hist(df.HAZA8AK1.values)
plt.xlabel("SBP", size=16)
plt.ylabel("Frequency", size=16)
```

Out[20]:

Next we use a formula to fit a regression model including main effects and interactions for blood pressure and sex. Note that the proportional hazards regression models must not have an intercept, so we add "0 +" to the formula.

In [21]:

```
model2 = sm.PHReg.from_formula("PERMTH_INT ~ 0 + HAZA8AK1*HSSEX", status="MORTSTAT", data=df)
result2 = model2.fit()
result2.summary()
```

Out[21]:

Since the blood pressure effect may not be linear, we can try modeling it using splines.

In [22]:

```
model3 = sm.PHReg.from_formula("PERMTH_INT ~ 0 + bs(HAZA8AK1, 4) + HSSEX", status="MORTSTAT", data=df)
result3 = model3.fit()
result3.summary()
```

Out[22]:

We can use the `predict_functional`

function to graph the relationship between blood pressure and the mortality hazard. The association is U-shaped -- people with very low or very high blood pressure are both at higher risk of death. The risk of death for people with low blood pressure could be due to confounding with other factors that induce both low blood pressure and death.

In [23]:

```
from statsmodels.sandbox.predict_functional import predict_functional
plt.clf()
plt.grid(True)
sexl = {1: "Male", 2: "Female"}
for sex in 1,2:
pr, cb, fv = predict_functional(result3, "HAZA8AK1", values={"HSSEX": sex})
plt.plot(fv, pr.predicted_values, label=sexl[sex])
ha, lb = plt.gca().get_legend_handles_labels()
plt.legend(ha, lb, loc="upper left")
plt.xlabel("SBP", size=16)
plt.ylabel("Log hazard ratio", size=16)
```

Out[23]: