CMS.gov has made available a synthetic dataset of Medicare/Medicaid claims data. The objective is to allow data enterpreneurs to figure out interesting ways to use the data.
As noted in the Data Codebook (PDF), the data has been heavily anonymized and imputed, so any conclusions drawn from this dataset may not be reliable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import operator
import random
%matplotlib inline
We read the benefit summary data in. There are 59 files representing benefit summaries from 2008-2010. There are a total of 6,758,744 records. We copy these into a separate benefit_summary directory so all the files can be loaded into a single dataframe.
dfs = []
for csvfile in os.listdir("benefit_summary"):
dfs.append(pd.read_csv("benefit_summary/" + csvfile))
df = pd.concat(dfs)
df.shape
(6758685, 32)
# find the distribution of records by gender
df[["BENE_SEX_IDENT_CD"]].groupby("BENE_SEX_IDENT_CD").count()
BENE_SEX_IDENT_CD | |
---|---|
BENE_SEX_IDENT_CD | |
1 | 3002984 |
2 | 3755701 |
# find distribution of records by race
df[["BENE_RACE_CD"]].groupby("BENE_RACE_CD").count()
BENE_RACE_CD | |
---|---|
BENE_RACE_CD | |
1 | 5596084 |
2 | 719751 |
3 | 284654 |
5 | 158196 |
So it looks like the beneficiaries are split almost evenly by gender with men constituting about 45% of the dataset and the women constituting the other 55%.
Race-wise, whites constitute 83%, blacks 11%, other (asians, native americans, etc) 4% and hispanics 2%. This roughly reflects the race mix among older Americans.
# visualize incidence of diseases by gender
df_trunc = df[["BENE_SEX_IDENT_CD", "SP_ALZHDMTA", "SP_CHF",
"SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES",
"SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_SEX_IDENT_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()
BENE_SEX_IDENT_CD | 1 | 2 |
---|---|---|
SP_ALZHDMTA | 81.597904 | 79.388801 |
SP_CHF | 71.964619 | 69.425575 |
SP_CHRNKIDN | 83.974040 | 82.471981 |
SP_CNCR | 93.506725 | 93.474401 |
SP_COPD | 87.954048 | 86.703974 |
SP_DEPRESSN | 80.288973 | 77.570579 |
SP_DIABETES | 65.798020 | 62.108698 |
SP_ISCHMCHT | 59.353030 | 56.210438 |
SP_OSTEOPRS | 84.587930 | 82.606789 |
SP_RA_OA | 86.827069 | 85.115562 |
SP_STRKETIA | 96.092187 | 95.756877 |
Men lead in the incidence of disease in almost all cases - they show higher incidence of Alzheimer's Disease, CHF, Depression, Diabetes, Ischemic Heart Disease, Osteoporosis and Rheumatoid/Osteo-Arthritis. Men and women seem to have an equal chance of contracting Chronic Kidney Disease, Cancer, COPD, and Stroke. The data is shown visually below:
df_grouped.plot(kind="bar", legend=False)
<matplotlib.axes.AxesSubplot at 0x1f289bd0>
# visualize incidence of diseases by race
df_trunc = df[["BENE_RACE_CD", "SP_ALZHDMTA", "SP_CHF",
"SP_CHRNKIDN", "SP_CNCR", "SP_COPD", "SP_DEPRESSN", "SP_DIABETES",
"SP_ISCHMCHT", "SP_OSTEOPRS", "SP_RA_OA", "SP_STRKETIA"]]
df_grouped = df_trunc.groupby("BENE_RACE_CD").agg(lambda x: 100.0 * np.mean(x - 1))
df_grouped.transpose()
BENE_RACE_CD | 1 | 2 | 3 | 5 |
---|---|---|---|---|
SP_ALZHDMTA | 80.010754 | 81.348689 | 83.989334 | 82.127235 |
SP_CHF | 69.962656 | 72.319316 | 75.970828 | 73.681383 |
SP_CHRNKIDN | 82.826670 | 83.827324 | 86.614276 | 84.818200 |
SP_CNCR | 93.244848 | 94.377639 | 95.212785 | 94.970796 |
SP_COPD | 87.006378 | 87.860107 | 89.930231 | 88.671016 |
SP_DEPRESSN | 78.489047 | 79.644210 | 81.982688 | 79.309211 |
SP_DIABETES | 62.998876 | 66.469237 | 69.711650 | 67.132544 |
SP_ISCHMCHT | 56.630261 | 61.178935 | 64.912490 | 62.750638 |
SP_OSTEOPRS | 83.063121 | 85.223362 | 86.334287 | 85.459809 |
SP_RA_OA | 85.466444 | 87.533744 | 88.788494 | 87.581228 |
SP_STRKETIA | 95.832729 | 95.970690 | 96.892719 | 96.422160 |
As can be seen from the table above, the incidence of various diseases across different races seems to be fairly constant. The only exceptions are CHF which seems to be less prevalent among whites than other races, and Ischemic Heart Disease, which seem t obe more prevalent among whites than other races. The data is shown visually below:
df_grouped.plot(kind="bar", legend=False)
<matplotlib.axes.AxesSubplot at 0x5c26690>
We now examine observed comorbidities in the population by calculating the correlation between different pairs of diseases. As can be seen from the top 10 pairs below, the correlation between diseases appears to be fairly weak.
colnames = [x for x in df.columns if x.startswith("SP_") and not x.endswith("STATE_CODE")]
colidxs = {x[1]:x[0] for x in enumerate(colnames)}
CM = np.zeros((len(colnames), len(colnames)))
comorbidities = []
for colname1 in colnames:
for colname2 in colnames:
i = colidxs[colname1]
j = colidxs[colname2]
if i == j:
CM[i,j] = 1.0
continue
if i < j:
CM[i,j] = np.corrcoef(df.ix[:, colname1], df.ix[:, colname2])[0,1]
CM[j,i] = CM[i,j]
comorbidities.append((colname1, colname2, CM[i,j]))
continue
sorted(comorbidities, key=operator.itemgetter(2), reverse=True)[0:10]
[('SP_DIABETES', 'SP_ISCHMCHT', 0.47578541650029821), ('SP_CHF', 'SP_ISCHMCHT', 0.43615577919472531), ('SP_CHF', 'SP_DIABETES', 0.41850270374368082), ('SP_CHRNKIDN', 'SP_DIABETES', 0.39713288341420422), ('SP_CHF', 'SP_CHRNKIDN', 0.38213196506731339), ('SP_CHRNKIDN', 'SP_ISCHMCHT', 0.35058275391195598), ('SP_CHF', 'SP_COPD', 0.33059074013020173), ('SP_DEPRESSN', 'SP_DIABETES', 0.3199180438991816), ('SP_ALZHDMTA', 'SP_DIABETES', 0.31346118720405913), ('SP_CHRNKIDN', 'SP_COPD', 0.31327625784275792)]
The heatmap below summarizes the correlations between different chronic diseases graphically. Lighter colors indicate higher correlation. As expected, the lightest block is (7,8) which corresponds to SP_DIABETES and SP_ISCHMCHT.
fig = plt.figure()
plt.hot()
plt.pcolormesh(CM)
plt.colorbar()
<matplotlib.colorbar.Colorbar instance at 0x1d8cad40>