US Medicare Payment Data - identifying providers with high payment totals

The US government recently released a data set containing all payments (partially aggregated) made by the Medicare program in 2012. For each health care provider, the billings are grouped into service types, and within each service type group, the total payment and standard deviation of payments are provided.

The data are available here

The official documentation for the data is here.

One use of these data is to identify providers who receive very high payments from the Medicare program. In some cases this may reflect fraud or waste. Here is a link to a media article about one such situation.

The dataset is too large to load into SageMathCloud (SMC), so we wrote a script to split the data by state. We ran this script off-line and uploaded a few of the state-level files into SMC.

If you are curious to see how some related analyses of these data can be done in R, check out this blog post.

First we import some libraries that we will be using.

In [1]:
import pandas as pd
import numpy as np

Next we read in the data from one state. The state-level data is small enough that it can be loaded entirely into memory.

In [2]:
fname = "FL-subset.csv.gz"
data = pd.read_csv(fname, compression="gzip")
(667995, 27)
npi                                   int64
nppes_provider_last_org_name         object
nppes_provider_first_name            object
nppes_provider_mi                    object
nppes_credentials                    object
nppes_provider_gender                object
nppes_entity_code                    object
nppes_provider_street1               object
nppes_provider_street2               object
nppes_provider_city                  object
nppes_provider_zip                    int64
nppes_provider_state                 object
nppes_provider_country               object
provider_type                        object
medicare_participation_indicator     object
place_of_service                     object
hcpcs_code                           object
hcpcs_description                    object
line_srvc_cnt                       float64
bene_unique_cnt                       int64
bene_day_srvc_cnt                     int64
average_Medicare_allowed_amt        float64
stdev_Medicare_allowed_amt          float64
average_submitted_chrg_amt          float64
stdev_submitted_chrg_amt            float64
average_Medicare_payment_amt        float64
stdev_Medicare_payment_amt          float64
dtype: object

Here we print out a small subset of the data to get a sense of what we are working with.

In [3]:
vnames = ["npi", "nppes_provider_last_org_name", "hcpcs_description", "line_srvc_cnt", "average_Medicare_payment_amt", "stdev_Medicare_payment_amt"]
npi nppes_provider_last_org_name hcpcs_description line_srvc_cnt average_Medicare_payment_amt stdev_Medicare_payment_amt
0 1003000381 BRAGANZA Pt evaluation 23 58.643913 0.638304
1 1003000381 BRAGANZA Electrical stimulation 69 12.378261 3.070669
2 1003000381 BRAGANZA Ultrasound therapy 137 8.767883 1.706447
3 1003000381 BRAGANZA Therapeutic exercises 433 22.297760 4.213327
4 1003000381 BRAGANZA Manual therapy 172 19.488837 4.387613

Let's look at the first record in more detail:

In [4]:
npi                                          1003000381
nppes_provider_last_org_name                   BRAGANZA
nppes_provider_first_name                        LUTHER
nppes_provider_mi                                     Q
nppes_credentials                                    PT
nppes_provider_gender                                 M
nppes_entity_code                                     I
nppes_provider_street1              134 N OLD DIXIE HWY
nppes_provider_street2                              NaN
nppes_provider_city                           LADY LAKE
nppes_provider_zip                            321594347
nppes_provider_state                                 FL
nppes_provider_country                               US
provider_type                        Physical Therapist
medicare_participation_indicator                      Y
place_of_service                                      O
hcpcs_code                                        97001
hcpcs_description                         Pt evaluation
line_srvc_cnt                                        23
bene_unique_cnt                                      23
bene_day_srvc_cnt                                    23
average_Medicare_allowed_amt                      73.47
stdev_Medicare_allowed_amt                            0
average_submitted_chrg_amt                     96.95652
stdev_submitted_chrg_amt                       5.051717
average_Medicare_payment_amt                   58.64391
stdev_Medicare_payment_amt                    0.6383044
Name: 0, dtype: object

Each row of the data set is an aggregate of several individual charges. The number of charges that are aggregated is given by line_srvc_cnt. The average and standard deviation of these charges are given by average_Medicare_payment_amt and stdev_Medicare_payment_amt.

The key value of interest here is the total amount of money paid by Medicare to each provider. This is not an explicit variable in the data set, but we can create it by multiplying the average payment by the number of payments, within each record of the dataset.

In [5]:
data["total_payment"] = data["line_srvc_cnt"] * data["average_Medicare_payment_amt"]

Medicare makes payments to both individual providers, and to organizations. Most of the payments are to individuals and it's not clear that the payment amounts to providers can be directly compared to the payment amounts to individuals. So here we include only the individuals.

In [6]:
data_ind = data.loc[data["nppes_entity_code"] == "I", :]
(636249, 28)
(667995, 28)

Payments to individual providers

The overall analysis follows the split-apply-combine model. We split the data by provider, apply a function to aggregate the payments (by summing them), and combine the results into a new Pandas DataFrame.

First we look at a couple of inefficient ways to do this that do not take advantage of the optimised split-apply-combine tools in Pandas. The following two functions are not executed since they are quite slow. Uncomment the function calls if you really want to run them.

In [7]:
def total_by_provider_naive(data):
    Returns a dictionary mapping each provider number to the total payment made to 
    the provider.
    ** Do not use, very slow **
    from collections import defaultdict
    totals = defaultdict(lambda : 0.)
    for i in data.index:
        totals[data.loc[i, "npi"]] += data.loc[i, "total_payment"] 
    return totals

# totals1 = total_by_provider_naive(data_ind)

Using position-based indexing doesn't help much:

In [8]:
def total_by_provider_naive_pos(data):
    Returns a dictionary mapping each provider number to the total payment made to 
    the provider.
    ** Do not use, very slow **
    from collections import defaultdict
    totals = defaultdict(lambda : 0.)
    for i in data.shape[0]:
        row = data.iloc[i, :]
        totals[row["npi"]] += row["total_payment"] 
    return totals

#totals2 = total_by_provider_naive_pos(data_ind)

Here is a much faster way to do the calculation. Our goal is to obtain the total payment for each proivider. We first group the data by the provider number, then aggregate within the groups by taking the sum of payment amounts. Finally we sort the result so that the providers with the greatest total payments appear at the top of the list.

In [9]:
totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum})
totals2 = totals2.sort("total_payment", ascending=False)
1245298371  20827340.740274
1033145487  18154815.829820
1922021195  10726482.210034
1821058496   8419176.929998
1790744746   6965575.780054

Next we have an enhanced version of the code in the previous cell that also includes the provider's last name in the output table. We cannot aggregate names by summing, so we provide two aggregation functions. For the payment data, we aggregate by taking the sum. For the provider name, we aggregate by selecting the first value within each group (assuming that the provider name is constant within groups, which it should be).

In [10]:
first = lambda x : x.iloc[0]

totals2 = data_ind.groupby("npi").agg({"total_payment": np.sum, "nppes_provider_last_org_name": first})
totals2 = totals2.sort("total_payment", ascending=False)
nppes_provider_last_org_name total_payment
1245298371 MELGEN 20827340.740274
1033145487 QAMAR 18154815.829820
1922021195 EATON 10726482.210034
1821058496 MALHOTRA 8419176.929998
1790744746 RASKAUSKAS 6965575.780054

One way to visualize the distribution of payments is to make a quantile plot, which is a plot of the sorted data against the index.

In [11]:
plt.xlabel("Provider rank", size=15)
plt.ylabel("Total payment", size=15)
plt.ylim(0, 1e6)
(0, 1000000.0)

Another familiar way to view a simple collection of numbers is using a histogram.

In [12]:
_ = plt.hist(np.asarray(totals2["total_payment"]), bins=100)

The distribution is very skewed, so we can limit the histogram to the providers with total payment less than $200,000.

In [13]:
x = np.asarray(totals2["total_payment"])
x = x[x < 2e5]
plt.hist(x, bins=50, color='lightblue')
plt.xlabel("Payment", size=15)
_ = plt.ylabel("Number of providers", size=15)


  • Determine the maximum total payment among providers within each zip code, then plot it as a histogram or quantile plot.

  • Determine the maximum total payment among providers of each provider type, then plot it as a histogram or quantile plot.