#!/usr/bin/env python
# coding: utf-8

# # Using SD models to understand the differences between population measures at varying levels of geographic disaggregation
# In this recipe, we will use data at a national level to infer parameters for a population aging model. We'll then try two different ways of using this trained model to understand the variation in behavior between the states.
#
# ## About this technique
# Firstly, we'll use the parameters fit at the national level to predict census data at the disaggregated (state) level, and compare these predicted state-level outputs with the measured values. This gives us a sense of how different the populations of the states are from what we would expect given our understanding of the national picture.
#
# Secondly, we'll fit the model's parameters separately to each state's measured census data, and then compare the fitted parameters to each other and to the national expectation. This is a helpful analysis if the parameter itself (and its variation between states) is what we find interesting.

# In[1]:

get_ipython().run_line_magic('pylab', 'inline')
import pandas as pd
import pysd
import scipy.optimize
import geopandas as gpd


# ## Ingredients
#
# ### Population data by age cohort
# We start with data from the decennial census years 2000 and 2010, for the male population by state and county. We have aggregated the data into ten-year age cohorts (with the last cohort representing everyone over 80 years old). The data collection task is described [here](data/Census/US Census Data Collection.ipynb). In this analysis we will only use data at the state and national levels.

# In[2]:

data = pd.read_csv('../../data/Census/Males by decade and county.csv',
                   header=[0, 1], index_col=[0, 1])
data.head()


# ### A model of an aging population
#
# The model we'll use to represent the population is a simple aging chain, with individuals aggregated into stocks by decade, to match the aggregation we used for the data above. Each cohort is promoted with a timescale of 10 years, and net in-migration, out-migration, and death are subsumed into the `loss` flow associated with each cohort. This loss is controlled by a yearly fraction that it will be our task to infer.
#
# Still Image of Vensim Aging Chain Model
# .. image:: ../../../source/models/Aging_Chain/Aging_Chain.png
#    :width: 600 px

# In[3]:

model = pysd.read_vensim('../../models/Aging_Chain/Aging_Chain.mdl')


# Our model is initially parameterized with 10 individuals in each stock, no births, and a uniform loss rate of 5%. We'll use data to set the initial conditions, and infer the loss rates. Estimating births is difficult, so for this analysis we'll pay attention only to individuals who were born before the year 2000.

# In[4]:

model.run().plot();
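# To make the stock-and-flow structure concrete, here is a minimal sketch of the aging-chain logic in plain Python. It is only an illustration under simplifying assumptions (Euler integration with a one-year step, and promotion out of every cohort, including the final 80+ stock); the model we actually use is defined by the equations in `Aging_Chain.mdl`, and the function and variable names below are illustrative rather than part of that model.

# In[ ]:

import numpy as np

def step_aging_chain(stocks, loss_rates, births=0.0, dt=1.0):
    """Advance nine decade-wide cohorts by one timestep of `dt` years (illustrative only)."""
    stocks = np.asarray(stocks, dtype=float)
    promotion = stocks / 10.0                               # each cohort matures over ten years
    losses = stocks * np.asarray(loss_rates, dtype=float)   # migration and death, per cohort
    inflow = np.concatenate(([births], promotion[:-1]))     # cohort i feeds cohort i+1
    return stocks + dt * (inflow - promotion - losses)

# the model's default setup: ten individuals per cohort, no births, 5% yearly loss
cohorts = [10.0] * 9
for _ in range(10):
    cohorts = step_aging_chain(cohorts, [0.05] * 9)
cohorts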
# ### Geography Information
#
# This information comes to us as a shapefile (`.shp`) with its associated `.dbf` and `.shx` conspirator files. Let's check the plotting functionality:

# In[5]:

state_geo = gpd.read_file('../../data/Census/US_State.shp')
state_geo.set_index('StateFIPSN', inplace=True)
state_geo.plot();
state_geo.head(2)


# ## Recipe Part A: Predict state-level values from national model fit
#
# ### Step 1: Initialize the model using census data
#
# We can aggregate the county level data to the national scale by summing across all geographies. This is relatively straightforward.

# In[6]:

country = data.sum()
country


# If we run the model using national data from the year 2000 as starting conditions, we can see how the cohorts develop, given our arbitrary loss rate values:

# In[7]:

model.run(return_timestamps=range(2000, 2011),
          initial_condition=(2000, country['2000'])).plot();


# ### Step 2: Fit the national level model to the remaining data
# We've used half of our data (from the year 2000 census) to initialize our model. Now we'll use an optimization routine to choose the loss rate parameters that best predict the 2010 census data. We'll use the same basic operations described in the previous recipe: [Fitting with Optimization](2_1_Fitting_with_Optimization.ipynb).
#
# To keep this simple, we'll write a function that takes a list of candidate model parameters and returns the model's prediction for the year 2010.

# In[8]:

def exec_model(paramlist):
    params = dict(zip(['dec_%i_loss_rate' % i for i in range(1, 10)], paramlist))
    output = model.run(initial_condition=(2000, country['2000']),
                       params=params, return_timestamps=2010)
    return output


# Now we'll define an error function that calls this executor and calculates a sum of squared errors value. Remember that because we don't have birth information, we'll only calculate error based upon age cohorts 2 through 9.

# In[9]:

def error(paramlist):
    output = exec_model(paramlist)
    errors = output - country['2010']
    # don't tally errors in the first cohort, as we don't have info about births
    return sum(errors.values[0, 1:]**2)


# Now we can use the `minimize` function from scipy to find a vector of parameters that brings the 2010 predictions into alignment with the data.

# In[10]:

res = scipy.optimize.minimize(error, x0=[.05]*9, method='L-BFGS-B')
country_level_fit_params = dict(zip(['dec_%i_loss_rate' % i for i in range(1, 10)], res['x']))
country_level_fit_params
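# As run above, the optimizer is free to choose negative loss rates, which would correspond to net in-migration into a cohort. If we instead wanted to keep the search inside a plausible range, `scipy.optimize.minimize` accepts box constraints for the L-BFGS-B method. The sketch below is an optional aside; the bound values are illustrative assumptions, and the rest of the recipe continues with the unconstrained fit above.

# In[ ]:

# illustrative bounds on each yearly loss rate: allow modest net in-migration,
# but cap losses at 25% per year
bounds = [(-0.1, 0.25)] * 9
res_bounded = scipy.optimize.minimize(error, x0=[.05]*9,
                                      method='L-BFGS-B', bounds=bounds)
res_bounded['x']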
# If we run the national model forwards with these parameters, we see generally good behavior, except for the 0-9 year demographic bracket, from whom we expect less self-control (and, more to the point, for whom we have no births data).

# In[11]:

model.run(params=country_level_fit_params,
          return_timestamps=range(2000, 2011),
          initial_condition=(2000, country['2000'])).plot();


# ### Step 3: Make state-level predictions
# If we want to look at the variation between the states and the national level, we can try making state-level predictions using state-specific initial conditions, but parameters fit at the national level.

# In[12]:

states = data.groupby(level=0).sum()
states.head()


# We can now generate a prediction by setting the model's initial conditions with state level data, and parameters fit in the national case. I've created a `model_runner` helper function to make the code easier to read, but this could be done in a single line if we so chose.

# In[13]:

def model_runner(row):
    result = model.run(params=country_level_fit_params,
                       initial_condition=(2000, row['2000']),
                       return_timestamps=2010,
                       return_columns=["dec_1", "dec_2", "dec_3",
                                       "dec_4", "dec_5", "dec_6",
                                       "dec_7", "dec_8", "dec_9"])
    return result.loc[2010]

state_predictions = states.apply(model_runner, axis=1)
state_predictions.head()


# ### Step 4: Compare model predictions with measured data
# Comparing the state level predictions with the actual data, we can see where our model most under- or over-predicts population for each region/cohort combination.

# In[14]:

diff = state_predictions - states['2010']
diff.head()


# This is a little easier to interpret if we normalize it by the actual measured population:

# In[15]:

diff_percent = (state_predictions - states['2010']) / states['2010']
diff_percent.head()


# ### Step 5: Merge with geo data to plot
#
# I'm using geopandas to manage the shapefiles, and it has its own (if somewhat limited) plotting functionality.

# In[16]:

geo_diff = state_geo.join(diff_percent)
geo_diff.plot(column='dec_4')
geo_diff.head()


# ## Recipe Part B: fit state-by-state models
#
# Now let's try optimizing the model's parameters to each state individually, and comparing the results with the national picture.
#
# ### Step 1: Write the optimization functions to account for the state
# We'll start as before with functions that run the model and compute the error (this time taking the census data for the state in question as an additional argument), and then add a function that optimizes and returns the best-fit parameters for each state.

# In[17]:

def exec_model(paramlist, state):
    params = dict(zip(['dec_%i_loss_rate' % i for i in range(1, 10)], paramlist))
    output = model.run(initial_condition=(2000, state['2000']),
                       params=params, return_timestamps=2010).loc[2010]
    return output

def error(paramlist, state):
    output = exec_model(paramlist, state)
    errors = output - state['2010']
    # don't tally errors in the first cohort, as we don't have info about births
    sse = sum(errors.values[1:]**2)
    return sse


# ### Step 2: Apply optimization to each state
# We can wrap the optimizer in a function that takes census information about each state and returns an optimized set of parameters for that state. If we apply it to the states dataframe, we get back a similar dataframe containing the optimized parameters.

# In[18]:

get_ipython().run_cell_magic('capture', '', "def optimize_params(row):\n    res = scipy.optimize.minimize(lambda x: error(x, row),\n                                  x0=[.05]*9,\n                                  method='L-BFGS-B')\n    return pd.Series(index=['dec_%i_loss_rate'%i for i in range(1,10)], data=res['x'])\n\nstate_fit_params = states.apply(optimize_params, axis=1)\nstate_fit_params.head()\n")


# ### Step 3: Merge with geographic data
# As we're looking at model parameters, which are themselves multiplied by populations to generate the actual flows of people, we can compare the parameters directly without needing to normalize.

# In[19]:

geo_diff = state_geo.join(state_fit_params)
geo_diff.plot(column='dec_4_loss_rate')
geo_diff.head(3)
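# As a quick look at how these state-level parameters compare with the national expectation, one simple option is to subtract the national fit from Part A from each state's fitted loss rates; this is a minimal sketch, reusing `country_level_fit_params` and `state_fit_params` from the cells above.

# In[ ]:

national_params = pd.Series(country_level_fit_params)
# positive values: that state's cohort loses people faster than the national fit suggests
param_diff = state_fit_params - national_params
param_diff.head()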