The Census Bureau publishes a margin of error (MOE) alongside each American Community (ACS) estimte. These MOEs give an understanding of how reliable that particulare estimate is. In this demonstration we show how to compute an MOE when multiple ACS estimates are combined.
import cenpy as cen
from cenpy.moe import analytic_utils
from cenpy.moe import pseudo_utils
%matplotlib inline
Identify some variables for the demonstration
cols_raw = ['B01003_001', # Total Population
'B03001_003', # Hispanic
'B19313_001', # Aggregate Income
'B15002_012', # Males with Some College, Less Than 1 Year
'B15002_013', # Males with Some College, 1 Or More Years, No Degree
'B15002_014'] # Males with Associate's Degree
We will need the published estimates and margins of error
# Estimates end in an "E" and MOEs end in an "M"
cols_E = [i+'E' for i in cols_raw]
cols_M = [i+'M' for i in cols_raw]
cols_E
['B01003_001E', 'B03001_003E', 'B19313_001E', 'B15002_012E', 'B15002_013E', 'B15002_014E']
cols_M
['B01003_001M', 'B03001_003M', 'B19313_001M', 'B15002_012M', 'B15002_013M', 'B15002_014M']
We will pull data from the 2014-2018 detailed tables from the ACS
api_database = 'ACSDT5Y2018'
api_conn = cen.remote.APIConnection(api_database)
cen.explorer.explain(api_database)
{'American Community Survey: 1-Year Estimates: Detailed Tables 5-Year': 'The American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population. Summary files include the following geographies: nation, all states (including DC and Puerto Rico), all metropolitan areas, all congressional districts (114th congress), all counties, all places, and all tracts and block groups. Summary files contain the most detailed cross-tabulations, many of which are published down to block groups. The data are population and housing counts. There are over 64,000 variables in this dataset.'}
The demostration will be for all census tracts in Coconino County, Arizona
g_unit = 'tract'
g_filter = {'state':'04', 'county':'005'} # Coconino County, Arizona
With all the pieces in place, we can now pull the data from the Census API
data_E = api_conn.query(cols_E, geo_unit=g_unit, geo_filter=g_filter)
data_M = api_conn.query(cols_M, geo_unit=g_unit, geo_filter=g_filter)
data_E[cols_E] = data_E[cols_E].astype('int') # convert downloaded data to ints
data_M[cols_M] = data_M[cols_M].astype('int') # convert downloaded data to ints
Here is what the estimates look like
data_E
B01003_001E | B03001_003E | B19313_001E | B15002_012E | B15002_013E | B15002_014E | state | county | tract | |
---|---|---|---|---|---|---|---|---|---|
0 | 4611 | 618 | 155911000 | 63 | 201 | 101 | 04 | 005 | 001101 |
1 | 8339 | 1989 | 200765900 | 197 | 390 | 198 | 04 | 005 | 001102 |
2 | 5193 | 598 | 182234100 | 205 | 297 | 90 | 04 | 005 | 001301 |
3 | 5633 | 845 | 162323000 | 170 | 419 | 97 | 04 | 005 | 001302 |
4 | 6735 | 824 | 322765600 | 49 | 262 | 337 | 04 | 005 | 002200 |
5 | 5891 | 641 | 186444600 | 305 | 838 | 169 | 04 | 005 | 002300 |
6 | 3795 | 58 | 56610200 | 28 | 139 | 56 | 04 | 005 | 942201 |
7 | 3467 | 55 | 42392000 | 38 | 106 | 15 | 04 | 005 | 942202 |
8 | 4726 | 24 | 75087400 | 43 | 221 | 123 | 04 | 005 | 944900 |
9 | 4438 | 108 | 47734200 | 64 | 147 | 76 | 04 | 005 | 945000 |
10 | 3943 | 67 | 52120200 | 44 | 144 | 64 | 04 | 005 | 945100 |
11 | 4499 | 72 | 71813000 | 35 | 245 | 103 | 04 | 005 | 945200 |
12 | 6959 | 1076 | 257579300 | 31 | 476 | 311 | 04 | 005 | 000900 |
13 | 3850 | 636 | 153190500 | 70 | 261 | 73 | 04 | 005 | 000200 |
14 | 6438 | 2440 | 155474100 | 149 | 291 | 248 | 04 | 005 | 000300 |
15 | 6122 | 1302 | 99347200 | 172 | 237 | 158 | 04 | 005 | 000800 |
16 | 3517 | 390 | 159329400 | 46 | 170 | 93 | 04 | 005 | 000100 |
17 | 2720 | 256 | 112123900 | 59 | 120 | 74 | 04 | 005 | 001200 |
18 | 2903 | 254 | 109610200 | 73 | 206 | 45 | 04 | 005 | 001500 |
19 | 3706 | 647 | 129996900 | 44 | 144 | 97 | 04 | 005 | 000700 |
20 | 5414 | 993 | 142256900 | 111 | 115 | 144 | 04 | 005 | 000400 |
21 | 3619 | 916 | 90293300 | 108 | 209 | 130 | 04 | 005 | 000500 |
22 | 3475 | 1054 | 85740300 | 149 | 213 | 110 | 04 | 005 | 001700 |
23 | 7547 | 707 | 160342600 | 163 | 483 | 155 | 04 | 005 | 002100 |
24 | 12368 | 2344 | 55454100 | 7 | 37 | 32 | 04 | 005 | 001000 |
25 | 5262 | 302 | 218984700 | 111 | 583 | 29 | 04 | 005 | 000600 |
26 | 3180 | 276 | 157001500 | 89 | 125 | 140 | 04 | 005 | 001600 |
27 | 1867 | 174 | 57021800 | 38 | 189 | 40 | 04 | 005 | 002000 |
...and the MOEs
data_M
B01003_001M | B03001_003M | B19313_001M | B15002_012M | B15002_013M | B15002_014M | state | county | tract | |
---|---|---|---|---|---|---|---|---|---|
0 | 406 | 170 | 22714480 | 42 | 64 | 52 | 04 | 005 | 001101 |
1 | 615 | 345 | 25521201 | 129 | 170 | 118 | 04 | 005 | 001102 |
2 | 599 | 255 | 22397589 | 95 | 114 | 52 | 04 | 005 | 001301 |
3 | 681 | 200 | 18576594 | 86 | 140 | 66 | 04 | 005 | 001302 |
4 | 635 | 209 | 63593535 | 57 | 128 | 141 | 04 | 005 | 002200 |
5 | 740 | 234 | 23082343 | 132 | 221 | 87 | 04 | 005 | 002300 |
6 | 304 | 37 | 5737985 | 16 | 35 | 23 | 04 | 005 | 942201 |
7 | 321 | 38 | 4566029 | 17 | 29 | 13 | 04 | 005 | 942202 |
8 | 517 | 28 | 10494052 | 35 | 101 | 58 | 04 | 005 | 944900 |
9 | 302 | 60 | 5082556 | 33 | 44 | 35 | 04 | 005 | 945000 |
10 | 316 | 33 | 4527569 | 22 | 37 | 22 | 04 | 005 | 945100 |
11 | 505 | 61 | 8895939 | 32 | 92 | 52 | 04 | 005 | 945200 |
12 | 575 | 318 | 45061154 | 37 | 233 | 195 | 04 | 005 | 000900 |
13 | 410 | 252 | 19648625 | 41 | 91 | 56 | 04 | 005 | 000200 |
14 | 603 | 543 | 22955692 | 79 | 92 | 132 | 04 | 005 | 000300 |
15 | 646 | 255 | 13744500 | 91 | 99 | 68 | 04 | 005 | 000800 |
16 | 277 | 167 | 17873519 | 37 | 59 | 50 | 04 | 005 | 000100 |
17 | 202 | 134 | 14820577 | 41 | 87 | 67 | 04 | 005 | 001200 |
18 | 394 | 160 | 20598280 | 58 | 98 | 40 | 04 | 005 | 001500 |
19 | 372 | 237 | 17359157 | 34 | 60 | 68 | 04 | 005 | 000700 |
20 | 409 | 284 | 15198039 | 69 | 53 | 78 | 04 | 005 | 000400 |
21 | 400 | 237 | 12601177 | 76 | 95 | 68 | 04 | 005 | 000500 |
22 | 131 | 165 | 9039383 | 66 | 67 | 56 | 04 | 005 | 001700 |
23 | 19 | 380 | 19269569 | 106 | 164 | 114 | 04 | 005 | 002100 |
24 | 1302 | 362 | 8854217 | 10 | 37 | 30 | 04 | 005 | 001000 |
25 | 455 | 135 | 36625917 | 96 | 216 | 31 | 04 | 005 | 000600 |
26 | 152 | 177 | 33347773 | 59 | 57 | 98 | 04 | 005 | 001600 |
27 | 413 | 132 | 14479669 | 48 | 86 | 39 | 04 | 005 | 002000 |
Three options are available in cenpy:
a. Analytic
b. Pseudo
c. Replicate
The analytic approach uses classic statistical theory to compute the MOEs.
In this first example, we sum three columns to get the unpublished value for males with more than high school diploma, but less than a bachelors degree.
# Summing values
results_sum = analytic_utils.analytic_sum(data_E[['B15002_012E','B15002_013E','B15002_014E']],
data_M[['B15002_012M','B15002_013M','B15002_014M']])
results_sum
est | moe | |
---|---|---|
0 | 365 | 92.541882 |
1 | 785 | 243.854465 |
2 | 592 | 157.241852 |
3 | 686 | 177.064960 |
4 | 648 | 198.781287 |
5 | 1312 | 271.724125 |
6 | 223 | 44.833024 |
7 | 159 | 36.041643 |
8 | 387 | 121.614144 |
9 | 287 | 65.192024 |
10 | 252 | 48.342528 |
11 | 383 | 110.417390 |
12 | 818 | 306.076788 |
13 | 404 | 114.446494 |
14 | 688 | 179.245642 |
15 | 567 | 150.685102 |
16 | 309 | 85.732141 |
17 | 253 | 117.213480 |
18 | 324 | 120.697970 |
19 | 285 | 96.850400 |
20 | 370 | 116.850332 |
21 | 447 | 139.373599 |
22 | 472 | 109.457754 |
23 | 801 | 226.115015 |
24 | 76 | 48.672374 |
25 | 723 | 238.396728 |
26 | 354 | 127.804538 |
27 | 267 | 105.929222 |
Second, we compute per capita income, which is a ratio of aggregate income to total population.
# Ratio
results_ratio = analytic_utils.analytic_ratio(data_E[['B19313_001E','B01003_001E']],
data_M[['B19313_001M','B01003_001M']])
results_ratio
est | moe | |
---|---|---|
0 | 33812.838864 | 5755.941612 |
1 | 24075.536635 | 3538.230172 |
2 | 35092.258810 | 5914.981952 |
3 | 28816.438843 | 4797.096911 |
4 | 47923.622866 | 10467.668017 |
5 | 31649.057885 | 5581.939327 |
6 | 14917.048748 | 1927.167003 |
7 | 12227.285838 | 1736.695411 |
8 | 15888.150656 | 2819.843781 |
9 | 10755.790897 | 1359.142260 |
10 | 13218.412376 | 1562.277944 |
11 | 15961.991554 | 2668.317933 |
12 | 37013.838195 | 7161.150235 |
13 | 39789.740260 | 6633.342759 |
14 | 24149.440820 | 4222.569512 |
15 | 16227.899379 | 2823.603357 |
16 | 45302.644299 | 6209.514257 |
17 | 41222.022059 | 6249.847617 |
18 | 37757.561144 | 8752.544168 |
19 | 35077.415003 | 5859.853587 |
20 | 26275.748061 | 3438.087137 |
21 | 24949.792760 | 4441.686777 |
22 | 24673.467626 | 2762.555735 |
23 | 21245.872532 | 2553.835527 |
24 | 4483.675614 | 857.494397 |
25 | 41616.248575 | 7835.640446 |
26 | 49371.540881 | 10748.973683 |
27 | 30541.938939 | 10285.681097 |
Finally, we compute the proportion Hispanic residents are of the total population.
# Proportion
results_prop = analytic_utils.analytic_prop(data_E[['B03001_003E','B01003_001E']],
data_M[['B03001_003M','B01003_001M']])
results_prop
est | moe | |
---|---|---|
0 | 0.134027 | 0.034929 |
1 | 0.238518 | 0.037446 |
2 | 0.115155 | 0.047274 |
3 | 0.150009 | 0.030524 |
4 | 0.122346 | 0.028808 |
5 | 0.108810 | 0.037296 |
6 | 0.015283 | 0.009672 |
7 | 0.015864 | 0.010862 |
8 | 0.005078 | 0.005899 |
9 | 0.024335 | 0.013418 |
10 | 0.016992 | 0.008258 |
11 | 0.016004 | 0.013439 |
12 | 0.154620 | 0.043874 |
13 | 0.165195 | 0.063046 |
14 | 0.379000 | 0.076509 |
15 | 0.212676 | 0.035091 |
16 | 0.110890 | 0.046674 |
17 | 0.094118 | 0.048766 |
18 | 0.087496 | 0.053821 |
19 | 0.174582 | 0.061502 |
20 | 0.183413 | 0.050594 |
21 | 0.253109 | 0.059212 |
22 | 0.303309 | 0.046085 |
23 | 0.093680 | 0.050351 |
24 | 0.189521 | 0.021416 |
25 | 0.057393 | 0.025171 |
26 | 0.086792 | 0.055506 |
27 | 0.093198 | 0.067629 |
The pseudo approach generates random draws from a distribution around each published estimate built using the published estimate and MOE. These random draws are used to compute the MOEs on the combined estimates.
Following the example using the analytic approach, we again compute the sum of the number of males with more than a high school diploma but less than a college degree. First, we need to construct a function returning the sum of the ests
arguement in the pseudo
function.
def get_sum(ests):
return ests.sum(axis=1)
Now we pass our get_sum
function to the func
parameter along with the published estiamtes and margains of error we used above to pseudo
. This will return the margain of errors for the summed estimates derived usign the pseudo method.
#Summing
pseudo_results_sum = pseudo_utils.pseudo(get_sum,
ests=data_E[['B15002_012E','B15002_013E','B15002_014E']],
moes=data_M[['B15002_012M','B15002_013M','B15002_014M']])
pseudo_results_sum
est | moe | |
---|---|---|
0 | 365 | 84.735357 |
1 | 785 | 250.007884 |
2 | 592 | 140.238331 |
3 | 686 | 167.853122 |
4 | 648 | 181.748614 |
5 | 1312 | 272.964043 |
6 | 223 | 41.073046 |
7 | 159 | 37.448461 |
8 | 387 | 120.257335 |
9 | 287 | 64.421342 |
10 | 252 | 48.405048 |
11 | 383 | 106.709926 |
12 | 818 | 295.366184 |
13 | 404 | 122.884843 |
14 | 688 | 187.736771 |
15 | 567 | 150.526368 |
16 | 309 | 89.161340 |
17 | 253 | 121.917382 |
18 | 324 | 123.955399 |
19 | 285 | 89.015580 |
20 | 370 | 116.621658 |
21 | 447 | 148.584868 |
22 | 472 | 116.897372 |
23 | 801 | 253.655756 |
24 | 76 | 42.377793 |
25 | 723 | 229.177557 |
26 | 354 | 128.139737 |
27 | 267 | 106.464339 |
Next we will demonstrate how to derive the pseudo MOEs for a ratio calculation. Again, we create a function, get_div
, to pass to pseudo
.
def get_div(ests):
return ests.iloc[:,0] / (ests.iloc[:,1]*1.0)
We now pass the function to pseudo
as done in the summing example.
#Ratio
pseudo_results_ratio = pseudo_utils.pseudo(get_div,
ests=data_E[['B19313_001E','B01003_001E']],
moes=data_M[['B19313_001M','B01003_001M']])
pseudo_results_ratio
est | moe | |
---|---|---|
0 | 33812.838864 | 5588.264669 |
1 | 24075.536635 | 3486.556631 |
2 | 35092.258810 | 5461.779371 |
3 | 28816.438843 | 5061.058406 |
4 | 47923.622866 | 9465.400697 |
5 | 31649.057885 | 5534.522928 |
6 | 14917.048748 | 2043.322715 |
7 | 12227.285838 | 1854.067618 |
8 | 15888.150656 | 2823.545890 |
9 | 10755.790897 | 1287.138719 |
10 | 13218.412376 | 1464.332840 |
11 | 15961.991554 | 2871.830815 |
12 | 37013.838195 | 6493.285858 |
13 | 39789.740260 | 6507.534991 |
14 | 24149.440820 | 4240.028495 |
15 | 16227.899379 | 2787.390325 |
16 | 45302.644299 | 5949.813275 |
17 | 41222.022059 | 6104.126699 |
18 | 37757.561144 | 7763.523896 |
19 | 35077.415003 | 5580.745826 |
20 | 26275.748061 | 3703.960585 |
21 | 24949.792760 | 4134.406837 |
22 | 24673.467626 | 2625.090636 |
23 | 21245.872532 | 2858.229990 |
24 | 4483.675614 | 907.514745 |
25 | 41616.248575 | 8763.624949 |
26 | 49371.540881 | 9494.487316 |
27 | 30541.938939 | 10340.741340 |
To produce replicable results, the seed option can be set. This parameter makes a call to numpy.random.seed
. The default setting is seed=None
and will produce variable results for the MOEs with each call.
#Ratio w/ set seed
pseudo_results_ratio_w_seed = pseudo_utils.pseudo(get_div,
ests=data_E[['B19313_001E','B01003_001E']],
moes=data_M[['B19313_001M','B01003_001M']],
seed=123)
pseudo_results_ratio_w_seed
est | moe | |
---|---|---|
0 | 33812.838864 | 5199.547268 |
1 | 24075.536635 | 3099.242861 |
2 | 35092.258810 | 5456.081228 |
3 | 28816.438843 | 5159.557118 |
4 | 47923.622866 | 10679.994625 |
5 | 31649.057885 | 5280.682360 |
6 | 14917.048748 | 2124.933063 |
7 | 12227.285838 | 1743.989476 |
8 | 15888.150656 | 2542.562317 |
9 | 10755.790897 | 1277.229939 |
10 | 13218.412376 | 1603.675747 |
11 | 15961.991554 | 2730.129336 |
12 | 37013.838195 | 7144.187530 |
13 | 39789.740260 | 6695.631941 |
14 | 24149.440820 | 4223.362927 |
15 | 16227.899379 | 2663.508198 |
16 | 45302.644299 | 5497.257798 |
17 | 41222.022059 | 5937.705580 |
18 | 37757.561144 | 8643.619015 |
19 | 35077.415003 | 5431.386916 |
20 | 26275.748061 | 3082.989079 |
21 | 24949.792760 | 4105.625732 |
22 | 24673.467626 | 2643.484526 |
23 | 21245.872532 | 3141.699151 |
24 | 4483.675614 | 780.529335 |
25 | 41616.248575 | 7669.268454 |
26 | 49371.540881 | 11278.086226 |
27 | 30541.938939 | 9370.148160 |
The number of simulations run to produce the results is also adjustable through the sims
parameter. Increasing the number of simulations results in more stable MOE values if the algorithm is run multiple times, but takes longer to run. The default setting is sims=100
.
#Ratio w/ set number of sims
pseudo_results_ratio_w_sims = pseudo_utils.pseudo(get_div,
ests=data_E[['B19313_001E','B01003_001E']],
moes=data_M[['B19313_001M','B01003_001M']],
sims=1000)
pseudo_results_ratio_w_sims
est | moe | |
---|---|---|
0 | 33812.838864 | 5920.022199 |
1 | 24075.536635 | 3486.954777 |
2 | 35092.258810 | 5891.939088 |
3 | 28816.438843 | 4746.521858 |
4 | 47923.622866 | 10589.112366 |
5 | 31649.057885 | 5628.217362 |
6 | 14917.048748 | 1951.230050 |
7 | 12227.285838 | 1748.012442 |
8 | 15888.150656 | 2898.571655 |
9 | 10755.790897 | 1359.418504 |
10 | 13218.412376 | 1594.214628 |
11 | 15961.991554 | 2683.506220 |
12 | 37013.838195 | 7080.390284 |
13 | 39789.740260 | 6418.354355 |
14 | 24149.440820 | 4211.844071 |
15 | 16227.899379 | 2791.605432 |
16 | 45302.644299 | 6148.847893 |
17 | 41222.022059 | 6114.515450 |
18 | 37757.561144 | 8853.231000 |
19 | 35077.415003 | 5962.098206 |
20 | 26275.748061 | 3413.756703 |
21 | 24949.792760 | 4305.782054 |
22 | 24673.467626 | 2653.020760 |
23 | 21245.872532 | 2623.506060 |
24 | 4483.675614 | 870.426504 |
25 | 41616.248575 | 8172.820273 |
26 | 49371.540881 | 10552.436291 |
27 | 30541.938939 | 10801.993313 |
The internal algorithm used to compute the MOE simulates estimates from a normal distribution, which means that fractions of people (or housing units) will be simulated. Since this is not realisitc, whole=True
can be set to always generate whole values. Setting whole=False
(the default) means that the distribution of simulated values we more closely match the normal distibution from which the simulatations were drawn.
#Sum MOE generated from whole people
pseudo_results_sum_w_whole = pseudo_utils.pseudo(get_sum,
ests=data_E[['B15002_012E','B15002_013E','B15002_014E']],
moes=data_M[['B15002_012M','B15002_013M','B15002_014M']],
whole=True)
pseudo_results_sum_w_whole
est | moe | |
---|---|---|
0 | 365 | 95.561324 |
1 | 785 | 263.578440 |
2 | 592 | 150.353658 |
3 | 686 | 173.452135 |
4 | 648 | 213.159069 |
5 | 1312 | 289.587681 |
6 | 223 | 45.635722 |
7 | 159 | 34.752764 |
8 | 387 | 109.892965 |
9 | 287 | 64.969844 |
10 | 252 | 50.101928 |
11 | 383 | 110.514091 |
12 | 818 | 336.864738 |
13 | 404 | 107.684536 |
14 | 688 | 181.048703 |
15 | 567 | 148.519262 |
16 | 309 | 89.188083 |
17 | 253 | 115.830401 |
18 | 324 | 130.379528 |
19 | 285 | 109.033326 |
20 | 370 | 98.904817 |
21 | 447 | 125.572913 |
22 | 472 | 104.921034 |
23 | 801 | 236.569068 |
24 | 76 | 50.355092 |
25 | 723 | 223.743146 |
26 | 354 | 130.769898 |
27 | 267 | 109.245031 |