Introduction to cenpy American Community Survey margin of error tools¶

The Census Bureau publishes a margin of error (MOE) alongside each American Community (ACS) estimte. These MOEs give an understanding of how reliable that particulare estimate is. In this demonstration we show how to compute an MOE when multiple ACS estimates are combined.

In [2]:

import cenpy as cen
from cenpy.moe import analytic_utils
from cenpy.moe import pseudo_utils
%matplotlib inline

1. Pull data from the Census API¶

Identify some variables for the demonstration

In [3]:

cols_raw = ['B01003_001', # Total Population
            'B03001_003', # Hispanic
            'B19313_001', # Aggregate Income
            'B15002_012', # Males with Some College, Less Than 1 Year
            'B15002_013', # Males with Some College, 1 Or More Years, No Degree
            'B15002_014'] # Males with Associate's Degree

We will need the published estimates and margins of error

In [4]:

# Estimates end in an "E" and MOEs end in an "M"
cols_E = [i+'E' for i in cols_raw]
cols_M = [i+'M' for i in cols_raw]

cols_E

Out[4]:

['B01003_001E',
 'B03001_003E',
 'B19313_001E',
 'B15002_012E',
 'B15002_013E',
 'B15002_014E']

In [5]:

cols_M

Out[5]:

['B01003_001M',
 'B03001_003M',
 'B19313_001M',
 'B15002_012M',
 'B15002_013M',
 'B15002_014M']

We will pull data from the 2014-2018 detailed tables from the ACS

In [6]:

api_database = 'ACSDT5Y2018'
api_conn = cen.remote.APIConnection(api_database)
cen.explorer.explain(api_database)

Out[6]:

{'American Community Survey: 1-Year Estimates: Detailed Tables 5-Year': 'The American Community Survey (ACS) is an ongoing survey that provides data every year -- giving communities the current information they need to plan investments and services. The ACS covers a broad range of topics about social, economic, demographic, and housing characteristics of the U.S. population.  Summary files include the following geographies: nation, all states (including DC and Puerto Rico), all metropolitan areas, all congressional districts (114th congress), all counties, all places, and all tracts and block groups.  Summary files contain the most detailed cross-tabulations, many of which are published down to block groups. The data are population and housing counts. There are over 64,000 variables in this dataset.'}

The demostration will be for all census tracts in Coconino County, Arizona

In [7]:

g_unit = 'tract'
g_filter = {'state':'04', 'county':'005'} # Coconino County, Arizona

With all the pieces in place, we can now pull the data from the Census API

In [8]:

data_E = api_conn.query(cols_E, geo_unit=g_unit, geo_filter=g_filter)
data_M = api_conn.query(cols_M, geo_unit=g_unit, geo_filter=g_filter)

data_E[cols_E] = data_E[cols_E].astype('int') # convert downloaded data to ints
data_M[cols_M] = data_M[cols_M].astype('int') # convert downloaded data to ints

Here is what the estimates look like

In [9]:

data_E

Out[9]:

	B01003_001E	B03001_003E	B19313_001E	B15002_012E	B15002_013E	B15002_014E	state	county	tract
0	4611	618	155911000	63	201	101	04	005	001101
1	8339	1989	200765900	197	390	198	04	005	001102
2	5193	598	182234100	205	297	90	04	005	001301
3	5633	845	162323000	170	419	97	04	005	001302
4	6735	824	322765600	49	262	337	04	005	002200
5	5891	641	186444600	305	838	169	04	005	002300
6	3795	58	56610200	28	139	56	04	005	942201
7	3467	55	42392000	38	106	15	04	005	942202
8	4726	24	75087400	43	221	123	04	005	944900
9	4438	108	47734200	64	147	76	04	005	945000
10	3943	67	52120200	44	144	64	04	005	945100
11	4499	72	71813000	35	245	103	04	005	945200
12	6959	1076	257579300	31	476	311	04	005	000900
13	3850	636	153190500	70	261	73	04	005	000200
14	6438	2440	155474100	149	291	248	04	005	000300
15	6122	1302	99347200	172	237	158	04	005	000800
16	3517	390	159329400	46	170	93	04	005	000100
17	2720	256	112123900	59	120	74	04	005	001200
18	2903	254	109610200	73	206	45	04	005	001500
19	3706	647	129996900	44	144	97	04	005	000700
20	5414	993	142256900	111	115	144	04	005	000400
21	3619	916	90293300	108	209	130	04	005	000500
22	3475	1054	85740300	149	213	110	04	005	001700
23	7547	707	160342600	163	483	155	04	005	002100
24	12368	2344	55454100	7	37	32	04	005	001000
25	5262	302	218984700	111	583	29	04	005	000600
26	3180	276	157001500	89	125	140	04	005	001600
27	1867	174	57021800	38	189	40	04	005	002000

...and the MOEs

In [10]:

data_M

Out[10]:

	B01003_001M	B03001_003M	B19313_001M	B15002_012M	B15002_013M	B15002_014M	state	county	tract
0	406	170	22714480	42	64	52	04	005	001101
1	615	345	25521201	129	170	118	04	005	001102
2	599	255	22397589	95	114	52	04	005	001301
3	681	200	18576594	86	140	66	04	005	001302
4	635	209	63593535	57	128	141	04	005	002200
5	740	234	23082343	132	221	87	04	005	002300
6	304	37	5737985	16	35	23	04	005	942201
7	321	38	4566029	17	29	13	04	005	942202
8	517	28	10494052	35	101	58	04	005	944900
9	302	60	5082556	33	44	35	04	005	945000
10	316	33	4527569	22	37	22	04	005	945100
11	505	61	8895939	32	92	52	04	005	945200
12	575	318	45061154	37	233	195	04	005	000900
13	410	252	19648625	41	91	56	04	005	000200
14	603	543	22955692	79	92	132	04	005	000300
15	646	255	13744500	91	99	68	04	005	000800
16	277	167	17873519	37	59	50	04	005	000100
17	202	134	14820577	41	87	67	04	005	001200
18	394	160	20598280	58	98	40	04	005	001500
19	372	237	17359157	34	60	68	04	005	000700
20	409	284	15198039	69	53	78	04	005	000400
21	400	237	12601177	76	95	68	04	005	000500
22	131	165	9039383	66	67	56	04	005	001700
23	19	380	19269569	106	164	114	04	005	002100
24	1302	362	8854217	10	37	30	04	005	001000
25	455	135	36625917	96	216	31	04	005	000600
26	152	177	33347773	59	57	98	04	005	001600
27	413	132	14479669	48	86	39	04	005	002000

2. Computing margins of error on combined estimates¶

Three options are available in cenpy:

a. Analytic

b. Pseudo

c. Replicate

a. Analytic¶

The analytic approach uses classic statistical theory to compute the MOEs.

In this first example, we sum three columns to get the unpublished value for males with more than high school diploma, but less than a bachelors degree.

In [11]:

# Summing values
results_sum = analytic_utils.analytic_sum(data_E[['B15002_012E','B15002_013E','B15002_014E']], 
                                          data_M[['B15002_012M','B15002_013M','B15002_014M']])
results_sum

Out[11]:

	est	moe
0	365	92.541882
1	785	243.854465
2	592	157.241852
3	686	177.064960
4	648	198.781287
5	1312	271.724125
6	223	44.833024
7	159	36.041643
8	387	121.614144
9	287	65.192024
10	252	48.342528
11	383	110.417390
12	818	306.076788
13	404	114.446494
14	688	179.245642
15	567	150.685102
16	309	85.732141
17	253	117.213480
18	324	120.697970
19	285	96.850400
20	370	116.850332
21	447	139.373599
22	472	109.457754
23	801	226.115015
24	76	48.672374
25	723	238.396728
26	354	127.804538
27	267	105.929222

Second, we compute per capita income, which is a ratio of aggregate income to total population.

In [12]:

# Ratio
results_ratio = analytic_utils.analytic_ratio(data_E[['B19313_001E','B01003_001E']],
                                              data_M[['B19313_001M','B01003_001M']])
results_ratio

Out[12]:

	est	moe
0	33812.838864	5755.941612
1	24075.536635	3538.230172
2	35092.258810	5914.981952
3	28816.438843	4797.096911
4	47923.622866	10467.668017
5	31649.057885	5581.939327
6	14917.048748	1927.167003
7	12227.285838	1736.695411
8	15888.150656	2819.843781
9	10755.790897	1359.142260
10	13218.412376	1562.277944
11	15961.991554	2668.317933
12	37013.838195	7161.150235
13	39789.740260	6633.342759
14	24149.440820	4222.569512
15	16227.899379	2823.603357
16	45302.644299	6209.514257
17	41222.022059	6249.847617
18	37757.561144	8752.544168
19	35077.415003	5859.853587
20	26275.748061	3438.087137
21	24949.792760	4441.686777
22	24673.467626	2762.555735
23	21245.872532	2553.835527
24	4483.675614	857.494397
25	41616.248575	7835.640446
26	49371.540881	10748.973683
27	30541.938939	10285.681097

Finally, we compute the proportion Hispanic residents are of the total population.

In [13]:

# Proportion
results_prop = analytic_utils.analytic_prop(data_E[['B03001_003E','B01003_001E']],
                                            data_M[['B03001_003M','B01003_001M']])
results_prop

Out[13]:

	est	moe
0	0.134027	0.034929
1	0.238518	0.037446
2	0.115155	0.047274
3	0.150009	0.030524
4	0.122346	0.028808
5	0.108810	0.037296
6	0.015283	0.009672
7	0.015864	0.010862
8	0.005078	0.005899
9	0.024335	0.013418
10	0.016992	0.008258
11	0.016004	0.013439
12	0.154620	0.043874
13	0.165195	0.063046
14	0.379000	0.076509
15	0.212676	0.035091
16	0.110890	0.046674
17	0.094118	0.048766
18	0.087496	0.053821
19	0.174582	0.061502
20	0.183413	0.050594
21	0.253109	0.059212
22	0.303309	0.046085
23	0.093680	0.050351
24	0.189521	0.021416
25	0.057393	0.025171
26	0.086792	0.055506
27	0.093198	0.067629

b. Pseudo¶

The pseudo approach generates random draws from a distribution around each published estimate built using the published estimate and MOE. These random draws are used to compute the MOEs on the combined estimates.

Following the example using the analytic approach, we again compute the sum of the number of males with more than a high school diploma but less than a college degree. First, we need to construct a function returning the sum of the ests arguement in the pseudo function.

In [14]:

def get_sum(ests):
    return ests.sum(axis=1)

Now we pass our get_sum function to the func parameter along with the published estiamtes and margains of error we used above to pseudo. This will return the margain of errors for the summed estimates derived usign the pseudo method.

In [15]:

#Summing
pseudo_results_sum = pseudo_utils.pseudo(get_sum,
                                         ests=data_E[['B15002_012E','B15002_013E','B15002_014E']],
                                         moes=data_M[['B15002_012M','B15002_013M','B15002_014M']])
pseudo_results_sum

Out[15]:

	est	moe
0	365	84.735357
1	785	250.007884
2	592	140.238331
3	686	167.853122
4	648	181.748614
5	1312	272.964043
6	223	41.073046
7	159	37.448461
8	387	120.257335
9	287	64.421342
10	252	48.405048
11	383	106.709926
12	818	295.366184
13	404	122.884843
14	688	187.736771
15	567	150.526368
16	309	89.161340
17	253	121.917382
18	324	123.955399
19	285	89.015580
20	370	116.621658
21	447	148.584868
22	472	116.897372
23	801	253.655756
24	76	42.377793
25	723	229.177557
26	354	128.139737
27	267	106.464339

Next we will demonstrate how to derive the pseudo MOEs for a ratio calculation. Again, we create a function, get_div, to pass to pseudo.

In [16]:

def get_div(ests):
    return ests.iloc[:,0] / (ests.iloc[:,1]*1.0)

We now pass the function to pseudo as done in the summing example.

In [17]:

#Ratio
pseudo_results_ratio = pseudo_utils.pseudo(get_div,
                                           ests=data_E[['B19313_001E','B01003_001E']],
                                           moes=data_M[['B19313_001M','B01003_001M']])
pseudo_results_ratio

Out[17]:

	est	moe
0	33812.838864	5588.264669
1	24075.536635	3486.556631
2	35092.258810	5461.779371
3	28816.438843	5061.058406
4	47923.622866	9465.400697
5	31649.057885	5534.522928
6	14917.048748	2043.322715
7	12227.285838	1854.067618
8	15888.150656	2823.545890
9	10755.790897	1287.138719
10	13218.412376	1464.332840
11	15961.991554	2871.830815
12	37013.838195	6493.285858
13	39789.740260	6507.534991
14	24149.440820	4240.028495
15	16227.899379	2787.390325
16	45302.644299	5949.813275
17	41222.022059	6104.126699
18	37757.561144	7763.523896
19	35077.415003	5580.745826
20	26275.748061	3703.960585
21	24949.792760	4134.406837
22	24673.467626	2625.090636
23	21245.872532	2858.229990
24	4483.675614	907.514745
25	41616.248575	8763.624949
26	49371.540881	9494.487316
27	30541.938939	10340.741340

To produce replicable results, the seed option can be set. This parameter makes a call to numpy.random.seed. The default setting is seed=None and will produce variable results for the MOEs with each call.

In [18]:

#Ratio w/ set seed
pseudo_results_ratio_w_seed = pseudo_utils.pseudo(get_div,
                                                  ests=data_E[['B19313_001E','B01003_001E']],
                                                  moes=data_M[['B19313_001M','B01003_001M']],
                                                  seed=123)
pseudo_results_ratio_w_seed

Out[18]:

	est	moe
0	33812.838864	5199.547268
1	24075.536635	3099.242861
2	35092.258810	5456.081228
3	28816.438843	5159.557118
4	47923.622866	10679.994625
5	31649.057885	5280.682360
6	14917.048748	2124.933063
7	12227.285838	1743.989476
8	15888.150656	2542.562317
9	10755.790897	1277.229939
10	13218.412376	1603.675747
11	15961.991554	2730.129336
12	37013.838195	7144.187530
13	39789.740260	6695.631941
14	24149.440820	4223.362927
15	16227.899379	2663.508198
16	45302.644299	5497.257798
17	41222.022059	5937.705580
18	37757.561144	8643.619015
19	35077.415003	5431.386916
20	26275.748061	3082.989079
21	24949.792760	4105.625732
22	24673.467626	2643.484526
23	21245.872532	3141.699151
24	4483.675614	780.529335
25	41616.248575	7669.268454
26	49371.540881	11278.086226
27	30541.938939	9370.148160

The number of simulations run to produce the results is also adjustable through the sims parameter. Increasing the number of simulations results in more stable MOE values if the algorithm is run multiple times, but takes longer to run. The default setting is sims=100.

In [19]:

#Ratio w/ set number of sims
pseudo_results_ratio_w_sims = pseudo_utils.pseudo(get_div,
                                                  ests=data_E[['B19313_001E','B01003_001E']],
                                                  moes=data_M[['B19313_001M','B01003_001M']],
                                                  sims=1000)
pseudo_results_ratio_w_sims

Out[19]:

	est	moe
0	33812.838864	5920.022199
1	24075.536635	3486.954777
2	35092.258810	5891.939088
3	28816.438843	4746.521858
4	47923.622866	10589.112366
5	31649.057885	5628.217362
6	14917.048748	1951.230050
7	12227.285838	1748.012442
8	15888.150656	2898.571655
9	10755.790897	1359.418504
10	13218.412376	1594.214628
11	15961.991554	2683.506220
12	37013.838195	7080.390284
13	39789.740260	6418.354355
14	24149.440820	4211.844071
15	16227.899379	2791.605432
16	45302.644299	6148.847893
17	41222.022059	6114.515450
18	37757.561144	8853.231000
19	35077.415003	5962.098206
20	26275.748061	3413.756703
21	24949.792760	4305.782054
22	24673.467626	2653.020760
23	21245.872532	2623.506060
24	4483.675614	870.426504
25	41616.248575	8172.820273
26	49371.540881	10552.436291
27	30541.938939	10801.993313

The internal algorithm used to compute the MOE simulates estimates from a normal distribution, which means that fractions of people (or housing units) will be simulated. Since this is not realisitc, whole=True can be set to always generate whole values. Setting whole=False (the default) means that the distribution of simulated values we more closely match the normal distibution from which the simulatations were drawn.

In [20]:

#Sum MOE generated from whole people
pseudo_results_sum_w_whole = pseudo_utils.pseudo(get_sum, 
                                                  ests=data_E[['B15002_012E','B15002_013E','B15002_014E']],
                                                  moes=data_M[['B15002_012M','B15002_013M','B15002_014M']],
                                                  whole=True)
pseudo_results_sum_w_whole

Out[20]:

	est	moe
0	365	95.561324
1	785	263.578440
2	592	150.353658
3	686	173.452135
4	648	213.159069
5	1312	289.587681
6	223	45.635722
7	159	34.752764
8	387	109.892965
9	287	64.969844
10	252	50.101928
11	383	110.514091
12	818	336.864738
13	404	107.684536
14	688	181.048703
15	567	148.519262
16	309	89.188083
17	253	115.830401
18	324	130.379528
19	285	109.033326
20	370	98.904817
21	447	125.572913
22	472	104.921034
23	801	236.569068
24	76	50.355092
25	723	223.743146
26	354	130.769898
27	267	109.245031

In [ ]: