Title: Simulating Income Distributions Author: Thomas M. Breuel
from pylab import *
from urllib2 import urlopen
from bisect import bisect
imshow(imread(urlopen("http://imgick.cleveland.com/home/cleve-media/width620/img/datacentral/photo/13425050-mmmain.png")))
<matplotlib.image.AxesImage at 0x7f91bc23d910>
data = genfromtxt("usincome.csv",skip_header=2,delimiter=",",names="lo,hi,n,mean,stddev")
for i in range(5): print data[i]
(0.0, 4999.0, 4245.0, 1249.0, 50.0) (5000.0, 9999.0, 5128.0, 7923.0, 30.0) (10000.0, 14999.0, 7149.0, 12389.0, 28.0) (15000.0, 19999.0, 7370.0, 17278.0, 26.0) (20000.0, 24999.0, 7037.0, 22162.0, 27.0)
data = array([list(x) for x in data])
plot(data[:,2])
[<matplotlib.lines.Line2D at 0x7f91bc03bd90>]
idist = concatenate([data[:-2,2],[data[-2,2]/10]*10,[data[-1,2]/50]*50])
plot(idist)
[<matplotlib.lines.Line2D at 0x7f91b222d9d0>]
cdist = add.accumulate(idist)
cdist = cdist*1.0/amax(cdist)
plot(cdist)
[<matplotlib.lines.Line2D at 0x7f91b21e2f10>]
bisect(cdist,rand())
6
def empsample(dist,n=1):
"""Given an array representing a histogram, return a random bin number
according to that histogram."""
cdist = add.accumulate(dist)
cdist = cdist*1.0/amax(cdist)
result = [bisect(cdist,rand()) for i in range(n)]
if n==1: return result[0]
else: return array(result,'f')
Let's generate a population of 100000 individuals and their incomes.
N = 100000
incomes = 5000*(empsample(idist,n=N)+rand(100000))
The income statistics match what they should be pretty well. We ought to check this because we had to guess in the construction of the histogram at the top end.
print mean(incomes)
print median(incomes)
67363.1779198 49480.6992211
_=hist(incomes,bins=200)
Usually, what we look at in order to measure inequality is the Gini index.
This is defined in terms of the Lorenz curve.
The Lorenz curve gives the cumulative fraction of the total wealth/income of all the individuals at the given income rank or below.
That is, we sort the individuals by income, accumulate the incomes, and normalize by the total.
lorenz_curve = add.accumulate(sorted(incomes))
lorenz_curve /= amax(lorenz_curve)
plot(lorenz_curve)
[<matplotlib.lines.Line2D at 0x7f91b1c52310>]
If everybody makes about the same income, the Lorenz curve is a straight line.
uniform_lorenz = add.accumulate(sorted(66000.0+0.0001*rand(100000)))
uniform_lorenz /= amax(uniform_lorenz)
plot(lorenz_curve)
plot(uniform_lorenz)
[<matplotlib.lines.Line2D at 0x7f91b20fc6d0>]
The Gini index is defined as a ratio of the areas on the Lorenz curve diagram. If the area between the line of perfect equality and the Lorenz curve is A, and the area under the Lorenz curve is B, then the Gini index is A / (A + B). Since A + B = 0.5, the Gini index is G = 2 * A or G = 1 – 2 B.
gini = 2*sum(linspace(0.0,1.0,len(lorenz_curve))-lorenz_curve)/len(lorenz_curve)
print gini
0.463563689866
That's about right; the US Gini coefficient in 2007 was 0.45.
def gini(samples):
lorenz_curve = add.accumulate(1.0*numpy.sort(samples))
lorenz_curve /= amax(lorenz_curve)
return 2*sum(linspace(0.0,1.0,len(lorenz_curve))-lorenz_curve)/len(lorenz_curve)
gini(incomes)
0.46356368986609653
Let's look at the income distribution again.
_=hist(incomes,bins=100)
Can we match this with a parametric distribution?
The uniform distribution doesn't work; its shape is all wrong.
uincomes = rand(N)*2*mean(incomes)
_=hist(uincomes,bins=100)
It produces a variance and a Gini coefficient that are too small. We can't even reproduce the US income with that.
print var(uincomes)**.5
print gini(uincomes)
38872.839274 0.333506405605
print mean(incomes)
print median(incomes)
print var(incomes)**.5
67363.1779198 49480.6992211 66558.6685609
It turns out that the log-normal distribution is a fairly good match for income distributions, except at the high end, where it produces too many outliers.
Log-normal distribution:
$X = e^{\mu+\sigma Z}$
We want to match the parameters of the log-normal distribution to the actual parameters of the US income distribution. We can do that using these formulas.
mean: $e^{\mu+\sigma^2/2}$
median: $e^\mu$
variance: $(e^{\sigma^2}-1) ~ e^{2\mu+\sigma^2}$
We cap incomes at USD 1m to avoid the outliers.
mu = 10.8
sigma = 0.82
lnincomes = minimum(1e6,exp(randn(1000000)*sigma+mu))
_=hist(lnincomes,bins=100)
print mean(lnincomes)
print median(lnincomes)
print var(lnincomes)**.5
print gini(lnincomes)
68687.6853526 48969.7035878 67032.6704021 0.438880189217
Why log-normal?
Log-normal tails are different from actual income distribution, but accounts for about 99% of incomes.
data = genfromtxt("usage.csv",skip_header=2,delimiter=",",names="lo,both,m,f")
for i in range(5): print data[i]
(0.0, 21434.0, 10955.0, 10479.0) (5.0, 20785.0, 10624.0, 10162.0) (10.0, 19893.0, 10178.0, 9714.0) (15.0, 21086.0, 10719.0, 10367.0) (20.0, 21154.0, 10684.0, 10470.0)
adist = [row[1] for row in data]
adist /= sum(adist)
plot(adist)
[<matplotlib.lines.Line2D at 0x7f91b1c04610>]
ages = (empsample(adist,n=N)+rand(N))*5.0
_=hist(ages,bins=100)
Let's look at the Gini coefficient of income distributions in the presence of growth.
gincomes = 21000*1.02**ages
print mean(gincomes)
print gini(gincomes)
_=hist(20000*1.02**ages,bins=100,range=(0,120000))
48757.2921303 0.255968790051
gincomes = 10000*1.04**ages
print mean(gincomes)
print gini(gincomes)
_=hist(gincomes,bins=100,range=(0,350000))
64330.8982347 0.474421122864
US historical GDP rate:
1950: USD 293.7b, population: 151m
2010: USD 14498b, population: 308m
100*((14498/293.7)**(1.0/60)-1.0)
6.714492285404083
gincomes = 3000*1.07**minimum(ages,60.0)
print mean(gincomes)
print gini(gincomes)
_=hist(gincomes,bins=100)
65508.3556801 0.529014284993
(Equality and Fairness)
Is this fair? Should older workers get more money due to growth?
Income growth itself is usually considered fair, but it produces inequality.
(Economics and Social Sciences View)
This model is similar to a "random walk" model used in economics; economists usually are looking for more complicated explanations:
(Important Conclusion)
Large differences in Gini coefficients can result from simple differences in overall economic success.
Differences in Gini coefficients between societies/groups do not imply lack of fairness or equal treatment.
Low growth rates imply lower Gini coefficients in simple economies.