In this notebook we do a few simple reanalyses of the geographic locations of the births and deaths of notable individuals. A citation for the paper from which the data are drawn is below. The data can be obtained from the journal's web site.
Schich et al.
A Network Framework of Cultural History
Science 1 August 2014:
Vol. 345 no. 6196 pp. 558-562
DOI: 10.1126/science.1240064
We import our standard libraries, and also a couple of libraries for handling geographic data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import pyproj
This is one of the data sets used in the Schich et al. paper:
data = pd.read_csv("SchichDataS1_FB.csv.gz", compression="gzip", encoding='utf-8')
print(data.shape)
print(data.columns)
(120211, 20) Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel'], dtype='object')
In the next cell, we calculate the distance between the location where a person was born and the location where she or he died. This involves calculating geodesic distances, which can be handled using the pyproj bindings for the proj library. See here for more discussion about calculating geodesic distances in Python.
df = data.loc[:, ["BLocLat", "BLocLong", "DLocLat", "DLocLong"]].dropna()
dfa = np.asarray(df)
g = pyproj.Geod(ellps='WGS84')
_, _, dists = g.inv(dfa[:, 1], dfa[:, 0], dfa[:, 3], dfa[:, 2])
data["rdist"] = pd.Series(dists / 1000, df.index) # Convert distances to kilometers
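As a rough cross-check on the geodesic distances, one can compare a few values against the haversine great-circle formula. This is a minimal sketch, not the method used above: it assumes a spherical Earth with a mean radius of 6371.0088 km, whereas pyproj works on the WGS84 ellipsoid, so the two can differ by up to roughly half a percent. The function name haversine_km and the example coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Illustrative, approximate coordinates (latitude, longitude) in degrees.
paris = (48.8566, 2.3522)
new_york = (40.7128, -74.0060)

d = haversine_km(paris[0], paris[1], new_york[0], new_york[1])
print(round(d))
```

Note that the argument order here is latitude first, while g.inv above takes longitude first.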
A simple starting point is to look at the distribution of the distances between birth locations and death locations. There are two components to the distribution: around 15% of people were born and died in the same location; for the remainder, there is a wide distribution of distances centered at around 10^2.5 ≈ 316 km.
rdist = data["rdist"].dropna()
plt.clf()
plt.hist(np.log(1 + rdist) / np.log(10), bins=50)
plt.xlabel("Log10 distance (km)", size=15)
plt.ylabel("Frequency", size=15)
print(np.mean(rdist == 0))
0.144753807888
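The histogram uses log10(1 + distance) rather than log10(distance), so that people with a distance of exactly zero appear as a spike at 0 instead of being undefined. The short sketch below illustrates the transform on two values; the helper name log10_1p is an assumption made for illustration.

```python
import math


def log10_1p(distance_km):
    """Return log10(1 + distance), the transform used for the histogram."""
    return math.log10(1.0 + distance_km)


# Same birth and death location: the transform maps 0 km to exactly 0.
zero = log10_1p(0.0)

# A distance of 10^2.5 km: the added 1 barely changes the result,
# so the value stays very close to 2.5 on the log10 scale.
typical = log10_1p(10 ** 2.5)

print(zero)
print(round(typical, 3))
```

For distances of more than a few tens of kilometers the added 1 is negligible, so the horizontal axis can be read as log10 kilometers.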
We can also look at how the distance between birth locations and death locations has changed over time. We start with a simple scatterplot.
data["AYear"] = (data["BYear"] + data["DYear"]) / 2
plt.clf()
plt.plot(data["AYear"], data["rdist"], 'o', alpha=0.2)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Distance (km)", size=15)
There are far more data records from recent history, so this plot may be misleading due to overplotting. We can fit a lowess smooth of the conditional mean to see how the average distance has changed over time. We do this analysis on the log scale. There is a rapid increase in the mean around 1700, and perhaps a decrease in recent decades.
from statsmodels.nonparametric.smoothers_lowess import lowess
df = data.loc[:, ["rdist", "AYear"]].dropna()
df.sort_values(by="AYear", inplace=True)
dfa = np.asarray(df)
dfa[:, 0] = np.log(1 + dfa[:, 0]) / np.log(10)
lfit = lowess(dfa[:, 0], dfa[:, 1], frac=0.1)
plt.clf()
plt.plot(dfa[:, 1], dfa[:, 0], 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
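A cruder alternative to the lowess smooth is to average the log distances within fixed-width bins of years. This is not what the cell above does; it is a sketch of a different and simpler technique that can serve as a sanity check on the shape of the fitted curve. The function name binned_means, the bin width, and the toy values are all made up for illustration.

```python
from collections import defaultdict


def binned_means(years, values, width=100):
    """Mean of `values` within consecutive `width`-year bins, keyed by bin start."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for year, value in zip(years, values):
        start = int(year // width) * width  # e.g. 1680 -> 1600 when width is 100
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Made-up toy values, not taken from the data set.
years = [1510, 1590, 1620, 1680, 1710, 1790]
log_dist = [1.0, 2.0, 2.0, 3.0, 2.5, 3.5]

means = binned_means(years, log_dist)
for start in sorted(means):
    print(start, means[start])
```

Unlike lowess, binned means give a step function whose appearance depends on where the bin edges fall, which is one reason to prefer the smoother for the main plot.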
I was curious about some of the people who lived long ago and yet travelled great distances during their lives:
ii = (data["rdist"] > 3000) & (data["AYear"] < 1000)
data.loc[ii, ["PrsLabel", "AYear", "BLocLabel", "DLocLabel", "rdist"]]
Wikipedia says that Kumarajiva was born and died in China and that his father was from Kashmir.
One more drill-down check -- who has the greatest distance between their birth and death locations? The circumference of the earth is 40,075km, so the maximum possible distance is half of this value.
ii = rdist.idxmax()  # index label of the largest distance, suitable for .loc
print(data.loc[ii, :])
print(40075. / 2)
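The bound used above is half of the equatorial circumference. The sketch below makes that explicit and shows the radius the figure implies, which is close to the WGS84 semi-major axis of 6378.137 km. The helper exceeds_bound is a hypothetical name introduced for illustration; it simply flags any distance larger than the bound, which would indicate a problem in the coordinates or in the calculation.

```python
import math

EQUATORIAL_CIRCUMFERENCE_KM = 40075.0  # value quoted in the text

# Half the circumference: an upper bound on any birth-to-death distance.
max_distance_km = EQUATORIAL_CIRCUMFERENCE_KM / 2

# Radius implied by the quoted circumference.
implied_radius_km = EQUATORIAL_CIRCUMFERENCE_KM / (2 * math.pi)


def exceeds_bound(distance_km, bound_km=max_distance_km):
    """Return True if a distance is larger than the bound (hypothetical check)."""
    return distance_km > bound_km


print(max_distance_km)
print(round(implied_radius_km, 1))
print(exceeds_bound(19000.0))
```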
We repeat the same analysis using only the data from 1600 to the present.
df = data.loc[:, ["rdist", "AYear"]].dropna()
df = df.loc[df["AYear"] >= 1600, :]
df.sort_values(by="AYear", inplace=True)
dfa = np.asarray(df)
dfa[:, 0] = np.log(1 + dfa[:, 0]) / np.log(10)
lfit = lowess(dfa[:, 0], dfa[:, 1], frac=0.1)
plt.clf()
plt.plot(dfa[:, 1], dfa[:, 0], 'o', color='grey', alpha=0.05)
plt.plot(lfit[:,0], lfit[:,1], '-', color='lime', lw=3, alpha=0.9)
plt.xlabel("Year", size=15)
_ = plt.ylabel("Log10 distance", size=15)
In this section we will make some simple maps showing the birth locations.
df = data.loc[:, ["BLocLat", "BLocLong"]].dropna()
latit = np.asarray(df["BLocLat"])
longit = np.asarray(df["BLocLong"])
The following cell shows the natural way to make a map of the birth locations. However, it currently does not work due to a bug in Basemap. See the next cell for a workaround.
#mp = Basemap()
#plt.figure(figsize=(16, 12))
#mp.drawcoastlines()
#mp.plot(longit[0:10000], latit[0:10000], '.', latlon=True, color='blue')
mp = Basemap()
plt.figure(figsize=(16, 12))
mp.drawcoastlines()
x, y = mp(longit, latit)
mp.plot(x, y, 'o', color='blue', alpha=0.5, ms=4, latlon=False)
data.columns
Index([u'Unnamed: 0', u'PrsID', u'PrsLabel', u'BYear', u'BLocLabel', u'BLocID', u'BLocLat', u'BLocLong', u'DYear', u'DLocLabel', u'DLocID', u'DLocLat', u'DLocLong', u'Gender', u'PerformingArts', u'Creative', u'Gov/Law/Mil/Act/Rel', u'Academic/Edu/Health', u'Sports', u'Business/Industry/Travel', u'rdist', u'AYear'], dtype='object')
Someone was supposedly born in the Pacific Ocean about one third of the way from Vancouver to Honolulu. Who was this? Is the point in the correct place?
Make a histogram showing the distribution of birth years.
For the people in the dataset who were born in each century, determine the proportion who were born in the southern hemisphere. Make a graph plotting this proportion against time. Do the same for the western hemisphere.