!date
Mon Oct 7 09:45:02 PDT 2013
Next week my work on Iraqi mortality will come out! This has been quite an involved process and there are many details that I'm excited to share. And I'm very excited and a little nervous to see how this work is received.
But for today, I am focusing on a little piece of methodology that I need again for a completely different application: bootstrap resampling.
We used bootstrap in the Iraq mortality research to provide robust uncertainty bounds for all of our estimates. For the simple estimates, based on household deaths and crude mortality rates, the bootstrap was a convenience. But for the more complicated estimates, the ones based on the sibling survival method, the bootstrap was essential.
And now for that completely different use: I have a great dataset on US hospital admissions, but I've promised not to share it in certain ways as a condition of use. And I have a great student, who is going to build me the ultimate data browser for this data. But he is not cleared to access the data yet! I don't want any delays, because UW is on the quarter system and this student's independent study will be over before I know it.
Enter the bootstrap: I will make a dataset that looks just like my great dataset, but has all of the columns resampled independently. It will be just fine for testing, and even have simple statistics that make sense. But it will not reveal any data that I am not supposed to reveal. There is an expansive theory of this sort of privacy preserving database transformation (actually several theories). Is there an easily readable result that says bootstrap resampling the columns independently leaks no information beyond publishing marginal distributions? Or that it leaks more?? (That can't be, can it?)
import numpy as np, pandas as pd
As far as doing the work, the bootstrap in Python is quite simple.
def bootstrap_resample(X, n=None):
""" Bootstrap resample an array_like
Parameters
----------
X : array_like
data to resample
n : int, optional
length of resampled array, equal to len(X) if n==None
Results
-------
returns X_resamples
"""
if n == None:
n = len(X)
resample_i = np.floor(np.random.rand(n)*len(X)).astype(int)
X_resample = X[resample_i]
return X_resample
X = arange(10000)
X_resample = bootstrap_resample(X, n=5000)
print 'original mean:', X.mean()
print 'resampled mean:', X_resample.mean()
original mean: 4999.5 resampled mean: 5030.4096
def test_bsr_shape():
# test without resampling length parameter
X = arange(10000)
X_resample = bootstrap_resample(X)
assert X_resample.shape == (10000,), 'resampled length should be 10000'
# test with resampling length parameter
n = 5000
X_resample = bootstrap_resample(X, n=n)
assert X_resample.shape == (n,), 'resampled length should be %d' % n
test_bsr_shape()
def test_bsr_mean():
# test that means are close
np.random.seed(123456) # set seed so that randomness does not lead to failed test
X = arange(10000)
X_resample = bootstrap_resample(X, 5000)
assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal'
test_bsr_mean()
I want to use this easily with Pandas, but there is a little bit of trouble with indexing to watch out for:
def test_bsr_on_df():
# test that means are close for pd.DataFrame with unusual index
np.random.seed(123456) # set seed so that randomness does not lead to failed test
X = pd.Series(arange(10000), index=arange(10000)*10)
X_resample = bootstrap_resample(X, 5000)
print X_resample.mean(), X.mean()
assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal'
test_bsr_on_df()
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) <ipython-input-47-41ba30c94f56> in <module>() 8 assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal' 9 ---> 10 test_bsr_on_df() <ipython-input-47-41ba30c94f56> in test_bsr_on_df() 6 X_resample = bootstrap_resample(X, 5000) 7 print X_resample.mean(), X.mean() ----> 8 assert abs(X_resample.mean() - X.mean()) / X.mean() < 1e-2, 'means should be approximately equal' 9 10 test_bsr_on_df() AssertionError: means should be approximately equal
504.862785863 4999.5
Why didn't that work? Because Pandas has silently dealt with the rows for which the indices are missing, returning NaNs, and then silently deat with the NaNs, dropping them from the average calculation.
def bootstrap_resample(X, n=None):
""" Bootstrap resample an array_like
Parameters
----------
X : array_like
data to resample
n : int, optional
length of resampled array, equal to len(X) if n==None
Results
-------
returns X_resamples
"""
if isinstance(X, pd.Series):
X = X.copy()
X.index = range(len(X.index))
if n == None:
n = len(X)
resample_i = np.floor(np.random.rand(n)*len(X)).astype(int)
X_resample = np.array(X[resample_i]) # TODO: write a test demonstrating why array() is important
return X_resample
test_bsr_on_df()
5035.0484 4999.5
And that is all I need to resample my dataset and make a version suitable for quick-starting an independent study:
df = pd.read_csv('/home/j/Project/Models/network_LoS/IA_2007.csv', index_col=0, low_memory=False)
# moderately large file, for a single state/year
df.shape
(367736, 361)
df_resampled = pd.DataFrame(index=df.index, columns=df.columns, dtype=df.dtypes)
for col in df.columns:
df_resampled[col] = bootstrap_resample(df[col])
df.ix[:10,:10]
age | ageday | agemonth | amonth | asource | asourceu | asource_ | atype | aweekend | died | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 85 | -999 | -999 | 1 | 5 | 1 | 1 | 3 | 0 | 0 |
1 | 92 | -999 | -999 | 1 | 5 | 1 | 1 | 3 | 0 | 1 |
2 | 79 | -999 | -999 | 2 | 5 | 1 | 1 | 2 | 0 | 0 |
3 | 85 | -999 | -999 | 1 | 1 | 7 | 7 | 1 | 1 | 0 |
4 | 79 | -999 | -999 | 1 | 5 | 1 | 1 | 3 | 1 | 0 |
5 | 70 | -999 | -999 | 1 | 5 | 1 | 1 | 3 | 0 | 0 |
6 | 86 | -999 | -999 | 1 | 5 | 1 | 1 | 3 | 0 | 0 |
7 | 86 | -999 | -999 | 3 | 5 | 1 | 1 | 2 | 1 | 0 |
8 | 86 | -999 | -999 | 3 | 5 | 1 | 1 | 3 | 1 | 0 |
9 | 79 | -999 | -999 | 2 | 5 | 1 | 1 | 3 | 0 | 0 |
10 | 84 | -999 | -999 | 3 | 5 | 1 | 1 | 3 | 0 | 0 |
df_resampled.ix[:10,:10]
age | ageday | agemonth | amonth | asource | asourceu | asource_ | atype | aweekend | died | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 45 | -999 | -999 | 5 | 1 | NaN | 7 | 3 | 1 | 0 |
1 | 72 | -999 | -999 | 4 | 5 | 1 | 7 | 3 | 0 | 0 |
2 | 66 | -999 | -999 | 1 | -999 | 1 | NaN | 4 | 0 | 0 |
3 | 50 | -999 | -999 | 5 | -999 | 1 | 7 | 3 | 0 | 0 |
4 | 37 | -999 | -999 | 12 | 5 | 3 | 2 | 3 | 0 | 0 |
5 | 41 | -999 | -999 | 6 | 5 | NaN | 7 | 3 | 0 | 0 |
6 | 68 | -999 | -999 | 7 | -999 | 7 | 7 | 3 | 0 | 0 |
7 | 66 | -999 | 0 | 7 | 5 | 1 | 7 | 1 | 0 | 0 |
8 | 22 | -999 | -999 | 10 | 1 | 7 | 1 | 3 | 0 | 0 |
9 | 62 | -999 | -999 | 8 | 5 | 1 | NaN | 3 | 0 | 0 |
10 | 76 | -999 | 0 | 10 | 3 | 1 | 1 | 3 | 0 | 0 |
df.age.mean(), df_resampled.age.mean()
(51.52543128766289, 51.475596623664806)
df_resampled.to_csv('/home/j/Project/Models/network_LoS/IA_2007_resampled.csv')