#!/usr/bin/env python
# coding: utf-8

# *This post was entirely written using the IPython notebook. Its content is BSD-licensed. You can see a static view or download this notebook with the help of nbviewer at [20161015_SimpleStatisticalTestERAttendance.ipynb](http://nbviewer.ipython.org/urls/raw.github.com/flothesof/posts/master/20161015_SimpleStatisticalTestERAttendance.ipynb).*

# In this notebook, I am trying to answer a question arising from the lecture of the following article: J. S. Furyk et al., “Emergency department presentations during the State of Origin rugby league series: a retrospective statewide analysis,” The Medical journal of Australia, vol. 197, no. 11, pp. 663–666, 2012. You can find the PDF [here](https://www.mja.com.au/system/files/issues/197_11_101212/fur11557_fm.pdf).
# 
# The authors of the study conclude they have a statistically significantly lower emergency room attendance during game days. The question is: can we replicate their conclusion from their data?

# # What is this study about? 

# This article is about determining whether there were fewer visits to emergency rooms during rugby event days. The data used in this article is as follows: 
# 
# 
# > Main outcome measures: Number of patients presenting to Queensland EDs on 24 game days and 80 control days.
# 
# > In total, there were 49 702 ED presentations on game days and 172 351 presentations on control days.
# 
# 

# Let's transform this data in a neat little table:

# In[1]:


import pandas as pd
df = pd.DataFrame(data=[[24, 49702], [80, 172351]], 
                  index=['rugby game', 'control'], 
                  columns=['number of days', 'attendees'])
df


# Now, let's do some simple math. What's the average number of people in the ER for these two categories of days?

# In[2]:


df['attendees per day'] = df['attendees'] / df['number of days']
df


# We see that on average, there were less people in the ER on game days than on control days. 83 to be precise:

# In[3]:


df['attendees per day'].loc['control'] - df['attendees per day'].loc['rugby game']


# Is 83 people (more or less), a lot, compared to roughly 2100 per day?

# In[4]:


83 / 2100


# It's 4 percent. The article says that this 4 percent difference is statistically significant. How can we prove that?

# This is a tough question. We need a statiscal model to answer it.
# 
# Why? Because intuitively you know that there are many reasons to observe variations in numbers and doing it over 24 / 80 days averages the result, but maybe not enough. 

# # Single day model 

# Let's build our model for the number of patients for a single day. It is composed of:
# 
# - a mean attendance
# - a standard deviation
# 
# We'll use a gaussian distribution based on these two values. Let's write a function that simulates a single day attendance:

# In[5]:


import numpy as np
np.random.seed(42)


# In[6]:


def single_day(mean, std):
    "Simulates the number of ER attendees for a single day."
    return np.random.normal(loc=mean, scale=std)


# If we say that on average the standard deviation is a 100 people per day, we can simulate a distribution for a single day:

# In[7]:


single_day(2100, 100)


# We can do this for 24 days in a row:

# In[8]:


dist = [single_day(2100, 100) for _ in range(24)]
dist


# Let's ask pandas about this distribution's mean and standard deviation:

# In[9]:


pd.Series(dist).describe()


# One thing that appears in the above data is that the mean of the data does not necessarily reflect the mean we supplied to the generating function (2080 instead of 2100). The same goes for the standard deviation. This is a sample size effect which disappears if we work with large samples. We'll try 10x, 100x and 1000x larger samples below.

# In[10]:


pd.Series([single_day(2100, 100) for _ in range(240)]).describe()


# In[11]:


pd.Series([single_day(2100, 100) for _ in range(2400)]).describe()


# In[12]:


pd.Series([single_day(2100, 100) for _ in range(24000)]).describe()


# We see a progression: 2080, 2100, 2103, 2099. This is a wiggly thing. That's the reason for which we need to do statistical tests to see if it's a significant thing to have a difference in 4% on average.

# # Running statistical tests 

# How do statistical tests work? I'll quote an illustration from Allen Downey found in his wonderful article: ["There is still only one test"](http://allendowney.blogspot.fr/2016/06/there-is-still-only-one-test.html).

# ![onlyonetest](https://lh4.googleusercontent.com/Bud31guq0w0FvylY57VMR0zHkYqxIpYAfOqgZietyvv1n2ToNEHwHKZWYix8pwct8kDKsZKiwvOWm6PIFEL3gBIQmbakQYHwVT02nn9_H8Fht_zaSBlrRNcqwZa950Vb5nt-5B84)

# In[13]:


df['attendees'].sum() / (df['number of days'].sum())


# Our **data** is the number of ER people on game days (49 702) and on control days (172 351).
# 
# Our **model under the H0 hypothesis** ("there is no effect of rugby games") is that on average 2135.125 people visit the ER. Using our previously discussed random model, we can simulate a lot of samples of 24 / 80 days under this assumption.
# 
# **Test statistic**: using the data from the 80 control days and 24 game days, we'll simply substract the number of attendees in the control case from the number of attendees in the game case, normalizing by the number of days in control and game cases.
# 
# Let's use the [HypothesisTest class found in Allen Downey's book Think Stats2](http://greenteapress.com/thinkstats2/html/thinkstats2010.html) for this.

# In[14]:


class HypothesisTest():
    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)

    def PValue(self, iters=1000):
        "Returns p-value of actual data based on simulated data."
        self.test_stats = [self.TestStatistic(self.RunModel()) 
                           for _ in range(iters)]

        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters

    def TestStatistic(self, data):
        "Test statistic for the current test."
        raise UnimplementedMethodException()

    def MakeModel(self):
        pass

    def RunModel(self):
        "Returns a simulated data sample."
        raise UnimplementedMethodException()


# Our only needed simulation parameter is going to be the standard deviation of the attendance or "by how many people can attendance vary per day". We'll use the arbitrarily value of 100 (for now).

# In[15]:


variance_param = 100 


# We can generate our model easily with the function already defined previously:

# In[16]:


class ERTest(HypothesisTest):
    "Model for replicating the Furyk et al. (2012) study."
    def __init__(self, data, variance_param):
        super().__init__(data)
        self.variance_param = variance_param
        
    def TestStatistic(self, simulated_data):
        "The statistic is the relative difference in control vs game day attendees."
        sum_game, sum_control = simulated_data
        test_stat = (sum_control / 80. - sum_game / 24.)
        return test_stat

    def RunModel(self):
        "Returns a simulated data set of 104 observations under the H0 hypothesis."
        simulated_data = (sum(single_day(2135.125, self.variance_param) for _ in range(24)),
                         sum(single_day(2135.125, self.variance_param) for _ in range(80)))
        return simulated_data


# Good, now we're ready to run our simulation.

# In[17]:


data = (49702, 172351)


# In[18]:


ert = ERTest(data, variance_param)


# In[19]:


ert.actual


# In[20]:


ert.PValue(iters=10000)


# What this means is that using the above hypotheses, the likelihood to observe our actual result, with a difference in means of 83 people per day and the given sample sizes (24 / 80) is very unlikely to be due to chance. 
# 

# To get a better feeling for the distribution of the test statistic, let's plot it:

# In[21]:


import matplotlib.pyplot as plt
plt.style.use('bmh')
get_ipython().run_line_magic('matplotlib', 'inline')


# In[22]:


def plot_test_stat(test, title=""):
    "Plots the test statistic distribution and observed value."
    plt.hist(test.test_stats, bins=30, cumulative=False, normed=True)
    ylim = plt.ylim()
    plt.vlines(test.actual, *ylim, label='observed test stat')
    plt.legend(loc='upper left')
    plt.xlabel('test statistic: control - game days (in people per day)')
    plt.title(title)


# In[23]:


plot_test_stat(ert, "variance: {} people per day".format(variance_param))


# What this means is that if we had randomly taken 24 days and 80 days with a standard deviation of 100 people, there's almost no chance of observing a difference in number of people of 83.

# What if the standard deviation was larger? Luckily we can repeat our experiment by changing only this value.

# In[24]:


variance_param = 500
ert1 = ERTest(data, variance_param)
ert1.PValue(iters=10000)


# In[25]:


plot_test_stat(ert1, "variance: {} people per day".format(variance_param))


# Here, we see that the observed difference would be likely to be observed by chance. Having a difference of 83 people per day is likely to be normal over 24/80 days if numbers fluctuate by 500 people every day. 

# The point of all of this is: if we don't know by how much numbers fluctuate every day and how (here I used a Gaussian assumption but really, I have no idea), we can't say if this is statistically significant. I wonder why the authors have not mentioned this basic statistic of their sample in the article.

# # Conclusions and last plot 

# 
# To conclude, let's do a parametric plot of P-values, depending on the unknown parameter, the standard deviation of the number of people per day at the ER in Queensland, Australia.

# In[26]:


pvalues = []
variances = [10, 50, 100, 150, 200, 250, 300, 350, 400, 500, 750, 1000]
for variance_param in variances:
    ert2 = ERTest(data, variance_param)
    pvalues.append(ert2.PValue(iters=1000))


# In[27]:


df2 = pd.DataFrame(data= [variances, pvalues], index=['variance (people per day)', 'p-value']).transpose()
df2


# In[28]:


df2.plot.line(x='variance (people per day)', y='p-value', style='-o')


# The point that this plot makes is: depending on the attendance at the ER, the numbers used in the study can or cannot be statistically significant. If the fluctuation in attendance for a 2100 people ER is above 200 people per day (roughly 10%), the numbers from the study would be judged non significant. On the other hand, if the numbers are around 50 people per day (low fluctuation) then there is a real effect. It would be interesting to contact the author about this point.