#!/usr/bin/env python
# coding: utf-8
# # Exploratory Computing with Python
# *Developed by Mark Bakker*
# ## Notebook 11: Distribution of the mean, hypothesis tests, and the central limit theorem
# In this notebook, we first investigate the distribution of the mean of a dataset, then simulate
# several hypothesis tests, and finish by exploring the central limit theorem.

# In[1]:

import numpy as np
import matplotlib.pyplot as plt
import numpy.random as rnd
get_ipython().run_line_magic('matplotlib', 'inline')

# Consider a dataset of 100 points. The data are drawn from a normal distribution with mean 4 and
# standard deviation 2. As we noticed before, the sample mean of the 100 data points almost always
# differs from 4. And every time we generate a new set of 100 points, the mean will be somewhat
# different.

# In[2]:

for i in range(5):
    a = 2 * rnd.standard_normal(100) + 4
    print('mean a:', np.mean(a))

# In fact, the mean of the dataset itself can be considered as a random variable with a
# distribution of its own.

# ### Sample standard deviation
# The sample standard deviation $s_n$ of a dataset of $n$ values is defined as
#
# $s_n = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x}_n)^2 }$
#
# and can be computed with the `std` function of the `numpy` package. By default, the `std`
# function divides the sum by $n$ rather than by $n-1$. To divide by $n-1$, as is needed for an
# unbiased estimate of the variance, specify the keyword argument `ddof=1` in the `np.std`
# function.

# ### Exercise 1. Histogram of the means of datasets with 100 values
# Generate 1000 datasets, each with 100 values drawn from a normal distribution with mean 4 and
# standard deviation 2; use a seed of 22. Compute the mean of each dataset and store them in an
# array of length 1000. Compute the mean of the means and the standard deviation of the means, and
# print them to the screen. Draw a boxplot of the means. In a separate figure, draw a histogram of
# the means. Make sure the vertical axis of the boxplot and the horizontal axis of the histogram
# extend from 3 to 5.

# In[ ]:


# Answers to Exercise 1

# ### Exercise 2. Histogram of the means of datasets with 1000 values
# Repeat Exercise 1, but now generate 1000 datasets each with 1000 values (rather than 100 values)
# drawn from the same normal distribution with mean 4 and standard deviation 2, again with a seed
# of 22. Make sure the vertical axis of the boxplot and the horizontal axis of the histogram extend
# from 3 to 5, so that the graphs can be compared to the graphs you created in the previous
# exercise. Is the spread of the mean much smaller now than for the datasets consisting of only 100
# values?

# Answers to Exercise 2

# ### Sample standard deviation of the sample mean
# The histogram of the means looks like the bell-shaped curve of a Normal distribution, but you may
# recall that it is actually a Student's $t$-distribution, also simply called a $t$-distribution. A
# $t$-distribution arises when estimating the mean of a normally distributed variable in situations
# where the sample size is relatively small and the standard deviation is unknown (as it pretty
# much always is in practice) and needs to be estimated from the data.
#
# The sample mean of a dataset of $n$ values is commonly written as $\overline{x}_n$, while the
# sample standard deviation is written as $s_n$ (as defined above). Here, we are computing the
# sample standard deviation of the sample means, which we write as $\hat{s}_n$ for a dataset of
# size $n$. Theoretically, the value of the sample standard deviation of the sample mean
# $\hat{s}_n$ is related to the sample standard deviation as (see
# [here](http://en.wikipedia.org/wiki/Standard_deviation#Standard_deviation_of_the_mean))
#
# $\hat{s}_n = s_n / \sqrt{n}$
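# As a quick side check, here is a minimal sketch illustrating the `ddof=1` keyword and the
# $\hat{s}_n = s_n / \sqrt{n}$ relation for a single simulated dataset (the seed of 0 and the
# dataset size of 100 are arbitrary choices for this illustration).

# In[ ]:

# minimal check of ddof=1 and of s_n / sqrt(n) (illustration only; seed and size are arbitrary)
rnd.seed(0)
a = 2 * rnd.standard_normal(100) + 4
print('std dividing by n:  ', np.std(a))
print('std dividing by n-1:', np.std(a, ddof=1))
print('estimated standard deviation of the mean:', np.std(a, ddof=1) / np.sqrt(len(a)))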
# ### Percentiles of $t$-distribution
# You may recall that the 90% interval around the mean for a Normally distributed variable runs
# from $\mu - 1.64\sigma$ to $\mu + 1.64\sigma$. In other words, 5% of the data is expected to lie
# below $\mu - 1.64\sigma$ and 5% of the data is expected to lie above $\mu + 1.64\sigma$. What now
# if you forgot that it is $1.64\sigma$ to the left and right of the mean? Or what if you want to
# know the value for some other percentile? You may look that up in a table in a statistics book
# (or on the web), or use the percent point function `ppf`, which is part of any statistical
# distribution defined in the `scipy.stats` package. The `ppf` function is the inverse of the
# cumulative distribution function. For example, `ppf(0.05)` returns the value of the data such
# that the cumulative distribution function is equal to 0.05 at the returned value. To find the 5%
# and 95% values, type the following (recall that by default the `norm` distribution has mean zero
# and standard deviation 1; you can specify different values with the `loc` and `scale` keyword
# arguments, respectively).

# In[3]:

from scipy.stats import norm
xvalue_05 = norm.ppf(0.05)
xvalue_95 = norm.ppf(0.95)
print('5% limit:', xvalue_05)
print('95% limit:', xvalue_95)
print('check if it works for 5%:', norm.cdf(xvalue_05))
print('check if it works for 95%:', norm.cdf(xvalue_95))
# Next, specify a mean and standard deviation
xvalue_05_musig = norm.ppf(0.05, loc=20, scale=10)  # mu = 20, sigma = 10
print('5% limit with mu=20, sig=10:', xvalue_05_musig)
print('check:', norm.cdf(xvalue_05_musig, loc=20, scale=10))

# A similar function exists for the $t$-distribution. The $t$-distribution takes one additional
# argument: the number of degrees of freedom, which is equal to the number of data points minus 1.
# For example, for a sample with 40 data points, a sample mean of 20, and a sample standard
# deviation of the mean of 2, the 5th and 95th percentiles are

# In[4]:

from scipy.stats import t
xvalue_05 = t.ppf(0.05, 39, loc=20, scale=2)
xvalue_95 = t.ppf(0.95, 39, loc=20, scale=2)
print('5% limit:', xvalue_05)
print('95% limit:', xvalue_95)
print('check if it works for 5%:', t.cdf(xvalue_05, 39, loc=20, scale=2))
print('check if it works for 95%:', t.cdf(xvalue_95, 39, loc=20, scale=2))

# ### Exercise 3. Count the number of means outside the 95% interval
# Go back to Exercise 1. Generate 1000 datasets each with 100 values drawn from a normal
# distribution with mean 4 and standard deviation 2. For each dataset, evaluate whether the sample
# mean lies within the 95% interval of the $t$-distribution around the true mean of 4 (the standard
# deviation of the sample mean is different every time, of course). Count how many times the sample
# mean falls outside the 95% interval around the true mean. If the theory is correct, it should, of
# course, be the case for about 5% of the datasets. Try five different seeds and report the
# percentage of sample means that falls outside the 95% interval around the true mean.

# In[ ]:


# Answers to Exercise 3
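# As a side note, the distributions in `scipy.stats` also provide an `interval` method that returns
# both limits of a central interval in a single call; a minimal sketch, reusing the numbers of the
# $t$-distribution example above, is

# In[ ]:

from scipy.stats import norm, t
# 90% central intervals: the lower and upper limits equal the 5% and 95% ppf values computed above
print('standard Normal:', norm.interval(0.90))
print('t, 39 dof, loc=20, scale=2:', t.interval(0.90, 39, loc=20, scale=2))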
# ### Exercise 4. $t$-test on dataset of 20 values
# Generate 20 data points from a Normal distribution with mean 39 and standard deviation 4. Use a
# seed of 2. Compute and report the sample mean and sample standard deviation of the dataset and
# the sample standard deviation of the sample mean.

# In[ ]:


# If you computed it correctly, the mean of the 20 data points generated above is 38.16. Somebody
# now claims that the 20 data points are taken from a distribution with a mean of 40. You are asked
# to decide whether the true underlying mean could indeed be 40. In statistical terms, you are
# asked to perform a hypothesis test, testing the null hypothesis that the mean is 40 against the
# alternative hypothesis that the mean is not 40 at significance level 5%. Hence, you are asked to
# perform a two-sided $t$-test. All you can do in hypothesis testing is try to reject the null
# hypothesis, so let's try that. Most statistics books give a cookbook recipe for performing a
# $t$-test. Here we will visualize the $t$-test. We reject the null hypothesis if the sample mean
# is outside the 95% interval around the mean of the corresponding $t$-distribution. If the mean is
# inside the 95% interval, we can only conclude that there is not enough evidence to reject the
# null hypothesis. Draw the probability density function of a $t$-distribution with mean 40 and
# standard deviation equal to the sample standard deviation of the sample mean you computed above.
# Draw red vertical lines indicating the left and right limits of the 95% interval around the mean.
# Draw a heavy black vertical line at the position of the sample mean you computed above. Decide
# whether you can reject the null hypothesis that the mean is 40 and add that as a title to the
# figure.

# In[ ]:


# Answers to Exercise 4

# ### Exercise 5. Hypothesis tests on wooden beam data
# Load the dataset of experiments on wooden beams stored in the file `douglas_data.csv`. First,
# consider the first 20 measurements of the bending strength. Compute the sample mean and the
# standard deviation of the sample mean. The manufacturer claims that the mean bending strength is
# only 50 N/mm$^2$. Perform a $t$-test (significance level 5%) with the null hypothesis that the
# mean is indeed 50 N/mm$^2$ and the alternative hypothesis that the mean is not 50 N/mm$^2$, using
# the approach applied in Exercise 4.

# In[ ]:


# Repeat the $t$-test above, but now with all the measurements of the bending strength. Do you
# reach the same conclusion?

# In[ ]:


# Answers to Exercise 5
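# As an optional cross-check of the visual $t$-test used in Exercises 4 and 5, the `ttest_1samp`
# function of `scipy.stats` returns the $t$-statistic and the two-sided $p$-value directly; the
# null hypothesis is rejected at the 5% significance level when the $p$-value is below 0.05. A
# minimal sketch, reusing the 20 data points of Exercise 4 (seed of 2) purely as an illustration:

# In[ ]:

from scipy.stats import ttest_1samp
rnd.seed(2)
data = 4 * rnd.standard_normal(20) + 39
result = ttest_1samp(data, 40)  # null hypothesis: the true mean is 40
print('t-statistic:', result.statistic)
print('two-sided p-value:', result.pvalue)
print('reject H0 at 5% significance level?', result.pvalue < 0.05)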
# ### Central limit theorem
# So far, we looked at the distribution of the sample mean of a dataset while we knew that the data
# were taken from a normal distribution (except for the wooden beam data, but those looked very
# much like a Normal distribution). Such a sample mean has a Student $t$-distribution, which
# approaches the Normal distribution when the dataset is large. Actually, 100 data points are
# already enough to approach the Normal distribution fairly closely. You may check this by
# comparing, for example, the percent point function `ppf` of a Normal distribution with a
# $t$-distribution with 99 degrees of freedom, or by simply plotting the pdf of both distributions:

# In[5]:

print('95 percentile Standard Normal:', norm.ppf(0.95))
print('95 percentile t-dist with n=99:', t.ppf(0.95, 99))
x = np.linspace(-4, 4, 100)
y1 = norm.pdf(x)
y2 = t.pdf(x, 99)
plt.plot(x, y1, 'b', label='Normal')
plt.plot(x, y2, 'r', label='t-dist')
plt.legend();

# The central limit theorem now states that the distribution of the sample mean approaches the
# Normal distribution in the limit, even if the dataset is drawn from an entirely different
# distribution! We are going to test this theorem by drawing numbers from a Gamma distribution.
# The Gamma distribution is a skewed distribution that takes a shape parameter $k$ and a scale
# parameter $\theta$, and is defined for $x>0$. Details on the Gamma distribution can be found, for
# example, [here](http://en.wikipedia.org/wiki/Gamma_distribution). Let's set the shape parameter
# equal to 2 and the scale parameter equal to 1 (which happens to be the default). When the scale
# parameter is equal to 1, the mean is equal to the shape parameter. The pdf of the Gamma
# distribution for these values is shown below. The mean is indicated with the red vertical line.

# In[6]:

from scipy.stats import gamma
x = np.linspace(1e-6, 10, 100)
y = gamma.pdf(x, 2, scale=1)
plt.plot(x, y)
plt.axvline(2, color='r');

# Random numbers may be drawn from any distribution in the `scipy.stats` package with the `rvs`
# function. Here, we draw 1000 numbers and add the histogram to the previous figure.

# In[7]:

x = np.linspace(1e-6, 10, 100)
y = gamma.pdf(x, 2)
plt.plot(x, y)
plt.axvline(2, color='r')
data = gamma.rvs(2, size=1000)
plt.hist(data, bins=20, density=True);

# ### Exercise 6. Explore the central limit theorem for the Gamma distribution
# Generate $N$ datasets of 20 numbers randomly drawn from a Gamma distribution with shape parameter
# equal to 2 and scale equal to 1. Draw a histogram of the means of the $N$ datasets using 20 bins.
# On the same graph, draw the pdf of the Normal distribution using the mean of the means and the
# sample standard deviation of the means; choose the limits of the $x$-axis between 0 and 4. Make 3
# graphs, for $N=100, 1000, 10000$, and notice that the distribution starts to approach a Normal
# distribution. Add a title to each graph stating the number of datasets.

# In[ ]:


# Answers to Exercise 6

# ### Answers to the exercises

# Answers to Exercise 1

# In[8]:

rnd.seed(22)
mean_of_data = np.mean(2 * rnd.standard_normal((1000, 100)) + 4, 1)
print('The mean of the means is:', np.mean(mean_of_data))
print('The standard deviation of the means is:', np.std(mean_of_data, ddof=1))
plt.figure()
plt.boxplot(mean_of_data)
plt.ylim(3, 5)
plt.figure()
plt.hist(mean_of_data, density=True)
plt.xlim(3, 5)

# Back to Exercise 1
#
# Answers to Exercise 2

# In[9]:

rnd.seed(22)
mean_of_data = np.mean(2 * rnd.standard_normal((1000, 1000)) + 4, 1)
print('The mean of the means is:', np.mean(mean_of_data))
print('The standard deviation of the means is:', np.std(mean_of_data, ddof=1))
plt.figure()
plt.boxplot(mean_of_data)
plt.ylim(3, 5)
plt.figure()
plt.hist(mean_of_data)
plt.xlim(3, 5)

# Back to Exercise 2
#
# Answers to Exercise 3

# In[10]:

from scipy.stats import t
for s in [22, 32, 42, 52, 62]:
    rnd.seed(s)
    data = 2.0 * rnd.standard_normal((1000, 100)) + 4.0
    mean = np.mean(data, 1)
    sighat = np.std(data, axis=1, ddof=1) / np.sqrt(100)
    count = 0
    for i in range(1000):
        low = t.ppf(0.025, 99, loc=4, scale=sighat[i])
        high = t.ppf(0.975, 99, loc=4, scale=sighat[i])
        if (mean[i] < low) or (mean[i] > high):
            count += 1
    print('percentage of datasets where the sample mean is outside the 95% interval:',
          count * 100 / 1000)

# Back to Exercise 3
#
# Answers to Exercise 4

# In[11]:

rnd.seed(2)
data = 4 * rnd.standard_normal(20) + 39
mu = np.mean(data)
sig = np.std(data, ddof=1)
sighat = np.std(data, ddof=1) / np.sqrt(20)
print('mean of the data:', mu)
print('std of the data:', sig)
print('std of the mean:', sighat)

# In[12]:

x = np.linspace(37, 43, 100)
y = t.pdf(x, 19, loc=40, scale=sighat)
plt.plot(x, y)
perc025 = t.ppf(0.025, 19, loc=40, scale=sighat)
perc975 = t.ppf(0.975, 19, loc=40, scale=sighat)
plt.axvline(perc025, color='r')
plt.axvline(perc975, color='r')
plt.axvline(mu, color='k', lw=5)
plt.title('H0 cannot be rejected');
# Back to Exercise 4
#
# Answers to Exercise 5

# In[13]:

from pandas import read_csv
w = read_csv('douglas_data.csv', skiprows=[1], skipinitialspace=True)
mu20 = np.mean(w.bstrength[:20])
sig20 = np.std(w.bstrength[:20], ddof=1) / np.sqrt(20)
print('sample mean, standard deviation of sample mean:', mu20, sig20)
x = np.linspace(30, 70, 100)
y = t.pdf(x, 19, loc=50, scale=sig20)
plt.plot(x, y)
perc025 = t.ppf(0.025, 19, loc=50, scale=sig20)
perc975 = t.ppf(0.975, 19, loc=50, scale=sig20)
plt.axvline(perc025, color='r')
plt.axvline(perc975, color='r')
plt.axvline(mu20, color='k', lw=4)
plt.title('H0 is rejected: mean is not 50 N/mm2');

# In[14]:

from pandas import read_csv
w = read_csv('douglas_data.csv', skiprows=[1], skipinitialspace=True)
N = len(w.bstrength)
mu = np.mean(w.bstrength)
sig = np.std(w.bstrength, ddof=1) / np.sqrt(N)
print('sample mean, standard deviation of sample mean:', mu, sig)
x = np.linspace(30, 70, 100)
y = t.pdf(x, N - 1, loc=50, scale=sig)
plt.plot(x, y)
perc025 = t.ppf(0.025, N - 1, loc=50, scale=sig)
perc975 = t.ppf(0.975, N - 1, loc=50, scale=sig)
plt.axvline(perc025, color='r')
plt.axvline(perc975, color='r')
plt.axvline(mu, color='k', lw=4)
plt.title('Not enough evidence to reject H0: mean may very well be 50 N/mm2');

# Back to Exercise 5
#
# Answers to Exercise 6

# In[15]:

from scipy.stats import norm, gamma
for N in [100, 1000, 10000]:
    data = gamma.rvs(2, size=(N, 20))
    mean_of_data = np.mean(data, 1)
    mu = np.mean(mean_of_data)
    sig = np.std(mean_of_data, ddof=1)
    plt.figure()
    plt.hist(mean_of_data, bins=20, density=True)
    x = np.linspace(0, 4, 100)
    y = norm.pdf(x, loc=mu, scale=sig)
    plt.plot(x, y, 'r')
    plt.title('N=' + str(N))

# Back to Exercise 6
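# As a quick theoretical check of Exercise 6: for a Gamma distribution with shape $k=2$ and scale
# $\theta=1$, the variance is $k\theta^2=2$, so the central limit theorem predicts that the mean of
# 20 values is approximately Normally distributed with mean 2 and standard deviation
# $\sqrt{2/20} \approx 0.32$. A minimal sketch comparing this prediction with a simulation of
# 10000 datasets is

# In[ ]:

print('theoretical std of the mean of 20 values:', np.sqrt(2 / 20))
data = gamma.rvs(2, size=(10000, 20))
print('simulated std of the mean of 20 values:  ', np.std(np.mean(data, 1), ddof=1))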