#!/usr/bin/env python
# coding: utf-8

# # Introduction to statistics for Geoscientists (with Python)
# ### Lecturer: Gerard Gorman
# ### Lecture 7: Hypothesis testing
# ### URL: [http://ggorman.github.io/Introduction-to-stats-for-geoscientists/](http://ggorman.github.io/Introduction-to-stats-for-geoscientists/)

# ## Inferential statistics
# 
# Given the data below, how do we answer questions such as: is the grain size higher in deposit B than in deposit A?
# 
# ![grain size](http://ggorman.github.io/Introduction-to-stats-for-geoscientists/images/example-grain-size.png)
# 
# We are interested in whether energy levels were different in deposits A and B, as this bears on theories of how they formed. The sample from B has a higher mean, but is this just chance variation?
# 
# To answer this question we:
# 
# * Formulate formal hypotheses.
# * Apply an appropriate statistical hypothesis test.

# ## Hypotheses
# 
# We formulate two hypotheses.
# 
# The *null hypothesis* (H0):
# 
# > *There is no difference in mean maximum grain size for samples obtained from deposits A and B.*
# 
# ...and an *alternative hypothesis* (H1):
# 
# > *The mean maximum grain size is higher in sample B than it is in sample A.*
# 
# We only accept H1 if it is very unlikely that H0 can explain our observations; otherwise we retain H0.

# ## Testing hypotheses
# 
# Many different types of statistical test exist:
# 
# * To test different types of hypothesis.
# * To handle different types of data.

# ### Student's t-test
# 
# [The t-statistic was introduced in 1908 by William Sealy Gosset, a chemist working for the Guinness brewery in Dublin, Ireland ("Student" was his pen name).](http://en.wikipedia.org/wiki/Student%27s_t-test) Student's t-test assesses the probability that two samples have the same underlying mean. Python provides the method [scipy.stats.ttest_ind](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html). This is a test for the null hypothesis that two independent samples have identical average (expected) values. The test assumes that the populations have identical variances.
# 
# * Takes two arrays of values (i.e. two samples).
# * Returns the *t-statistic*, which describes how different the sample means are.
# * Returns the *p-value*, which tells you the probability of observing a result at least as extreme as ours if the null hypothesis is true.
# 
# In the grain size example above, the t-statistic and p-value are -1.1 and 0.28 respectively. **Is a probability of 0.28 enough to reject H0?**

# In[1]:

# Again, let's do a trivial example to make sure we know what's going on.
import scipy.stats as stats

# Generate two random datasets with a normal distribution and the same mean.
d1 = stats.norm.rvs(size=100000)
d2 = stats.norm.rvs(size=100000)

# ttest_ind is a two-sided test for the null hypothesis that two independent samples
# have identical average (expected) values.
t_statistic, p_value = stats.ttest_ind(d1, d2)
print("t-statistic = ", t_statistic)
print("p-value = ", p_value)
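# As a contrasting sketch (synthetic data, not part of the original grain-size example), here is the same test applied to two samples drawn from populations whose means genuinely differ. The p-value is then typically tiny, so H0 would be rejected.

# In[ ]:

# Sketch: two samples with genuinely different underlying means.
d3 = stats.norm.rvs(loc=0.0, size=1000)
d4 = stats.norm.rvs(loc=0.5, size=1000)  # mean shifted by 0.5

t_statistic, p_value = stats.ttest_ind(d3, d4)
print("t-statistic = ", t_statistic)
print("p-value = ", p_value)  # usually far below 0.05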
# ## P-value
# 
# The p-value measures how consistent the data are with the null hypothesis: it is the probability of observing a test statistic at least as extreme as ours, in the direction of the alternative hypothesis, if the null hypothesis were true. If the p-value is very small, the observations are hard to explain under the null hypothesis; if it is larger than some preset threshold, the observations are considered consistent with it.
# 
# Therefore, if the p-value is less than (or equal to) $\alpha$, the null hypothesis is rejected in favor of the alternative hypothesis; if the p-value is greater than $\alpha$, the null hypothesis is not rejected. Commonly the threshold is set at $\alpha=0.05$, but this choice is arbitrary!
# 
# The steps involved in using the p-value approach to conduct any hypothesis test are:
# 
# * Specify the null and alternative hypotheses and choose the appropriate test.
# * Decide the significance level for the p-value, $\alpha$.
# * Calculate the value of the test statistic and the p-value.
# * Consider: if the null hypothesis is true, what is the probability that we would observe a test statistic at least as extreme as the one we did, in the direction of the alternative hypothesis? Note how this question is equivalent to the question asked in criminal trials: "If the defendant is innocent, what is the chance that we'd observe such extreme criminal evidence?"
# * If the p-value is less than (or equal to) $\alpha$, reject the null hypothesis in favor of the alternative hypothesis. Otherwise, accept (more precisely, fail to reject) the null hypothesis.
# 
# > In 2012, chances are that you heard in the news about the discovery of the Higgs boson. The phrase *five-sigma* was used to describe the strength of the discovery. This corresponds to a p-value of about 3x10$^{-7}$. [Read more.](http://blogs.scientificamerican.com/observations/five-sigmawhats-that/)
# 
# In the grain size example above the p-value is 0.28, which is far above the 0.05 significance level. Therefore, the result is said to be *not significant* - there is not enough evidence to reject H0.

# ## Tails and significance levels
# 
# **One-tailed test**: Form hypothesis H1: *The mean maximum grain size is higher in sample B than it is in sample A*. Now we are testing whether the mean of B is greater than the mean of A, not merely different from the mean of A.
# 
# The figure below illustrates the distribution of all possible t-values - the result is significant if the t-value falls into the shaded region.

# In[2]:

import numpy as np
import matplotlib.pyplot as plt

n = 100
df = n-1
x = np.linspace(stats.t.ppf(0.001, df), stats.t.ppf(0.999, df), n)
p = stats.t.pdf(x, df)

plt.plot(x, p, 'k')
# Shade the lower 5% of the distribution: the one-tailed significance region.
plt.fill_between(x, 0, p, interpolate=True, where=(x<=stats.t.ppf(0.05, df)))
plt.show()

# **Two-tailed test**: Form hypothesis H1: *The mean maximum grain size is different in the two samples*. There are two possible ways it can be different - higher than *or* lower than. A two-tailed test therefore has two *significance regions* - one in each tail - each half the size of the one-tailed significance region.

# In[3]:

plt.plot(x, p, 'k')
# Each tail now holds alpha/2 = 0.025 of the distribution.
plt.fill_between(x, 0, p, interpolate=True,
                 where=((x<=stats.t.ppf(0.025, df)) | (x>=stats.t.ppf(0.975, df))))
plt.show()

# ### Tails in practice
# 
# 1. Decide whether your test is one- or two-tailed BEFORE doing it!
# 2. Values returned by [scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html) functions (e.g. [ttest_ind](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind)) are for two-tailed tests (mostly!):
#    1. If your test is two-tailed, there is no issue.
#    2. If your test is one-tailed, halve the p-value before interpreting it (see the sketch below).
# 
# In the example above the p-value was 0.28, but this is a two-tailed value while the test was one-tailed. The actual value for the test was therefore 0.14. In this case it is still not significant (>0.05), but halving could easily make all the difference.
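# A minimal sketch of that conversion, using synthetic data rather than the grain-size measurements. Halving is only appropriate when the observed difference lies in the hypothesised direction; recent SciPy versions can also return a one-tailed p-value directly via the `alternative` keyword of `ttest_ind`.

# In[ ]:

# H1: mean of b is greater than mean of a (one-tailed).
a = stats.norm.rvs(loc=0.0, size=200)
b = stats.norm.rvs(loc=0.3, size=200)

t_statistic, p_two_tailed = stats.ttest_ind(a, b)

# ttest_ind(a, b) bases t on mean(a) - mean(b), so a larger mean in b gives a negative t.
if t_statistic < 0:
    p_one_tailed = p_two_tailed/2        # difference is in the hypothesised direction
else:
    p_one_tailed = 1 - p_two_tailed/2    # difference is in the opposite direction
print("two-tailed p =", p_two_tailed, "  one-tailed p =", p_one_tailed)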
# ## More on the t-test
# 
# The t-test is the most widely used statistical test, but it is not always appropriate. It makes several assumptions:
# 
# * Sensible results need reasonably large sample sizes.
#   * This applies to other tests too.
# * It assumes the data in the samples are normally distributed.
#   * This may or may not really be true.
#   * If not, consider using non-parametric tests.

# ## Non-parametric equivalents
# 
# The most commonly used non-parametric test is the [Mann-Whitney U test](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html#scipy.stats.mannwhitneyu).
# 
# * Uses the ranks of the data (like Spearman's correlation).
# * Does not require the distributions of the data to be normal.
# * Like all non-parametric tests, it is less powerful (more false negatives) than the t-test when the distributions really are normal.
# 
# When a mathematical model is available, the [Kolmogorov-Smirnov (K-S) test](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest) is commonly used.
# 
# * The K-S test is a 'goodness of fit' test - it measures how well a distribution fits a mathematical model.
# * The 'two-sample' K-S test variant can be used as a non-parametric alternative to the t-test.
# * It tests more than just a difference in means - it is sensitive to other aspects of the distributions' shapes.
# * Use it to answer questions like 'do these two samples seem to come from the same underlying population?' (see the sketch below).
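# Neither test is run in the lecture itself, so the following is a minimal sketch on synthetic (exponentially distributed, i.e. clearly non-normal) data. Note that the `alternative` keyword of `mannwhitneyu` is only available in more recent SciPy releases; older versions return a one-sided p-value by default.

# In[ ]:

# Two skewed samples whose distributions are shifted relative to each other.
skewed1 = stats.expon.rvs(size=500)
skewed2 = stats.expon.rvs(size=500) + 0.2

# Mann-Whitney U test: compares the two samples using ranks.
u_statistic, p_mw = stats.mannwhitneyu(skewed1, skewed2, alternative='two-sided')
print("Mann-Whitney U =", u_statistic, "  p-value =", p_mw)

# Two-sample K-S test: sensitive to any difference in the distributions' shapes.
ks_statistic, p_ks = stats.ks_2samp(skewed1, skewed2)
print("K-S statistic =", ks_statistic, "  p-value =", p_ks)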
# ## Correlation hypothesis tests
# 
# We have already seen that p-values can be obtained for correlations (Pearson or Spearman). We can use these to test hypotheses about correlation:
# 
# * H0: 'Dorset mark and end-of-year mark are not related.'
# * H1: 'There is a correlation between Dorset mark and end-of-year mark.'
# 
# This is a two-tailed test and we will use a 5% significance level.
# 
# * Pearson R is 0.27, p=0.014 --- significant, therefore reject H0 in favor of H1.
# * Spearman R is 0.19, p=0.087 --- not significant, therefore accept H0.
# 
# So, if we can use Pearson we can accept H1, but with Spearman we cannot - there is not enough evidence. We need to do a better job of deciding which to use!
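# The Pearson and Spearman p-values quoted above were computed previously; as a reminder of how such numbers are obtained in Python, here is a minimal sketch on synthetic data (not the Dorset/end-of-year marks).

# In[ ]:

# Two weakly related synthetic variables.
xc = stats.norm.rvs(size=80)
yc = 0.3*xc + stats.norm.rvs(size=80)

r, p_pearson = stats.pearsonr(xc, yc)
rho, p_spearman = stats.spearmanr(xc, yc)
print("Pearson r    =", r,   "  p-value =", p_pearson)
print("Spearman rho =", rho, "  p-value =", p_spearman)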
# ## Tests of normality
# 
# Rather than guessing whether a distribution is normal, can we test for it instead?
# 
# Yes - it just becomes another hypothesis test:
# 
# * $H_0$: 'The underlying distribution is normal.'
# * $H_1$: 'The underlying distribution is not normal.'
# 
# Many different tests of normality exist and there is no universal agreement on which is best. A good choice is the *D'Agostino & Pearson combined test*, provided in Python by the method [scipy.stats.normaltest](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html). This function tests the null hypothesis that a sample comes from a normal distribution.
# 
# This is also a good opportunity to illustrate the use of the p-value with two very clear test cases.

# In[4]:

import numpy as np
from scipy import stats
import pylab
get_ipython().run_line_magic('pylab', 'inline')

# Create a normally distributed sample.
values = stats.norm.rvs(size=10000)
statistic, pvalue = stats.normaltest(values)

pylab.hist(values)
pylab.legend(("p-value=%.2g"%pvalue, ))
pylab.show()

# In[5]:

# However, notice how the p-value varies for different sets of data.
# Even with perfect data you can get false positives!
for i in range(10):
    values = stats.norm.rvs(size=1000)
    statistic, pvalue = stats.normaltest(values)
    print(i, "p-value = ", pvalue)

# In[6]:

# The degree of variability of the p-value is put into context when you compare it
# with other kinds of distributions - e.g. the exponential distribution.
values = stats.expon.rvs(size=1000)
statistic, pvalue = stats.normaltest(values)

pylab.hist(values)
pylab.legend(("p-value=%.2g"%pvalue, ))
pylab.show()

# # Demo - are the marks distributions normal in the first year marks example?

# In[7]:

# Read in the records.
record = np.recfromcsv("./first_year_marks.csv")
print(record.dtype.names)

# In[8]:

field_mark = np.array(record["field_mark"], dtype=float)
k2, p = stats.normaltest(field_mark)
print("Test statistic = ", k2)
print("H0 - field marks come from a normal distribution: ", end="")
if p < 0.05:
    print("Reject in favor of H1.")
else:
    print("Accept.")

pylab.hist(field_mark, bins=5)
pylab.legend(("p-value=%.2g"%p, ))
pylab.show()

# In[9]:

overall_year = np.array(record["overall_year"], dtype=float)
k2, p = stats.normaltest(overall_year)
print("Test statistic = ", k2)
print("H0 - overall year marks come from a normal distribution: ", end="")
if p < 0.05:
    print("Reject in favor of H1.")
else:
    print("Accept.")

pylab.hist(overall_year, bins=5)
pylab.legend(("p-value=%.2g"%p, ))
pylab.show()

# As at least one of these datasets does not have a normal distribution, we should use Spearman's correlation test.

# # Exercise 7.1: Hypothesis Testing
# 
# 1. I have two sets of trilobite fossils from different localities. I think they are from the same species but want some quantitative confirmation - on the basis of their mean length:width ratio ([triloshape1.csv](https://raw.githubusercontent.com/ggorman/Introduction-to-stats-for-geoscientists/gh-pages/data/triloshape1.csv), [triloshape2.csv](https://raw.githubusercontent.com/ggorman/Introduction-to-stats-for-geoscientists/gh-pages/data/triloshape2.csv)), is there evidence that they are different?
# 2. I have measured mica percentages in samples of igneous rocks from two different localities ([micapercent1.csv](https://raw.githubusercontent.com/ggorman/Introduction-to-stats-for-geoscientists/gh-pages/data/micapercent1.csv), [micapercent2.csv](https://raw.githubusercontent.com/ggorman/Introduction-to-stats-for-geoscientists/gh-pages/data/micapercent2.csv)). The mean mica percentage is lower in the second locality and I have a geological theory that may explain this, but I first need to rule out the possibility that the lower mica content is just due to chance.
# 
# In each case construct (i.e. actually write down) your two hypotheses, decide from how they are worded whether they are one-tailed or two-tailed, then use your program to assess which you should accept.
# 
# Hints:
# 
# * Test the data sets for normality (i.e. do they have a normal distribution?). Use the D'Agostino & Pearson test to provide probabilities of normality.
# * Test whether their means differ significantly. When testing whether the means are different you should use either a two-tailed Mann-Whitney p-value or a two-tailed t-test p-value, depending on the result of the normality test. Remember that, depending on your SciPy version, the scipy.stats Mann-Whitney function may return a one-tailed p-value by default.

# In[10]:

# Solution here

# # Practical 7.2: Kolmogorov-Smirnov Test
# 
# * Use the trilobite data above and perform a two-sided, two-sample Kolmogorov-Smirnov test of whether they come from the same distribution or not.
# * Test whether each trilobite dataset comes from an exponential distribution or not. (You will need to generate an exponential distribution for this question, and normalise all the distributions first.)
# 
# **NB**: The necessary function CAN be found in scipy.stats [here](http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ks_2samp.html).
# 
# ## BONUS (Harder)
# 
# * Plot the cumulative distribution functions (CDFs) for the trilobite data in one figure as line graphs.
# * Plot an exponential CDF on top of this as well (with a dashed line).
# * Was there ever any suspicion the data might be exponential?
# 
# **HINT**: histograms have a cumulative argument which you can make use of.

# In[ ]:
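# For reference, a generic illustration of that hint only, using synthetic data rather than the trilobite files (so it does not give away the exercise). The `density` keyword is assumed to be available (recent matplotlib; older versions used `normed`).

# In[ ]:

# Cumulative histogram of a synthetic sample: an empirical CDF.
demo = stats.norm.rvs(size=500)
pylab.hist(demo, bins=50, density=True, cumulative=True, histtype='step')
pylab.show()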