Introduction to statistics for Geoscientists (with Python)

Lecturer: Gerard Gorman

Lecture 8: Chi-squared tests and some miscellanea

URL: http://ggorman.github.io/Introduction-to-stats-for-geoscientists/

The Chi-squared test (or $\chi^2$ test)

The Chi-Squared test is used for discrete (categorised) data, e.g.:

Foot length – continuous.
Shoe size – discrete.

Discrete geological data might be:

Fossil type (species A, species B, etc).
Rock classification (sandstone, limestone, mudstone, etc).
Fault type (normal, thrust, strike-slip, etc).

The Chi-squared test provides a way of assessing how likely it is that counts of discrete data fit some expected pattern.

Chi-squared example

We have many trilobite fossils from one deposit:

Fossils are moults.
Have cranidia, librigena, and pygidia.
Should have ratio of 1:2:1.

Does our data depart from this? If it does we can infer a taphonomic bias - probably current-sorting.

[Figure: trilobite]

Chi-squared example

Chi-squared test requires an observed/expected table of this form:

	Observation count	Expected (based on 1:2:1 ratio)
Cranidia	20	17.25
Librigena	32	34.5
Pygidia	17	17.25
Total	69	69
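
As a minimal sketch of where the expected column comes from, the 1:2:1 ratio can be scaled so that it sums to the observed total. The variable names used here (obs, ratio, exp) are illustrative and are not part of the original notes.

```python
import numpy as np

# Observed counts of cranidia, librigena and pygidia (from the table above)
obs = np.array([20, 32, 17])

# Expected 1:2:1 ratio of the three elements
ratio = np.array([1, 2, 1])

# Scale the ratio so that the expected counts sum to the observed total
exp = obs.sum() * ratio / ratio.sum()
print(exp)
```

Scaling in this way guarantees that the observed and expected totals agree, as in the Total row of the table.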

Python provides a chi-squared test via the function scipy.stats.chisquare, which tests the null hypothesis that the categorical data has the given frequencies.

In [1]:

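import numpy as np
from scipy import stats

# Observed counts of cranidia, librigena and pygidia
Obs  = np.array([20, 32, 17])
# Expected counts from the 1:2:1 ratio (total of 69)
Exp  = np.array([17.25, 34.5, 17.25])
# Returns the Chi-squared statistic and the p-value
s, p = stats.chisquare(Obs, Exp)
print("p-value = {:.12g}".format(p))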

p-value = 0.732278624487

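The p-value is well above the usual 5% significance level, so we cannot reject the null hypothesis - the data are consistent with the expected frequencies, and there is no evidence of a taphonomic bias.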
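
To see what the function is doing, the statistic can be reproduced by hand as $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ and compared against the Chi-squared distribution. The following is a sketch that assumes the degrees of freedom are the number of categories minus one (no parameters were estimated from the data); the names chi2_manual, dof and p_manual are illustrative.

```python
import numpy as np
from scipy import stats

Obs = np.array([20, 32, 17])
Exp = np.array([17.25, 34.5, 17.25])

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_manual = np.sum((Obs - Exp) ** 2 / Exp)

# Degrees of freedom: number of categories minus one (assumption stated above)
dof = len(Obs) - 1

# p-value: probability of a statistic at least this large if the null hypothesis holds
p_manual = stats.chi2.sf(chi2_manual, dof)

# Compare with the library routine
s, p = stats.chisquare(Obs, Exp)
print("manual: chi2 = {:.4f}, p = {:.4f}".format(chi2_manual, p_manual))
print("scipy:  chi2 = {:.4f}, p = {:.4f}".format(s, p))
```

Both routes should agree, which makes it clear that the p-value is an upper-tail probability of the Chi-squared distribution.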

Chi-squared assumptions

The Chi-squared test is quite broadly applicable. There is no requirement for anything to be normal, but:

No expected count should be less than 1 (it does not matter what the observed values are).
No more than one-fifth of the expected counts should be less than 5.
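
These two rules of thumb are easy to check in code before running the test. A minimal sketch follows; check_expected_counts is a hypothetical helper written for illustration, not a SciPy function.

```python
import numpy as np

def check_expected_counts(exp):
    """Return True if the expected counts satisfy both rules of thumb."""
    exp = np.asarray(exp, dtype=float)
    # Rule 1: no expected count below 1
    none_below_one = np.all(exp >= 1)
    # Rule 2: at most one-fifth of the expected counts below 5
    n_below_five = np.sum(exp < 5)
    at_most_one_fifth = 5 * n_below_five <= exp.size
    return bool(none_below_one and at_most_one_fifth)

# Expected counts from the trilobite example
print(check_expected_counts([17.25, 34.5, 17.25]))
# A hypothetical set that breaks rule 1
print(check_expected_counts([0.5, 10.0, 12.0]))
```

Note that only the expected counts are inspected; the observed counts play no part in the check.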

Exercise 8.1: Chi-squared Test

Determine whether marks classifications for a course are atypical

Analysis of 2000 overall course marks from ESESIS shows that the typical marks breakdown is as follows:

Fail: 4.3%
3rd: 9.5%
2ii: 18.4%
2i: 38.4%
1st: 29.4%

Now consider the following distribution of results from two different groups of students:

Grade	Students - group 1	Students - group 2
Failed	3	0
3rd	10	8
2ii	23	7
2i	30	25
1st	20	39

Consider each group in turn - are their results atypical?

Tip 1: The chi-squared test is used to determine whether counts of discrete observations fit a predetermined pattern of expectations (see lecture notes).

scipy.stats.chisquare(Obs,Exp) – carry out a Chi-squared test on arrays Obs (Observed values) and Exp (Expected values)

Returns a tuple of (s_statistic, p_value), where s_statistic is the value of Chi-squared (as normal you can probably ignore this), and p_value is the probability of obtaining a Chi-squared value at least this large by chance if the observations really do follow the expectations. Chi-squared tests are non-directional – they only ever test whether observed and expected counts differ – there is no concept of ‘direction’ of difference.

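IMPORTANT – the function is designed to work with numpy arrays. Plain Python lists are converted automatically, but arrays make the arithmetic in Tip 3 much easier, so use the numpy array function to convert. Usage example: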

from numpy import array
from scipy.stats import chisquare

Obs = array([20, 32, 17])

Exp = array([17.25, 34.5, 17.25])

s, p = chisquare(Obs, Exp)

Tip 2: State your hypothesis.

H0: Course has expected classification breakdown

H1: Course classification breakdown does not follow expected pattern

Tip 3: To calculate Chi-squared you need a list of expected values.

Find the total of the observed counts (you can use the built-in sum function to do this), multiply this by each percentage value given above, and divide by 100.
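
The recipe can be sketched as follows. The counts and percentages are hypothetical and deliberately not the exercise data; the names obs, percent and exp are illustrative.

```python
import numpy as np

# Hypothetical observed counts and expected percentages (not the exercise data)
obs = np.array([12, 30, 18])
percent = np.array([25.0, 50.0, 25.0])

# The percentages should account for every category
assert np.isclose(percent.sum(), 100.0)

# Expected count = total number of observations * percentage / 100
total = sum(obs)
exp = total * percent / 100.0
print(exp)
```

Because percent is a numpy array, the multiplication and division apply to every category at once.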

Tip 4: Check the test is valid.

You need to check two things:

None of your expected values should be less than 1.
With five categories, no more than one of them (i.e. 1/5th) should be less than 5.

In [3]:

# solution here

Exercise 8.2

Every day, you visit the JCR, Library Cafe, College Cafe and all the other Taste Imperial outlets, and count how many Chicken and Bacon baguettes they have on sale; how many Ham and Cheese baguettes there are; and how many Carrot and Hommous baguettes there are. You record the numbers in a nice table:

Day\Baguette	C&B	H&C	C&H
Monday	32	35	38
Tuesday	20	18	30
Wednesday	27	29	8
Thursday	16	19	10
Friday	22	27	20

You have procured all this information because you read somewhere that, supposedly, 20 of each type are being added by Taste Imperial each day and that approximately 20 of each are eaten each day. You realise the ideal distribution should be:

Day\Baguette	C&B	H&C	C&H
Monday	20	20	20
Tuesday	20	20	20
Wednesday	20	20	20
Thursday	20	20	20
Friday	20	20	20

Perform a chi-squared test and see if reality matches the statistic that you read about.

(Note: All the above numbers have been invented and may not be anywhere close to the actual values)
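
One practical point for tabular data: by default scipy.stats.chisquare works along axis 0, so a two-dimensional array yields one result per column. Passing axis=None treats every cell as part of a single test. The sketch below uses a small hypothetical table (not the baguette data), with counts chosen so that the observed and expected totals agree, which the test presumes.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts; each column sums to 40, as does each expected column
Obs = np.array([[18, 25],
                [22, 15]])
Exp = np.full(Obs.shape, 20.0)

# Default (axis=0): one Chi-squared test per column
s_cols, p_cols = stats.chisquare(Obs, Exp)
print("per column: ", s_cols, p_cols)

# axis=None: every cell is treated as one category of a single test
s_all, p_all = stats.chisquare(Obs, Exp, axis=None)
print("whole table:", s_all, p_all)
```

Which of the two is appropriate depends on the hypothesis being tested, so state the hypothesis first, as in Tip 2 of Exercise 8.1.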

In [ ]:

# solution here