The Chi-Squared test is used for discrete (categorised) data, e.g.:
Discrete geological data might be:
The Chi-squared test provides a way of assessing how likely it is that counts of discrete data fit some expected pattern.
We have many trilobite fossils from one deposit:
Do our data depart from the expected 1:2:1 ratio of cranidia to librigenae to pygidia? If they do, we can infer a taphonomic bias, probably current-sorting.
The chi-squared test requires an observed/expected table of this form:
Element | Observed count | Expected (based on 1:2:1 ratio) |
---|---|---|
Cranidia | 20 | 17.25 |
Librigenae | 32 | 34.5 |
Pygidia | 17 | 17.25 |
Total | 69 | 69 |
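The expected column in the table comes from sharing the observed total out in the 1:2:1 ratio. A minimal sketch of that arithmetic, assuming NumPy is available (the variable names here are illustrative, not part of the original notes):

```python
import numpy as np

# Observed counts: cranidia, librigenae, pygidia
observed = np.array([20, 32, 17])

# Expected 1:2:1 ratio between the three elements
ratio = np.array([1, 2, 1])

# Share the observed total (69) out according to the ratio
expected = observed.sum() * ratio / ratio.sum()
print(expected)
```

The expected counts always add up to the same total as the observed counts, because they are just a different way of dividing the same number of fossils.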
Python provides a chi-squared test via the function scipy.stats.chisquare, which tests the null hypothesis that the categorical data have the given expected frequencies.
import numpy as np
from scipy import stats
Obs = np.array([20, 32, 17])
Exp = np.array([17.25, 34.5, 17.25])
s, p = stats.chisquare(Obs, Exp)
print("p-value = {}".format(p))
p-value = 0.732278624487
The p-value is far above the usual 0.05 threshold, so we cannot reject the null hypothesis: the sample is consistent with the expected frequencies.
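To see what the test does, the same result can be reproduced by hand. The sketch below assumes the standard Pearson statistic (the sum of (observed - expected)² / expected) with degrees of freedom equal to the number of categories minus one, and uses scipy.stats.chi2.sf for the upper-tail probability. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

obs = np.array([20, 32, 17])
exp = np.array([17.25, 34.5, 17.25])

# Pearson chi-squared statistic: sum of (O - E)^2 / E over all categories
chi2_stat = ((obs - exp) ** 2 / exp).sum()

# Degrees of freedom: number of categories minus one
dof = len(obs) - 1

# p-value: probability of a statistic at least this large under H0
p_manual = stats.chi2.sf(chi2_stat, dof)

# Compare with the library function
s, p = stats.chisquare(obs, exp)
print(chi2_stat, s)
print(p_manual, p)
```

Both routes should give the same statistic and p-value. The hand calculation is only there to show where the numbers come from.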
The chi-squared test is broadly applicable. There is no requirement for anything to be normally distributed, but:
Analysis of 2000 overall course marks from ESESIS shows that the typical marks breakdown is as follows:
Grade | Typical share of marks |
---|---|
Fail | 4.3% |
3rd | 9.5% |
2ii | 18.4% |
2i | 38.4% |
1st | 29.4% |
Now consider the following distribution of results from two different groups of students:
Grade | Students - group 1 | Students - group 2 |
---|---|---|
Failed | 3 | 0 |
3rd | 10 | 8 |
2ii | 23 | 7 |
2i | 30 | 25 |
1st | 20 | 39 |
Consider each group in turn - are their results atypical?
scipy.stats.chisquare(Obs,Exp) – carry out a Chi-squared test on arrays Obs (Observed values) and Exp (Expected values)
Returns a tuple of (s_statistic, p_value), where s_statistic is the value of chi-squared (you can usually ignore this) and p_value is the probability of seeing a discrepancy between observed and expected counts at least this large purely by chance if the null hypothesis is true. A large p_value therefore means the observations are consistent with the expectations. Chi-squared tests have no concept of ‘direction’: they only ever test whether observed and expected differ, not which way.
NOTE: the function is documented as taking array-like input, so NumPy arrays are the natural choice. Use the numpy array function to convert a Python list. Usage example:
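from numpy import array
from scipy.stats import chisquare
Obs = array([20, 32, 17])            # observed counts
Exp = array([17.25, 34.5, 17.25])    # expected counts
s, p = chisquare(Obs, Exp)           # s: chi-squared statistic, p: p-value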
H0: Course has expected classification breakdown
H1: Course classification breakdown does not follow expected pattern
To calculate the expected counts for a group, find the total of its observed values (you can use the built-in sum function to do this), multiply this by each percentage value given above, and divide by 100.
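As a sketch of that calculation for group 1 only (the variable names are illustrative, and the test itself is left for you to run):

```python
import numpy as np

# Typical breakdown in percent: Fail, 3rd, 2ii, 2i, 1st
percentages = np.array([4.3, 9.5, 18.4, 38.4, 29.4])

# Observed counts for group 1, in the same order
group1 = np.array([3, 10, 23, 30, 20])

# Expected count = group total * percentage / 100
expected1 = group1.sum() * percentages / 100
print(expected1)
print(expected1.sum())  # should equal the group total
```

The same three lines work for group 2 once its observed counts are substituted.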
You need to check two things:
# solution here
Every day, you visit the JCR, Library Cafe, College Cafe and all the other Taste Imperial outlets, and count how many Chicken and Bacon baguettes, Ham and Cheese baguettes, and Carrot and Hommous baguettes they have on sale. You record the numbers in a nice table:
Day\Baguette | C&B | H&C | C&H |
---|---|---|---|
Monday | 32 | 35 | 38 |
Tuesday | 20 | 18 | 30 |
Wednesday | 27 | 29 | 8 |
Thursday | 16 | 19 | 10 |
Friday | 22 | 27 | 20 |
You have procured all this information because you read somewhere that, supposedly, 20 of each type are being added by Taste Imperial each day and that approximately 20 of each are eaten each day. You realise the ideal distribution should be:
Day\Baguette | C&B | H&C | C&H |
---|---|---|---|
Monday | 20 | 20 | 20 |
Tuesday | 20 | 20 | 20 |
Wednesday | 20 | 20 | 20 |
Thursday | 20 | 20 | 20 |
Friday | 20 | 20 | 20 |
Perform a chi-squared test and see if reality matches the statistic that you read about.
(Note: All the above numbers have been invented and may not be anywhere close to the actual values)
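One possible starting point, as a sketch rather than a full solution: put both tables into 2-D NumPy arrays, flatten them so that each day/baguette combination is one category, and compare the totals. The names below are illustrative. Note that the observed and expected totals differ here, and recent SciPy releases check that the two totals agree before running chisquare, so this sketch computes the statistic directly from its definition instead.

```python
import numpy as np

# Rows: Monday to Friday; columns: C&B, H&C, C&H
observed = np.array([
    [32, 35, 38],
    [20, 18, 30],
    [27, 29, 8],
    [16, 19, 10],
    [22, 27, 20],
])

# Ideal distribution: 20 of each type on every day
expected = np.full(observed.shape, 20)

# Flatten so that every cell of the table is one category
obs_flat = observed.ravel()
exp_flat = expected.ravel()

print(obs_flat.sum(), exp_flat.sum())  # totals are not equal

# Pearson statistic from its definition: sum of (O - E)^2 / E
statistic = ((obs_flat - exp_flat) ** 2 / exp_flat).sum()
print(statistic)
```

How to turn this statistic into a p-value, and whether the mismatch in totals matters for your conclusion, is part of the exercise.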
# solution here