2.19 Statistics for Geoscientists class test - 14:00-17:00 29th May 2015¶

Test instructions¶

This test contains 5 questions each of which should be answered.
Write your program in a Python cell just under each question.
You should write an explanation of your solution as comments in your code.
In each case your solution program must fulfil all of the instructions - please check the instructions carefully and double check that your program fulfils all of the given instructions.
Save your work regularly.
At the end of the test you should email your IPython notebook document (i.e. this document) to Gerard J. Gorman at g.gorman@imperial.ac.uk

Question preamble¶

You hypothesise that there may be a link between temperature and the level of green house gasses in the atmosphere. As part of your investigation to establish if there is a correlation you analyise ice core data taken from Vostok Station in Antarctica. The data file that you will be using, VostokStation.csv, constains reconstructed temperature, CO2 gas concentration and CH4 gas concentrations stretching back 160,000 years.

Question 1¶

Read in the data from the file VostokStation.csv into NumPy arrays called year, temperature, co2 and ch4. [10 marks]
Print out the mean, minimum and maximum temperature, CO2 concentration and CH4 contentration from this data. [10 marks]

In [1]:

# Real start to solution.
%pylab inline
import scipy
from scipy import stats

fh = open("../data/VostokStation.csv", "r")
records = np.recfromcsv(fh)
print records.dtype.names

Populating the interactive namespace from numpy and matplotlib
('ice_age_ka', 'temperature_degrees_c', 'co2_ppmv', 'ch4_ppbv')

In [2]:

year = np.array(records['ice_age_ka'], dtype=float)
temperature = np.array(records['temperature_degrees_c'], dtype=float)
co2 = np.array(records['co2_ppmv'], dtype=float)
ch4 = np.array(records['ch4_ppbv'], dtype=float)

print "Temperature (min, max): ", min(temperature), max(temperature)
print "CO2 (min, max): ", min(co2), max(co2)
print "CH4 (min, max): ", min(ch4), max(ch4)

Temperature (min, max):  -65.2717 -52.33
CO2 (min, max):  179.86 291.71
CH4 (min, max):  318.25 693.41

Question 2¶

Plot line graphs of temperature, CO2 concentration and CH4 concentration against year. [10 marks]
Remember to label all your axes. [10 marks]

In [3]:

plot(-year, temperature)
legend(("temp",))

Out[3]:

<matplotlib.legend.Legend at 0x10446e3d0>

In [4]:

plot(-year, co2/np.mean(co2))
plot(-year, ch4/np.mean(ch4))
legend(("co2", "ch4"))

Out[4]:

<matplotlib.legend.Legend at 0x107761d10>

Question 3¶

Create scatter plots for both: temperature vs CO2 concentration; and temperature vs CH4 concentration. [5 marks]
Using linear regression, fit a straight line to each set of data and plot this line on the same graph. [5 marks]
What does the p-value from linear regression tell you? [5 marks]

Remember to label the x-axis and y-axis. [5 marks]

In [5]:

scatter(temperature, co2)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temperature, co2)
temp_co2_fit = slope*temperature+intercept
plot(temperature, temp_co2_fit)

print "p-value for linear regression: ", p_value

p-value for linear regression:  5.50525305548e-45

In [6]:

scatter(temperature, ch4)

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temperature, ch4)
temp_ch4_fit = slope*temperature+intercept
plot(temperature, temp_ch4_fit)

print "p-value for linear regression: ", p_value

p-value for linear regression:  1.07778088456e-34

Question 4¶

Plot a histogram of the difference between CO2 concentration values and the line fit in the scatter plot above. [10 marks]
Use scipy.stats.normaltest to determine if this data is normally distributed. [10 marks]

HINT: Take, for example, CO2 vs temperature. When you use linear regression to fit a straight line $y = mx+c$, where $m$ is the slope and $c$ is the $intercept$. The variation in C02 around the straight line model is therefore:

co2_variation = co2 - (m*temperature+c)

where co2_variation, co2 and temperature are all NumPy arrays.

In [7]:

hist(co2, 20, normed=1)

Out[7]:

(array([ 0.00186261,  0.00186261,  0.01769483,  0.01862614,  0.0130383 ,
         0.0130383 ,  0.00651915,  0.0139696 ,  0.0139696 ,  0.0130383 ,
         0.01024437,  0.00465653,  0.00465653,  0.00838176,  0.00558784,
         0.01024437,  0.0139696 ,  0.00558784,  0.        ,  0.00186261]),
 array([ 179.86  ,  185.4525,  191.045 ,  196.6375,  202.23  ,  207.8225,
         213.415 ,  219.0075,  224.6   ,  230.1925,  235.785 ,  241.3775,
         246.97  ,  252.5625,  258.155 ,  263.7475,  269.34  ,  274.9325,
         280.525 ,  286.1175,  291.71  ]),
 <a list of 20 Patch objects>)

In [8]:

hist(co2-temp_co2_fit, 20, normed=1)
pvalue = scipy.stats.normaltest(co2-temp_co2_fit)[1]
print "p-value ", pvalue
print "Normal distribution: ",
if pvalue<0.05:
    print "No"
else:
    print "Yes"

p-value  0.094192591239
Normal distribution:  Yes

In [9]:

hist(ch4-temp_ch4_fit, 20, normed=1)
pvalue = scipy.stats.normaltest(ch4-temp_ch4_fit)[1]
print "p-value ", pvalue
print "Normal distribution: ",
if pvalue<0.05:
    print "No"
else:
    print "Yes"

p-value  2.10031574793e-08
Normal distribution:  No

Question 5¶

Using the approperiate correlation test in each case, determine if there is a correlation between: temperature and CO2 concentration; temperature and CH4 concentration. Explain both your choice of correlation statistic and your conclusion.

Hint. Use the scipy.stats.normaltest to check if the samples are normally distributed.

Right answer with right tests. [5 marks]
Clear explination of decisions. [5 marks]
Quality of programming. For example, do they use a function rather than cut and pasting. Are if statements being used to control decisions? [10 marks]

In [10]:

pvalue_temp_normtest = scipy.stats.normaltest(temperature)[1]

def test_correlation(data):
    pvalue = scipy.stats.normaltest(data)[1]
    if pvalue<0.05 and pvalue_temp_normtest<0.05:
        print "Using Spearman's correlation statistic."
        rvalue, pvalue = scipy.stats.spearmanr(temperature, data)
        print "r-value: ", rvalue
        print "Prabability this correlation is by chance: ", pvalue
        if pvalue<0.05:
            print "Accept hypothesis that a correlation exists."
        else:
            print "Accept hypothesis that a correlation exists."
    else:
        print "Using Pearson's correlation statistic."
        rvalue, pvalue = scipy.stats.pearsonr(temperature, data)
        print "r-value: ", rvalue
        print "Prabability this correlation is by chance: ", pvalue
        if pvalue<0.05:
            print "Accept hypothesis that a correlation exists."
        else:
            print "Accept hypothesis that a correlation exists."


print "Testing for correlation between CO2 and temperature."
test_correlation(co2)

print "\nTesting for correlation between CH4 and temperature."
test_correlation(ch4)

Testing for correlation between CO2 and temperature.
Using Spearman's correlation statistic.
r-value:  0.74440421002
Prabability this correlation is by chance:  3.65140158875e-35
Accept hypothesis that a correlation exists.

Testing for correlation between CH4 and temperature.
Using Spearman's correlation statistic.
r-value:  0.70776821251
Prabability this correlation is by chance:  1.71684683693e-30
Accept hypothesis that a correlation exists.

In [10]: