You hypothesise that there may be a link between temperature and the level of green house gasses in the atmosphere. As part of your investigation to establish if there is a correlation you analyise ice core data taken from Vostok Station in Antarctica. The data file that you will be using, VostokStation.csv, constains reconstructed temperature, CO2 gas concentration and CH4 gas concentrations stretching back 160,000 years.
# Real start to solution.
%pylab inline
import scipy
from scipy import stats
fh = open("../data/VostokStation.csv", "r")
records = np.recfromcsv(fh)
print records.dtype.names
Populating the interactive namespace from numpy and matplotlib ('ice_age_ka', 'temperature_degrees_c', 'co2_ppmv', 'ch4_ppbv')
year = np.array(records['ice_age_ka'], dtype=float)
temperature = np.array(records['temperature_degrees_c'], dtype=float)
co2 = np.array(records['co2_ppmv'], dtype=float)
ch4 = np.array(records['ch4_ppbv'], dtype=float)
print "Temperature (min, max): ", min(temperature), max(temperature)
print "CO2 (min, max): ", min(co2), max(co2)
print "CH4 (min, max): ", min(ch4), max(ch4)
Temperature (min, max): -65.2717 -52.33 CO2 (min, max): 179.86 291.71 CH4 (min, max): 318.25 693.41
plot(-year, temperature)
legend(("temp",))
<matplotlib.legend.Legend at 0x10446e3d0>
plot(-year, co2/np.mean(co2))
plot(-year, ch4/np.mean(ch4))
legend(("co2", "ch4"))
<matplotlib.legend.Legend at 0x107761d10>
Remember to label the x-axis and y-axis. [5 marks]
scatter(temperature, co2)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temperature, co2)
temp_co2_fit = slope*temperature+intercept
plot(temperature, temp_co2_fit)
print "p-value for linear regression: ", p_value
p-value for linear regression: 5.50525305548e-45
scatter(temperature, ch4)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(temperature, ch4)
temp_ch4_fit = slope*temperature+intercept
plot(temperature, temp_ch4_fit)
print "p-value for linear regression: ", p_value
p-value for linear regression: 1.07778088456e-34
HINT: Take, for example, CO2 vs temperature. When you use linear regression to fit a straight line $y = mx+c$, where $m$ is the slope and $c$ is the $intercept$. The variation in C02 around the straight line model is therefore:
co2_variation = co2 - (m*temperature+c)
where co2_variation, co2 and temperature are all NumPy arrays.
hist(co2, 20, normed=1)
(array([ 0.00186261, 0.00186261, 0.01769483, 0.01862614, 0.0130383 , 0.0130383 , 0.00651915, 0.0139696 , 0.0139696 , 0.0130383 , 0.01024437, 0.00465653, 0.00465653, 0.00838176, 0.00558784, 0.01024437, 0.0139696 , 0.00558784, 0. , 0.00186261]), array([ 179.86 , 185.4525, 191.045 , 196.6375, 202.23 , 207.8225, 213.415 , 219.0075, 224.6 , 230.1925, 235.785 , 241.3775, 246.97 , 252.5625, 258.155 , 263.7475, 269.34 , 274.9325, 280.525 , 286.1175, 291.71 ]), <a list of 20 Patch objects>)
hist(co2-temp_co2_fit, 20, normed=1)
pvalue = scipy.stats.normaltest(co2-temp_co2_fit)[1]
print "p-value ", pvalue
print "Normal distribution: ",
if pvalue<0.05:
print "No"
else:
print "Yes"
p-value 0.094192591239 Normal distribution: Yes
hist(ch4-temp_ch4_fit, 20, normed=1)
pvalue = scipy.stats.normaltest(ch4-temp_ch4_fit)[1]
print "p-value ", pvalue
print "Normal distribution: ",
if pvalue<0.05:
print "No"
else:
print "Yes"
p-value 2.10031574793e-08 Normal distribution: No
Using the approperiate correlation test in each case, determine if there is a correlation between: temperature and CO2 concentration; temperature and CH4 concentration. Explain both your choice of correlation statistic and your conclusion.
Hint. Use the scipy.stats.normaltest to check if the samples are normally distributed.
pvalue_temp_normtest = scipy.stats.normaltest(temperature)[1]
def test_correlation(data):
pvalue = scipy.stats.normaltest(data)[1]
if pvalue<0.05 and pvalue_temp_normtest<0.05:
print "Using Spearman's correlation statistic."
rvalue, pvalue = scipy.stats.spearmanr(temperature, data)
print "r-value: ", rvalue
print "Prabability this correlation is by chance: ", pvalue
if pvalue<0.05:
print "Accept hypothesis that a correlation exists."
else:
print "Accept hypothesis that a correlation exists."
else:
print "Using Pearson's correlation statistic."
rvalue, pvalue = scipy.stats.pearsonr(temperature, data)
print "r-value: ", rvalue
print "Prabability this correlation is by chance: ", pvalue
if pvalue<0.05:
print "Accept hypothesis that a correlation exists."
else:
print "Accept hypothesis that a correlation exists."
print "Testing for correlation between CO2 and temperature."
test_correlation(co2)
print "\nTesting for correlation between CH4 and temperature."
test_correlation(ch4)
Testing for correlation between CO2 and temperature. Using Spearman's correlation statistic. r-value: 0.74440421002 Prabability this correlation is by chance: 3.65140158875e-35 Accept hypothesis that a correlation exists. Testing for correlation between CH4 and temperature. Using Spearman's correlation statistic. r-value: 0.70776821251 Prabability this correlation is by chance: 1.71684683693e-30 Accept hypothesis that a correlation exists.