For the first_year_marks.csv dataset, calculate r and p values using both Pearson’s and Spearman’s correlation methodologies.
%pylab inline
import numpy as np
# Read in the records.
record = np.genfromtxt("../data/first_year_marks.csv", delimiter=",", names=True, case_sensitive="lower")
print(record.dtype.names)
Populating the interactive namespace from numpy and matplotlib
('field_mark', 'overall_year')
field_mark = np.array(record["field_mark"], dtype=float)
overall_year = np.array(record["overall_year"], dtype=float)
from scipy import stats
print("Pearson's r and p values: %g, %g" % stats.pearsonr(field_mark, overall_year))
print("Spearman's r and p values: %g, %g" % stats.spearmanr(field_mark, overall_year))
Pearson's r and p values: 0.269354, 0.0144025
Spearman's r and p values: 0.18995, 0.087402
Are these data correlated at all? If so, is the correlation weak or strong?
Pearson's correlation suggests a very weak correlation (r ≈ 0.27), while Spearman's correlation suggests little or no correlation (r ≈ 0.19). We already know from the lecture that the marks do not follow a normal distribution (indeed, this is something we are going to show next), so Spearman's correlation is the appropriate measure here.
Using Spearman's correlation coefficient, we therefore find at most a very weak correlation. Moreover, the relatively high p value (≈ 0.087, above the conventional 0.05 threshold) means there is not sufficient evidence to conclude that any correlation exists. More details in the next lecture.
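To see why Spearman's method is less sensitive to non-normal data, it can help to recall that Spearman's coefficient is Pearson's r computed on the ranks of the values rather than on the values themselves. The following is a minimal sketch of that relationship. It uses synthetic, skewed stand-in data (the arrays x and y are made up for illustration and are not the first_year_marks.csv values).

```python
import numpy as np
from scipy import stats

# Synthetic, skewed stand-in data (an assumption for illustration only;
# these are NOT the marks from first_year_marks.csv).
rng = np.random.RandomState(0)
x = rng.exponential(scale=10.0, size=80)
y = 0.3 * x + rng.exponential(scale=10.0, size=80)

# Pearson's r measures linear association between the raw values.
pearson_r, pearson_p = stats.pearsonr(x, y)

# Spearman's coefficient measures monotonic association between the ranks.
spearman_r, spearman_p = stats.spearmanr(x, y)

# Ranking the data first and then applying Pearson's formula
# reproduces Spearman's coefficient.
rank_r, rank_p = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print("Pearson's r and p values: %g, %g" % (pearson_r, pearson_p))
print("Spearman's r and p values: %g, %g" % (spearman_r, spearman_p))
print("Pearson's r on the ranks: %g" % rank_r)
```

Because only the ordering of the marks enters the calculation, a few extreme values or a skewed distribution affect Spearman's coefficient far less than they affect Pearson's r.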
Plot histograms of each of the two variables, and overlay normal curves to the histograms – how well do they match?
n, bins, patches = pylab.hist(field_mark, density=True)
# Add a 'best fit' line: a normal curve with the sample mean and standard deviation.
sigma = np.std(field_mark)
mu = np.mean(field_mark)
y = stats.norm.pdf(bins, mu, sigma)
l = pylab.plot(bins, y, 'r--', linewidth=1)
pylab.xlabel("Final percentage mark")
pylab.ylabel("Proportion of students achieving mark")
pylab.show()
n, bins, patches = pylab.hist(overall_year, density=True)
# Add a 'best fit' line: a normal curve with the sample mean and standard deviation.
sigma = np.std(overall_year)
mu = np.mean(overall_year)
y = stats.norm.pdf(bins, mu, sigma)
l = pylab.plot(bins, y, 'r--', linewidth=1)
pylab.xlabel("Final percentage mark")
pylab.ylabel("Proportion of students achieving mark")
pylab.show()
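The visual comparison above can also be put into rough numbers. The sketch below evaluates the fitted normal curve at the centre of each histogram bin and reports the largest gap between the curve and the bar heights. The helper normal_fit_gap is a hypothetical name introduced here, the two samples are synthetic stand-ins rather than the marks data, and the gap is only a crude summary, not a formal normality test.

```python
import numpy as np
from scipy import stats

def normal_fit_gap(data, bins=10):
    """Largest absolute gap between a density histogram and a fitted normal curve.

    Hypothetical helper for illustration; the normal curve uses the
    sample mean and standard deviation, as in the plots above.
    """
    data = np.asarray(data, dtype=float)
    # Histogram normalised so that the bar areas sum to 1.
    density, edges = np.histogram(data, bins=bins, density=True)
    centres = 0.5 * (edges[:-1] + edges[1:])
    mu = np.mean(data)
    sigma = np.std(data)
    # Normal probability density evaluated at each bin centre.
    fitted = stats.norm.pdf(centres, mu, sigma)
    return np.max(np.abs(density - fitted))

# Synthetic stand-in samples (assumptions, NOT the marks data):
# one drawn from a normal distribution, one strongly skewed.
rng = np.random.RandomState(1)
normal_sample = rng.normal(loc=60.0, scale=10.0, size=5000)
skewed_sample = rng.exponential(scale=10.0, size=5000)

gap_normal = normal_fit_gap(normal_sample)
gap_skewed = normal_fit_gap(skewed_sample)
print("Largest gap, normal sample: %g" % gap_normal)
print("Largest gap, skewed sample: %g" % gap_skewed)
```

Calling normal_fit_gap(field_mark) and normal_fit_gap(overall_year) would give the corresponding figures for the two variables; a large gap relative to the bar heights indicates that the normal curve is a poor match.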