So far we have looked at univariate statistics, i.e. one variable forming one distribution. Bivariate statistics look at two variables together.
The idea is to look for a relationship between the variables.
If we can show there is a relationship, then we can start to look for possible causes. To study bivariate problems visually we use scatter plots.
Here we are going to create a scatter plot of two variables (you will need to download the data file first_year_marks.csv).
%matplotlib inline
%cat first_year_marks.csv
import numpy as np
# Read in the records (remember from previous lecture - do not re-invent the wheel).
record = np.recfromcsv("./first_year_marks.csv")
print record.dtype.names
('field_mark', 'overall_year')
import matplotlib.pyplot as plt
# Now that we have the CSV headers, read records into numpy arrays.
fieldmarks = np.array(record["field_mark"], dtype=float)
yearmarks = np.array(record["overall_year"], dtype=float)
# 'bx' means blue 'x' markers
# zorder=3 to make sure this is in front of grid
plt.plot(fieldmarks, yearmarks, 'bx', markersize=5, markeredgewidth=2, zorder=3)
# Not labeling a graph is unforgivable.
plt.ylabel("Overall Year Mark",weight='bold')
plt.xlabel("Mark for Dorset",weight='bold')
# The rest is icing
plt.xlim(0, 100) # both scales 0-100
plt.ylim(0, 100)
# The tick labels are quite long and take up a lot of space. To improve
# the readability of our code we can store them as lists named:
# xticktitles and yticktitles
xticktitles = ["0%", "3rd/Fail", "2ii/3rd", "2i/2ii", "2i/1st", "100%"]
yticktitles = ["3rd/Fail", "2ii/3rd", "2i/2ii", "2i/1st", "100%"]
# Set the ticks and labels at the degree-classification boundaries
plt.xticks([0, 40, 50, 60, 70, 100],xticktitles, rotation="vertical")
plt.yticks([40, 50, 60, 70, 100], yticktitles)
# Turn on grid with thin dashed green lines
plt.grid(linestyle='--', linewidth=.5, color='green',zorder=1)
# Titles
plt.title("Fieldwork and Year marks 2010-11",weight='bold')
# Show the chart
plt.show()
This only shows the relationship visually. While it appears from the plot that there is no strong relationship, can we be more quantitative? Are these two variables actually independent?
A common approach to bivariate data is to model it mathematically, for example by fitting a line to it (linear regression). This is not appropriate if there is only a weak relationship.
It is also possible to fit more complex curves, but in this course we will restrict ourselves to linear regression.
Many mathematical methods exist to find the best-fitting line. By far the most widely used is the least-squares method, which minimises the sum of the squares of the vertical errors. This can be computed in Python using the function scipy.stats.linregress, which returns the slope, $m$, and intercept, $c$, parameters for the equation of the line:
$$y = mx + c.$$
# Solution here
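Not part of the original notes, but as one possible sketch of how this could look, scipy.stats.linregress can be applied to the fieldmarks and yearmarks arrays loaded earlier:
from scipy import stats
# Fit a least-squares straight line to fieldwork mark vs overall year mark.
# linregress returns slope, intercept, r-value, p-value and standard error.
m, c, r_value, p_value, std_err = stats.linregress(fieldmarks, yearmarks)
print("slope m = %.3f, intercept c = %.3f" % (m, c))
# Overlay the fitted line y = m*x + c on a scatter of the data.
plt.plot(fieldmarks, yearmarks, 'bx', zorder=3)
plt.plot(fieldmarks, m * fieldmarks + c, 'r-', zorder=2)
plt.xlabel("Mark for Dorset")
plt.ylabel("Overall Year Mark")
plt.show()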
Plotting a scatter diagram gives a visual feel for how well two variables are related, or correlated. However, correlation can also be defined mathematically in terms of r-values - these have values in the range -1 to 1:
r-value | Interpretation |
---|---|
-1 | Perfect negative correlation |
-0.70 | Strong negative correlation |
-0.50 | Moderate negative correlation |
-0.30 | Weak negative correlation |
0 | No correlation |
0.30 | Weak correlation |
0.50 | Moderate correlation |
0.70 | Strong correlation |
1 | Perfect correlation |
There are two commonly used methods for calculating correlation coefficients: Pearson's correlation coefficient and Spearman's rank correlation coefficient.
To explain them and the differences between them we need to cover a few more basic concepts, namely probability, p-values and probability distributions (in particular the normal distribution).
The probability, $p$, of an event occurring is defined as a value in the range $[0, 1]$, where 0 means the event cannot occur and 1 means it is certain to occur.
A classic example is dice throwing. The probability of throwing a 6 in one throw of a fair die is $1/6$ $(\approx 0.167)$. The probabilities of all possible outcomes sum to 1.
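As a small illustrative sketch (not in the original notes), we can check this empirically by simulating many throws of a fair die with numpy:
import numpy as np
# Simulate 100000 throws of a fair six-sided die (values 1-6).
throws = np.random.randint(1, 7, 100000)
# The fraction of sixes should be close to 1/6 (about 0.167).
print("Fraction of sixes: %.3f" % np.mean(throws == 6))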
When we try to determine whether two variables are related/correlated we are immediately talking in terms of probability. It is possible that any apparent relationship has arisen by chance, so we try to work out how unlikely this is, expressed as a p-value.
We use histograms of samples to approximate the underlying probability distribution. With a probability-distribution curve we integrate the area under the curve to work out the probability of a sample falling within a range of values (we can't talk about the probability of getting one particular value... why not?). The integral under the whole curve is 1 (why?).
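To sketch this idea concretely (using the normal distribution introduced just below, and scipy's stats.norm.cdf, which gives the area under the curve up to a value):
from scipy import stats
# Probability that a sample from a standard normal falls between -1 and 1:
# the area under the curve over that range.
print("P(-1 < x < 1) = %.3f" % (stats.norm.cdf(1) - stats.norm.cdf(-1)))
# The area under the whole curve is 1; any single exact value has zero
# width, so its probability is zero.
print("Total area = %.3f" % (stats.norm.cdf(float("inf")) - stats.norm.cdf(float("-inf"))))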
Many real-world distributions approximate a ‘bell-shape’. This shape is known as a Normal or Gaussian distribution. It is very important in statistics and you are going to encounter it many times!
The normal distribution is defined mathematically as:
$$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where $\sigma$ is the standard deviation and $\mu$ is the mean. You do not need to know the formula by rote, but you do need to know that it depends only on the mean and standard deviation.
x = np.linspace(-4, 4, 100)
from scipy import stats
from numpy import sqrt, pi, exp
for mu, sigma in ((0, 1), (0, 0.7), (1, 1.5)):
    # Evaluate the normal pdf for this mean and standard deviation
    y = sqrt(1/(2*pi*sigma**2))*exp(-(x-mu)**2/(2*sigma**2))
    plt.plot(x, y)
plt.show()
Normal distributions are important as a lot of inferential statistical methods assume distributions are normal. However, no real-world distribution is completely normal. Often, people assume deviations from normal are minor - this can be problematic!
Consider the histogram of 2000 overall module marks from ESESIS (you will need to download the file 2000marks.csv).
# Read the 2000 marks into a numpy array of floats.
record = np.recfromtxt("./2000marks.csv")
marks = np.array(record, dtype=float)
# Histogram with 20 bins, normalised so the area under it is 1.
plt.hist(marks, 20, normed=1)
plt.xlabel("Student marks")
plt.show()
Is this plot normally distributed? Sort of, but notice:
Pearson's correlation coefficient is the most widely used method to get an r-value for correlation. We will not get into the mathematics involved here, but you need to be aware of its main limitation: it assumes that the distributions being compared are normally distributed.
To further complicate matters, it is unclear how much deviation from normality is allowed... more about this later.
You can calculate a Pearson r-value, and a probability value for it, using scipy.stats.pearsonr.
Say we calculated an r-value of 0.3 for two particular samples, and we want to know whether this comes from a true correlation, or might have happened by chance. For this we have the p-value. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have an r-value at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
# Let's look at a trivial example - let's generate data where x == y.
d1 = np.linspace(-1, 1, 1000)
d2 = np.linspace(-1, 1, 1000)
r, p = stats.pearsonr(d1, d2)
print "r-value (correlation coefficient)", r
print "p-value (probability that we have got it wrong)", p
r-value (correlation coefficient) 1.0
p-value (probability that we have got it wrong) 0.0
An alternative method for correlation is Spearman's rank correlation coefficient. This is a non-parametric statistic, meaning that it does not rely on the underlying data following any particular distribution.
You can calculate Spearman's rank correlation coefficient using scipy.stats.spearmanr.
In this case the p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Spearman correlation at least as extreme as the one computed from these datasets. As is the case with the Pearson correlation coefficient, the p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so.
Ideally, use Pearson's if the data are normally distributed.
# Again, let's try this with our trivial dataset
r, p = stats.spearmanr(d1, d2)
print "r-value (correlation coefficient)", r
print "p-value (probability that we have got it wrong)", p
r-value (correlation coefficient) 1.0
p-value (probability that we have got it wrong) 0.0
For the first_year_marks.csv dataset:
# Solution here
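The precise task is not spelled out above; as one possible sketch, the Pearson and Spearman coefficients for the fieldwork and year marks loaded earlier could be computed like this:
from scipy import stats
# Possible approach: correlate fieldwork marks with overall year marks
# (fieldmarks and yearmarks are the arrays read in at the start).
r_p, p_p = stats.pearsonr(fieldmarks, yearmarks)
r_s, p_s = stats.spearmanr(fieldmarks, yearmarks)
print("Pearson:  r = %.3f, p = %.3g" % (r_p, p_p))
print("Spearman: r = %.3f, p = %.3g" % (r_s, p_s))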
For this practical you will need the Shock response data for sand. This can be found in the file sand_swdb.txt, taken from the shock wave database.
For this practical you should make extensive use of the matplotlib documentation.