| Main page: | wiki.ubc.ca/Science:Math_Education_Resources |
|---|---|
| Contributors portal: | wiki.ubc.ca/Science:MER |
| Source of data: | Google Analytics, easiness ratings, exam scores |
| Contact info: | mer-wiki (at) math (dot) ubc (dot) ca |
First, let's read in the data:
%%capture
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
from MERhelpers import data_to_dict_clickdates_clickscount
from scipy import stats
%pylab inline
# We obtain `data/all_questions.csv` by scraping the HTML source of each question's page.
%less data/all_questions.csv
# This is done in a helper script:
#%run MER2csv.py
df_all = pd.read_csv(os.path.join('data','all_questions.csv'), header=None,
names=['url', 'course', 'exam', 'q_short', 'num_votes',
'rating', 'num_hints', 'num_sols'])
df_with_rating = df_all[df_all.num_votes > 0]
df_with_rating.head(5)
| | url | course | exam | q_short | num_votes | rating | num_hints | num_sols |
|---|---|---|---|---|---|---|---|---|
| 0 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (a) | 1 | 95 | 1 | 1 |
| 1 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (b) | 1 | 90 | 2 | 1 |
| 2 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (c) | 1 | 85 | 2 | 1 |
| 3 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (d) | 1 | 92 | 2 | 1 |
| 4 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 2 (a) | 2 | 90 | 2 | 1 |
To get a sense of what the rating data looks like, let's aggregate it into a histogram.
all_avg_ratings = df_with_rating['rating'].astype(float).values
plt.hist(all_avg_ratings, bins=101)
plt.title('%d easiness votes' % sum(df_with_rating['num_votes']))
plt.xlabel('Average easiness rating (0=hard, 100=easy)')
plt.ylabel('count')
plt.annotate('Mean easiness: %.1f \nMedian easiness: %.1f' % (np.mean(all_avg_ratings),
np.median(all_avg_ratings)),
xy=(20, 40))
plt.show()
We see that 100% easiness is by far the most popular single rating. In general, the majority of questions are rated on the easier side. Also interesting are the spikes at 0, 20, 40, 50, 60, 70, 80, 90, and 95, which indicate that many students vote on a simpler 11-point scale instead of using all 101 provided options.
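A quick way to quantify this (a sketch, not in the original analysis; it restricts to questions with a single vote, where the average equals the raw vote):
single = df_with_rating[df_with_rating.num_votes == 1]['rating'].astype(float)
# Coarse grid: multiples of 10, plus the spike at 95
grid = list(range(0, 101, 10)) + [95]
print('%.0f%% of single-vote ratings lie on the coarse grid'
      % (100 * single.isin(grid).mean()))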
%less data/google_analytics_20140430.csv
(dict_all, date_clicks,
num_clicks) = data_to_dict_clickdates_clickscount(os.path.join('data','google_analytics_20140430.csv'))
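The helper returns a nested dictionary keyed by course, exam, and question. For reference, `add_time` below assumes roughly this shape (a sketch inferred from its use, not the documented `MERhelpers` API):
# Assumed shape of dict_all (inferred from add_time below):
# dict_all = {
#     'MATH100': {
#         'December_2013': {
#             'Question_01_(a)': {'avg_time': 146.0},  # seconds
#             # ...
#         },
#     },
# }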
def add_time(url):
    '''Returns avg_time for input question from Google Analytics'''
    course, exam, question = url.split('/')[5:8]
    try:
        return dict_all[course][exam][question]['avg_time']
    except KeyError:
        # Retry without zero-padding, e.g. 'Question_01' -> 'Question_1'
        try:
            return dict_all[course][exam][question.replace('Question_0', 'Question_')]['avg_time']
        except KeyError:
            return None
df_with_rating['time'] = df_with_rating['url'].apply(add_time)
try:
    # Drop columns we no longer need; the guard allows re-running the cell
    df = df_with_rating.drop(['num_hints', 'num_sols'], axis=1, inplace=False).dropna()
except ValueError:
    pass
df.head(5)
| | url | course | exam | q_short | num_votes | rating | time |
|---|---|---|---|---|---|---|---|
| 24 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (a) | 3 | 83 | 146 |
| 25 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (b) | 2 | 89 | 150 |
| 26 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (c) | 1 | 95 | 165 |
| 27 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 2 (a) | 2 | 98 | 123 |
| 28 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 2 (b) | 2 | 91 | 225 |
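Before moving on, a quick sanity check on the merge (not in the original analysis; it just counts how many rated questions matched the analytics data):
print('%d of %d rated questions matched to Google Analytics data'
      % (len(df), len(df_with_rating)))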
We expect that students spend more time on a question page when the question is harder.
More formally, let's see if we can reject the null hypothesis:
*The average time spent on a question page is independent of the perceived easiness of the question.*
all_avg_ratings = df['rating'].values.astype(float)
all_avg_times = df['time'].values.astype(float)
slope, intercept, r_value, p_value, std_err = stats.linregress(all_avg_ratings, all_avg_times)
line = slope*all_avg_ratings+intercept
plt.figure(figsize=(12,3))
plt.plot(all_avg_ratings, all_avg_times, 'ko', all_avg_ratings, line, 'r-')
plt.xlabel('Average easiness rating (0=hard, 100=easy)')
plt.ylabel('Average time spent on question page [seconds]')
plt.annotate('$r^2$ = %.2f \np = %.2e' % (r_value**2, p_value),
xy=(60, 650))
plt.show()
We see that there is indeed a negative correlation between the time spent on a question page and the perceived easiness!
The extremely small $p$-value indicates that the null hypothesis, that there is no correlation, can be rejected with overwhelming confidence.
However, the $r^2$ value is also quite small, which means that the predictive power of this correlation is limited. A loose interpretation of $r^2 = 0.15$ is: "Fifteen percent of the variation in the average time spent on the page can be explained by the average easiness rating of the question" (see Wikipedia).
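To make this interpretation concrete, $r^2$ can be recomputed from the fitted line as the fraction of variance explained (a quick sanity check using the arrays from above):
# r^2 = 1 - SS_res / SS_tot: the fraction of variance in the average
# times that the fitted line accounts for
ss_res = np.sum((all_avg_times - line)**2)
ss_tot = np.sum((all_avg_times - np.mean(all_avg_times))**2)
print('r^2 = %.2f' % (1 - ss_res / ss_tot))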
Confirming our intuition, we find a negative correlation between the easiness rating of an exam question and the average time spent on the corresponding question page: the easier the question, the less time is spent on the question page, and vice versa.
At the same time, the variability is quite large: the average time spent on a page explains only about 15% of the variance in the average easiness. On a per-question basis, the predictive power of the average time spent on a question page is therefore limited.
There are many other factors that could influence the time spent on a page, such as:
a. the number, length, and quality of hints and solutions,
b. the course level,
c. the interdependence of the question with other parts of the same exam problem.
Overall, the time spent on a question page does not say much about its easiness.
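As an aside, factor (a) can be probed directly with the data we already have (a sketch, not part of the original analysis):
# Do pages with more hints/solutions hold visitors longer?
df_probe = df_with_rating.dropna(subset=['time'])
for col in ['num_hints', 'num_sols']:
    r, p = stats.pearsonr(df_probe[col].astype(float),
                          df_probe['time'].astype(float))
    print('%s vs. time: r = %.2f, p = %.2e' % (col, r, p))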
Next, we would like to compare the easiness rating with actual exam scores. We have anonymous exam scores from one section of the MATH 103 April 2013 exam. These exam scores will be compared to the easiness ratings given by the April 2014 class.
Our hypothesis is that the easiness rating of the 2014 class is a good predictor of the exam score of the 2013 class.
If this holds, then the easiness rating could provide valuable insight into student understanding and exam design.
We first read in the average score for each question from the real exam and add it to the data frame.
def rename(x):
    '''Renames question names from exam data to be consistent with the MER question format'''
    # Zero-pad single-digit question numbers, e.g. '1a' -> 'Question_01a'
    if x[0:2] == '10':
        x = 'Question_' + x.strip()
    else:
        x = 'Question_0' + x.strip()
    if len(x) == 11:    # e.g. 'Question_01'
        return x
    if len(x) == 12:    # e.g. 'Question_01a' -> 'Question_01_(a)'
        return x[:11] + '_(' + x[11] + ')'
    if len(x) > 12:     # e.g. 'Question_01aii' -> 'Question_01_(a)_ii'
        return x[:11] + '_(' + x[11] + ')_' + x[12:]
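A few hypothetical inputs illustrate the renaming (the exact labels in the CSV may differ):
# Spot check on illustrative question labels
for raw in ['1', '1a', '1aii', '10b']:
    print('%-5s -> %s' % (raw, rename(raw)))
# 1     -> Question_01
# 1a    -> Question_01_(a)
# 1aii  -> Question_01_(a)_ii
# 10b   -> Question_10_(b)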
with open(os.path.join('data','MATH103_2013_FinalExamGradesPerQuestionAnonymous.csv'), 'r') as data:
    for line in data:
        if 'Q' == line[0]:
            # Header row with the question names
            questions = [rename(q) for q in line.split(',')[1:]]
        elif line[0] in ['M', 'S']:
            # Skip summary rows
            continue
        else:
            scores = [float(s.strip()) for s in line.split(',')[1:]]
assert len(questions) == len(scores)
def add_exam_score(q_short):
    '''Looks up the exam score for a question given in MER short format'''
    for q_long, score in zip(questions, scores):
        if q_short.replace(' ', '_') in q_long:
            return score*100  # scale to 0-100 to match the easiness rating
df_M103_2013 = df[(df.course.str.contains('MATH103') &
                   df.exam.str.contains('April_2013'))]
df_M103_2013['score'] = df_M103_2013['q_short'].apply(add_exam_score)
df_M103_2013.head(5)
| | url | course | exam | q_short | num_votes | rating | time | score |
|---|---|---|---|---|---|---|---|---|
| 368 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (a) | 24 | 57 | 148 | 81.578947 |
| 369 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (b) | 17 | 51 | 246 | 55.921053 |
| 370 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (c) | 26 | 25 | 281 | 32.894737 |
| 371 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (d) | 19 | 76 | 140 | 73.026316 |
| 372 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (e) | 12 | 43 | 178 | 64.473684 |
Looks good. Let's compare the two measures of question easiness side by side.
ratings = df_M103_2013['rating'].values.astype(float)
scores = df_M103_2013['score'].values.astype(float)
bar_width = 0.4
index = np.arange(len(ratings))
alpha = 0.4
plt.figure(figsize=(14, 6))
plt.bar(index, ratings, bar_width,
        alpha=alpha,
        color='b',
        label='easiness rating')
plt.bar(index + bar_width, scores, bar_width,
        alpha=alpha,
        color='r',
        label='exam score')
# Horizontal lines at the overall means
plt.plot(index + bar_width, [np.mean(ratings) for x in index], 'b', alpha=alpha)
plt.plot(index + bar_width, [np.mean(scores) for x in index], 'r', alpha=alpha)
plt.legend(loc='upper left')
plt.xticks(index + bar_width, df_M103_2013['q_short'].values, rotation=70)
plt.xlabel('question number')
plt.ylabel('easiness / exam score')
plt.title('Math 103 April 2013 exam')
plt.tight_layout()
plt.show()
For each question, the left bar can be interpreted as the average prediction that students have about the score they would receive on this question on the exam, and the right bar is the average score that was actually received by students taking that exam in 2013. The horizontal lines are the average scores over all questions.
However, there are a few outliers:
A. Q 7 (e) is perceived as relatively easy but received the lowest score on the exam. This question requires a complete solution of the previous parts of Q 7, and many students received a score of 0 on it (and on the previous parts of Q 7). So Q 7 (e) itself is not particularly difficult, and students could solve it if they know the solutions to Q 7 (a)-(d) (which students on the wiki would). This is a scenario that an exam designer wants to avoid!
B. Q 8 (a) is a simple application of the Fundamental Theorem of Calculus. The 2014 students perceived this question as the easiest on the entire exam, yet the exam score was below average. Similarly with Q 8 (b), another simple application of the Fundamental Theorem of Calculus. Q 8 (c) can be solved quickly and elegantly by applying the Fundamental Theorem of Calculus, or directly via a lengthy computation. Here students perceived the question as much harder, and indeed the exam score is the lowest among all questions on the Fundamental Theorem of Calculus. This suggests that the 2014 students have a much better grasp of this concept than the 2013 class.
Back to the correlation between the easiness rating and the exam score. We have all the data ready, so let's look at a linear regression.
exam_slope, exam_intercept, exam_r_value, exam_p_value, exam_std_err = stats.linregress(ratings, scores)
exam_line = exam_slope*ratings+exam_intercept
plt.figure(figsize=(6,6))
plt.plot(range(100), range(100), 'b--',
ratings, scores, 'ko',
ratings, exam_line, 'r-')
plt.xlabel('Easiness rating')
plt.ylabel('Exam score')
plt.grid('on')
plt.annotate('$r^2$: %.2f \n$p$: %.1e' %(exam_r_value**2, exam_p_value), xy=(10, 80))
for q, r, s in zip(df_M103_2013['q_short'].values, ratings, scores):
plt.annotate(q, xy=(r+1, s+1))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
The dashed blue line indicates perfect correlation, where the average easiness rating corresponds exactly to the average exam score. The red linear regression line has a slope of less than 1 and crosses the identity line close to the average scores. This indicates that students overperform on questions that they perceive as difficult and underperform on questions that they perceive as easy. Or, spun the other way around:
Students tend to rate low-scoring exam questions as too easy, and high-scoring exam questions as too difficult.
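A quick computation makes the crossing-point claim concrete (assuming the fitted slope is not 1):
# The fitted line slope*x + intercept meets the identity line y = x at
# x = intercept / (1 - slope)
x_cross = exam_intercept / (1 - exam_slope)
print('slope = %.2f, crossing point = %.1f, mean exam score = %.1f'
      % (exam_slope, x_cross, np.mean(scores)))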
Not surprisingly, the very low $p$-value indicates that there is a significant correlation between the students' perception of the difficulty of a question and their score on the exam. The low $r^2$ value is more puzzling, but in hindsight reveals a flaw in our initial assumption:
Because of the interdependence of sub-questions and the different emphasis on topics, we should not expect a *perfect correlation* between easiness rating and exam score.
Instead, this correlation analysis offers insight into exam design and student understanding that is valuable for improving teaching.
More precisely, our analysis allows us to
a. identify questions whose exam scores are skewed by exam design, such as the strong dependence of Q 7 (e) on earlier parts, and
b. identify concepts where student perception and actual performance diverge, such as the Fundamental Theorem of Calculus questions in Q 8.
Both of these insights could be very valuable in teacher training.
Out of curiosity, how much better does the correlation get when I remove the three outlier questions identified above (Q 7 (e), Q 8 (a), Q 8 (b)), where we would not expect a high correlation anyway?
# Drop the three outlier questions identified above
trimmed_q_r_s = [(float(r), float(s))
                 for q, r, s in zip(df_M103_2013['q_short'],
                                    df_M103_2013['rating'],
                                    df_M103_2013['score'])
                 if ('7 (e)' not in q) and ('8 (a)' not in q) and ('8 (b)' not in q)]
trimmed_ratings, trimmed_scores = (np.asarray(v) for v in zip(*trimmed_q_r_s))
(trimmed_slope, trimmed_intercept, trimmed_r_value, trimmed_p_value,
trimmed_std_err) = stats.linregress(trimmed_ratings, trimmed_scores)
trimmed_line = trimmed_slope*trimmed_ratings+trimmed_intercept
plt.figure(figsize=(6,6))
plt.plot(range(100), range(100), 'b--',
trimmed_ratings, trimmed_scores, 'ko',
trimmed_ratings, trimmed_line, 'r-')
plt.xlabel('Easiness rating')
plt.ylabel('Exam score')
plt.grid('on')
plt.annotate('$r^2$: %.2f \n$p$: %.1e' %(trimmed_r_value**2, trimmed_p_value), xy=(10, 80))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
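For a direct side-by-side comparison of the two fits (using the statistics computed above):
# Full fit vs. fit without the three outlier questions
print('full:    r^2 = %.2f, p = %.1e' % (exam_r_value**2, exam_p_value))
print('trimmed: r^2 = %.2f, p = %.1e' % (trimmed_r_value**2, trimmed_p_value))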
Wow, the correlation is much stronger: $p$ is smaller, $r^2$ is much larger! This makes me confident that student ratings can eventually be used to personalize practice and practice exams.
Finally, the good news for students is that they typically perform better on the exam than they think :)
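A one-line check of this claim, using the full (untrimmed) arrays from above:
# Average gap between actual exam score and perceived easiness
print('mean(score - rating) = %.1f points' % np.mean(scores - ratings))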