| Main page: | wiki.ubc.ca/Science:Math_Education_Resources |
|---|---|
| Contributors portal: | wiki.ubc.ca/Science:MER |
| Source of data: | Google Analytics, easiness ratings, exam scores |
| Contact info: | mer-wiki (at) math (dot) ubc (dot) ca |
First, let's read in the data:
%%capture
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sys
import os
from MERhelpers import data_to_dict_clickdates_clickscount
from scipy import stats
%pylab inline
# We obtain `data/all_questions.csv` by scraping the HTML source of each question's page.
%less data/all_questions.csv
# This is done in a helper script:
#%run MER2csv.py
df_all = pd.read_csv(os.path.join('data','all_questions.csv'), header=None,
names=['url', 'course', 'exam', 'q_short', 'num_votes',
'rating', 'num_hints', 'num_sols'])
df_with_rating = df_all[df_all.num_votes > 0]
df_with_rating.head(5)
| | url | course | exam | q_short | num_votes | rating | num_hints | num_sols |
|---|---|---|---|---|---|---|---|---|
| 0 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (a) | 1 | 95 | 1 | 1 |
| 1 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (b) | 1 | 90 | 2 | 1 |
| 2 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (c) | 1 | 85 | 2 | 1 |
| 3 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 1 (d) | 1 | 92 | 2 | 1 |
| 4 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2013 | 2 (a) | 2 | 90 | 2 | 1 |
To get a sense of what the rating data looks like, let's aggregate it into a histogram.
all_avg_ratings = df_with_rating['rating'].astype(float).values
plt.hist(all_avg_ratings, bins=101)
plt.title('%d easiness votes' % sum(df_with_rating['num_votes']))
plt.xlabel('Average easiness rating (0=hard, 100=easy)')
plt.ylabel('count')
plt.annotate('Mean easiness: %.1f \nMedian easiness: %.1f' % (np.mean(all_avg_ratings),
np.median(all_avg_ratings)),
xy=(20, 40))
plt.show()
We see that 100% easiness is by far the most popular single rating. In general, the majority of questions are rated on the easier side. Also interesting are the spikes at 0, 20, 40, 50, 60, 70, 80, 90, and 95, which indicate that many students vote on a simpler 11-point scale instead of using all 101 provided options.
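A quick way to quantify this (a sketch, not in the original analysis; it restricts to questions with a single vote, where the average equals the raw vote):
single = df_with_rating[df_with_rating.num_votes == 1]['rating'].astype(float)
# Coarse grid: multiples of 10, plus the spike at 95
grid = list(range(0, 101, 10)) + [95]
print('%.0f%% of single-vote ratings lie on the coarse grid'
      % (100 * single.isin(grid).mean()))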
%less data/google_analytics_20140430.csv
(dict_all, date_clicks,
num_clicks) = data_to_dict_clickdates_clickscount(os.path.join('data','google_analytics_20140430.csv'))
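The helper returns a nested dictionary keyed by course, exam, and question. For reference, `add_time` below assumes roughly this shape (a sketch inferred from its use, not the documented `MERhelpers` API):
# Assumed shape of dict_all (inferred from add_time below):
# dict_all = {
#     'MATH100': {
#         'December_2013': {
#             'Question_01_(a)': {'avg_time': 146.0},  # seconds
#             # ...
#         },
#     },
# }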
def add_time(url):
    '''Returns avg_time for input question from Google Analytics'''
    course, exam, question = url.split('/')[5:8]
    try:
        return dict_all[course][exam][question]['avg_time']
    except KeyError:
        # Retry without zero-padding, e.g. 'Question_01' -> 'Question_1'
        try:
            return dict_all[course][exam][question.replace('Question_0', 'Question_')]['avg_time']
        except KeyError:
            return None
df_with_rating['time'] = df_with_rating['url'].apply(add_time)
try:
    # Drop columns we no longer need; the guard allows re-running the cell
    df = df_with_rating.drop(['num_hints', 'num_sols'], axis=1, inplace=False).dropna()
except ValueError:
    pass
df.head(5)
| | url | course | exam | q_short | num_votes | rating | time |
|---|---|---|---|---|---|---|---|
| 24 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (a) | 3 | 83 | 146 |
| 25 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (b) | 2 | 89 | 150 |
| 26 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 1 (c) | 1 | 95 | 165 |
| 27 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 2 (a) | 2 | 98 | 123 |
| 28 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH100 | December_2012 | 2 (b) | 2 | 91 | 225 |
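Before moving on, a quick sanity check on the merge (not in the original analysis; it just counts how many rated questions matched the analytics data):
print('%d of %d rated questions matched to Google Analytics data'
      % (len(df), len(df_with_rating)))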
We expect that students spend more time on a question page when the question is harder.
More formally, let's see if we can reject the null hypothesis:
*The average time spent on a question page is independent of the perceived easiness of the question.*
all_avg_ratings = df['rating'].values.astype(float)
all_avg_times = df['time'].values.astype(float)
slope, intercept, r_value, p_value, std_err = stats.linregress(all_avg_ratings, all_avg_times)
line = slope*all_avg_ratings+intercept
plt.figure(figsize=(12,3))
plt.plot(all_avg_ratings, all_avg_times, 'ko', all_avg_ratings, line, 'r-')
plt.xlabel('Average easiness rating (0=hard, 100=easy)')
plt.ylabel('Average time spent on question page [seconds]')
plt.annotate('$r^2$ = %.2f \np = %.2e' % (r_value**2, p_value),
xy=(60, 650))
plt.show()
We see that there is indeed a negative correlation between the time spent on a question page and the perceived easiness!
The extremely small $p$-value indicates that the null hypothesis, that there is no correlation, can be rejected with overwhelming confidence.
However, the $r^2$ value is also quite small, which means that the predictive power of this correlation is limited. A loose interpretation of $r^2 = 0.15$ is: "Fifteen percent of the variation in the average time spent on the page can be explained by the average easiness rating of the question" (see Wikipedia).
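To make this interpretation concrete, $r^2$ can be recomputed from the fitted line as the fraction of variance explained (a quick sanity check using the arrays from above):
# r^2 = 1 - SS_res / SS_tot: the fraction of variance in the average
# times that the fitted line accounts for
ss_res = np.sum((all_avg_times - line)**2)
ss_tot = np.sum((all_avg_times - np.mean(all_avg_times))**2)
print('r^2 = %.2f' % (1 - ss_res / ss_tot))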
Confirming our intuition, we find a negative correlation between the easiness rating of an exam question and the average time spent on the corresponding question page: the easier the question, the less time is spent on the question page, and vice versa.
At the same time, the variability is quite large: the average time spent on a page explains only about 15% of the variance in the average easiness. On a per-question basis, the predictive power of the average time spent on a question page is therefore limited.
There are many other factors that could influence the time spent on a page, such as:
a. the number, length, and quality of hints and solutions,
b. the course level,
c. the interdependence of the question with other parts of the same exam problem.
Overall, the time spent on a question page does not say much about its easiness.
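As an aside, factor (a) can be probed directly with the data we already have (a sketch, not part of the original analysis):
# Do pages with more hints/solutions hold visitors longer?
df_probe = df_with_rating.dropna(subset=['time'])
for col in ['num_hints', 'num_sols']:
    r, p = stats.pearsonr(df_probe[col].astype(float),
                          df_probe['time'].astype(float))
    print('%s vs. time: r = %.2f, p = %.2e' % (col, r, p))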
Next, we would like to compare the easiness rating with actual exam scores. We have anonymous exam scores from one section of the MATH 103 April 2013 exam. These exam scores will be compared to the easiness ratings given by the April 2014 class.
Our hypothesis is that the easiness rating of the 2014 class is a good predictor of the exam score of the 2013 class.
If this holds, then the easiness rating could provide valuable insight into student understanding and exam design.
We first read in the average score for each question from the real exam and add it to the data frame.
def rename(x):
    '''Renames question names from exam data to be consistent with the MER question format'''
    # Zero-pad single-digit question numbers, e.g. '1a' -> 'Question_01a'
    if x[0:2] == '10':
        x = 'Question_' + x.strip()
    else:
        x = 'Question_0' + x.strip()
    if len(x) == 11:    # e.g. 'Question_01'
        return x
    if len(x) == 12:    # e.g. 'Question_01a' -> 'Question_01_(a)'
        return x[:11] + '_(' + x[11] + ')'
    if len(x) > 12:     # e.g. 'Question_01aii' -> 'Question_01_(a)_ii'
        return x[:11] + '_(' + x[11] + ')_' + x[12:]
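A few hypothetical inputs illustrate the renaming (the exact labels in the CSV may differ):
# Spot check on illustrative question labels
for raw in ['1', '1a', '1aii', '10b']:
    print('%-5s -> %s' % (raw, rename(raw)))
# 1     -> Question_01
# 1a    -> Question_01_(a)
# 1aii  -> Question_01_(a)_ii
# 10b   -> Question_10_(b)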
with open(os.path.join('data','MATH103_2013_FinalExamGradesPerQuestionAnonymous.csv'), 'r') as data:
    for line in data:
        if 'Q' == line[0]:
            # Header row with the question names
            questions = [rename(q) for q in line.split(',')[1:]]
        elif line[0] in ['M', 'S']:
            # Skip summary rows
            continue
        else:
            scores = [float(s.strip()) for s in line.split(',')[1:]]
assert len(questions) == len(scores)
def add_exam_score(q_short):
    '''Looks up the exam score for a question given in MER short format'''
    for q_long, score in zip(questions, scores):
        if q_short.replace(' ', '_') in q_long:
            return score*100  # scale to 0-100 to match the easiness rating
df_M103_2013 = df[(df.course.str.contains('MATH103') &
                   df.exam.str.contains('April_2013'))]
df_M103_2013['score'] = df_M103_2013['q_short'].apply(add_exam_score)
df_M103_2013.head(5)
| | url | course | exam | q_short | num_votes | rating | time | score |
|---|---|---|---|---|---|---|---|---|
| 368 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (a) | 24 | 57 | 148 | 81.578947 |
| 369 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (b) | 17 | 51 | 246 | 55.921053 |
| 370 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (c) | 26 | 25 | 281 | 32.894737 |
| 371 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (d) | 19 | 76 | 140 | 73.026316 |
| 372 | http://wiki.ubc.ca/Science:Math_Exam_Resources... | MATH103 | April_2013 | 1 (e) | 12 | 43 | 178 | 64.473684 |
Looks good. Let's compare the two measures of question easiness side by side.
ratings = df_M103_2013['rating'].values.astype(float)
scores = df_M103_2013['score'].values.astype(float)
bar_width = 0.4
index = np.arange(len(ratings))
alpha = 0.4
plt.figure(figsize=(14, 6))
plt.bar(index, ratings, bar_width,
        alpha=alpha,
        color='b',
        label='easiness rating')
plt.bar(index + bar_width, scores, bar_width,
        alpha=alpha,
        color='r',
        label='exam score')
# Horizontal lines at the overall means
plt.plot(index + bar_width, [np.mean(ratings) for x in index], 'b', alpha=alpha)
plt.plot(index + bar_width, [np.mean(scores) for x in index], 'r', alpha=alpha)
plt.legend(loc='upper left')
plt.xticks(index + bar_width, df_M103_2013['q_short'].values, rotation=70)
plt.xlabel('question number')
plt.ylabel('easiness / exam score')
plt.title('Math 103 April 2013 exam')
plt.tight_layout()
plt.show()
For each question, the left bar can be interpreted as the average prediction that students have about the score they would receive on this question on the exam, and the right bar is the average score that was actually received by students taking that exam in 2013. The horizontal lines are the average scores over all questions.
However, there are a few outliers:
A. Q 7 (e) is perceived as relatively easy but received the lowest score on the exam. This question requires a complete solution of the previous parts of Q 7, and many students received a score of 0 on it (and on the previous parts of Q 7). So Q 7 (e) itself is not particularly difficult, and students could solve it if they know the solutions to Q 7 (a)-(d) (which students on the wiki would). This is a scenario that an exam designer wants to avoid!
B. Q 8 (a) is a simple application of the Fundamental Theorem of Calculus. The 2014 students perceived this question as the easiest on the entire exam, yet the exam score was below average. Similarly with Q 8 (b), another simple application of the Fundamental Theorem of Calculus. Q 8 (c) can be solved quickly and elegantly by applying the Fundamental Theorem of Calculus, or directly via a lengthy computation. Here students perceived the question as much harder, and indeed the exam score is the lowest among all questions on the Fundamental Theorem of Calculus. This suggests that the 2014 students have a much better grasp of this concept than the 2013 class.
Back to the correlation between the easiness rating and the exam score. We have all the data ready, so let's look at a linear regression.
exam_slope, exam_intercept, exam_r_value, exam_p_value, exam_std_err = stats.linregress(ratings, scores)
exam_line = exam_slope*ratings+exam_intercept
plt.figure(figsize=(6,6))
plt.plot(range(100), range(100), 'b--',
ratings, scores, 'ko',
ratings, exam_line, 'r-')
plt.xlabel('Easiness rating')
plt.ylabel('Exam score')
plt.grid('on')
plt.annotate('$r^2$: %.2f \n$p$: %.1e' %(exam_r_value**2, exam_p_value), xy=(10, 80))
for q, r, s in zip(df_M103_2013['q_short'].values, ratings, scores):
plt.annotate(q, xy=(r+1, s+1))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
The dashed blue line indicates perfect correlation, where the average easiness rating corresponds exactly to the average exam score. The red linear regression line has a slope of less than 1 and crosses the identity line close to the average scores. This indicates that students overperform on questions that they perceive as difficult and underperform on questions that they perceive as easy. Or, spun the other way around:
Students tend to rate low-scoring exam questions as too easy, and high-scoring exam questions as too difficult.
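A quick computation makes the crossing-point claim concrete (assuming the fitted slope is not 1):
# The fitted line slope*x + intercept meets the identity line y = x at
# x = intercept / (1 - slope)
x_cross = exam_intercept / (1 - exam_slope)
print('slope = %.2f, crossing point = %.1f, mean exam score = %.1f'
      % (exam_slope, x_cross, np.mean(scores)))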
Not surprisingly, the very low $p$-value indicates that there is a significant correlation between the students' perception of the difficulty of a question and their score on the exam. The low $r^2$ value is more puzzling, but in hindsight reveals a flaw in our initial assumption:
Because of the interdependence of sub-questions and the different emphasis on topics, we should not expect a *perfect correlation* between easiness rating and exam score.
Instead, this correlation analysis offers insight into exam design and student understanding that is valuable for improving teaching.
More precisely, our analysis allows us to
a. identify questions whose exam scores are skewed by exam design, such as the strong dependence of Q 7 (e) on earlier parts, and
b. identify concepts where student perception and actual performance diverge, such as the Fundamental Theorem of Calculus questions in Q 8.
Both of these insights could be very valuable in teacher training.
Out of curiosity, how much better does the correlation get when I remove the three outlier questions identified above (Q 7 (e), Q 8 (a), Q 8 (b)), where we would not expect a high correlation anyway?
# Drop the three outlier questions identified above
trimmed_q_r_s = [(float(r), float(s))
                 for q, r, s in zip(df_M103_2013['q_short'],
                                    df_M103_2013['rating'],
                                    df_M103_2013['score'])
                 if ('7 (e)' not in q) and ('8 (a)' not in q) and ('8 (b)' not in q)]
trimmed_ratings, trimmed_scores = (np.asarray(v) for v in zip(*trimmed_q_r_s))
(trimmed_slope, trimmed_intercept, trimmed_r_value, trimmed_p_value,
trimmed_std_err) = stats.linregress(trimmed_ratings, trimmed_scores)
trimmed_line = trimmed_slope*trimmed_ratings+trimmed_intercept
plt.figure(figsize=(6,6))
plt.plot(range(100), range(100), 'b--',
trimmed_ratings, trimmed_scores, 'ko',
trimmed_ratings, trimmed_line, 'r-')
plt.xlabel('Easiness rating')
plt.ylabel('Exam score')
plt.grid('on')
plt.annotate('$r^2$: %.2f \n$p$: %.1e' %(trimmed_r_value**2, trimmed_p_value), xy=(10, 80))
plt.xlim(0, 100)
plt.ylim(0, 100)
plt.show()
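For a direct side-by-side comparison of the two fits (using the statistics computed above):
# Full fit vs. fit without the three outlier questions
print('full:    r^2 = %.2f, p = %.1e' % (exam_r_value**2, exam_p_value))
print('trimmed: r^2 = %.2f, p = %.1e' % (trimmed_r_value**2, trimmed_p_value))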
Wow, the correlation is much stronger: $p$ is smaller, $r^2$ is much larger! This makes me confident that student ratings can eventually be used to personalize practice and practice exams.
Finally, the good news for students is that they typically perform better on the exam than they think :)
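A one-line check of this claim, using the full (untrimmed) arrays from above:
# Average gap between actual exam score and perceived easiness
print('mean(score - rating) = %.1f points' % np.mean(scores - ratings))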