If you've got an ordered list as output and an ordered list for reference, you can now use Python to easily calculate these sweet rank-based metrics: Kendall's Tau, Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, and Expected Reciprocal Rank.
If you're into Machine Learning, you probably already know that Learning to Rank is the new black. RankEval is a set of Python modules that implement some of the metrics that came out of Yahoo's Learning to Rank competition. It was made for use in Machine Translation but works elsewhere, too. Since it's new, it doesn't have much (any) documentation. This post fills in the blanks.
This part lets Python know about the library. You should modify rank_eval_dir to point to wherever you downloaded it. You can skip this step by adding that directory to your PYTHONPATH instead.
import sys

# Point this at the directory where you unpacked RankEval.
rank_eval_dir = './rankeval-master/src/'
sys.path.append(rank_eval_dir)
The biggest part of Data Science (maybe of programming in general) is schlepping data. Don't let your maths professors fool you: the first step to getting the pretty graphs is mucking around with the numbers. And so it is here.
Basically, you pass these metrics an ordered list (that's the Rank part). Actually, you need two: the first is a prediction (the output of your model); the second is a reference (probably some simple transformation of the input). The metrics give you back a measure of how good your predicted ranking is.
RankEval comes with a Ranking class and some utility functions to help define rankings. You'll need to wrap your lists in Ranking instances before the metrics will accept them. Here's an example of how to use it.
from sentence.ranking import Ranking

# The reference is the "true" ordering; the prediction is your model's output.
reference_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
predicted_list = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]
reference_ranking = Ranking(reference_list)
predicted_ranking = Ranking(predicted_list)
There are two versions of each metric included in the library: an aggregate (set) version and an individual version. Here's how the individual version of Kendall's Tau works. It builds on the previous code block, where you create your Ranking objects.
from evaluation.ranking.segment import kendall_tau

result = kendall_tau(predicted_ranking, reference_ranking)
result.tau
# 0.8181818181818182
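That number checks out: with 12 items there are 66 pairs to compare, the prediction gets exactly 6 of them backwards (each swapped neighbour), and (60 - 6) / 66 ≈ 0.818.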
More individual metrics are available: import reciprocal_rank and ndgc_err from the same module.
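If you want to try them, here's a minimal sketch. I'm assuming they take the same (prediction, reference) arguments as kendall_tau; check the module source if the results look off.

from evaluation.ranking.segment import reciprocal_rank, ndgc_err

# Assumption: same call signature as kendall_tau (prediction first, reference second).
rr_result = reciprocal_rank(predicted_ranking, reference_ranking)
err_result = ndgc_err(predicted_ranking, reference_ranking)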
This is where it gets real. When you adapt this code to your own problems, this is probably where you'll start.
from evaluation.ranking.set import kendall_tau_set

# Make a second example prediction.
predicted_list_two = [12, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1]
predicted_ranking_two = Ranking(predicted_list_two)

# The set versions take lists of Rankings. The same reference is reused
# for both predictions here, just to keep the example short.
predicted_set = [predicted_ranking, predicted_ranking_two]
reference_set = [reference_ranking, reference_ranking]
result_set = kendall_tau_set(predicted_set, reference_set)
result_set
# {'tau': 0.5909090909090909, 'tau_all_pairs': 132, 'tau_avg_seg': 0.59090909090909094,
#  'tau_avg_seg_prob': 4.2322067287064823e-27, 'tau_concordant': 105, 'tau_discordant': 27,
#  'tau_original_ties': 0, 'tau_predicted_ties': 0, 'tau_predicted_ties_per': 0.0,
#  'tau_prob': 9.1709457975947886e-24, 'tau_sentence_ties': 0, 'tau_sentence_ties_per': 0.0,
#  'tau_valid_pairs': 132}
More set-level metrics: import mrr and avg_ndgc_err from the same module.
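Again, a hedged sketch; I'm assuming these mirror kendall_tau_set's (predicted_set, reference_set) signature.

from evaluation.ranking.set import mrr, avg_ndgc_err

# Assumption: same call signature as kendall_tau_set.
mrr_result = mrr(predicted_set, reference_set)
avg_err_result = avg_ndgc_err(predicted_set, reference_set)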
This is a pretty simple example, but it should be enough to get you started. In real-world applications the entries in your reference set are likely to be unique (instead of duplicated, as in this example), and each one should line up with its corresponding prediction in the predicted set.
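In code, that alignment might look something like this; model_outputs and gold_orderings are hypothetical stand-ins for your real data, and everything else comes from the examples above.

# Hypothetical data: one (prediction, reference) pair per query.
model_outputs = [[2, 1, 3], [1, 3, 2], [3, 2, 1]]
gold_orderings = [[1, 2, 3], [3, 1, 2], [2, 3, 1]]

# Keep the two sets index-aligned: prediction i is scored against reference i.
predicted_set = [Ranking(p) for p in model_outputs]
reference_set = [Ranking(r) for r in gold_orderings]
result_set = kendall_tau_set(predicted_set, reference_set)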
If your reference does turn out to be some simple transformation of your input, you should probably use some kind of Cross-Validation.
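Here's one rough way to set that up, using scikit-learn's KFold. scikit-learn isn't part of RankEval and the pairs list is a made-up stand-in, so treat this as a sketch rather than gospel.

from sklearn.model_selection import KFold

# Hypothetical stand-in: one (predicted_list, reference_list) pair per item.
pairs = [([2, 1, 3], [1, 2, 3])] * 20

# Score each held-out fold separately instead of the whole set at once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in kf.split(pairs):
    held_out = [pairs[i] for i in test_idx]
    predicted_set = [Ranking(p) for p, _ in held_out]
    reference_set = [Ranking(r) for _, r in held_out]
    print(kendall_tau_set(predicted_set, reference_set)['tau'])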
May your beards always be full and your frames thick.