If you've got an ordered list as output and an ordered list for reference, you can now use Python to easily calculate these sweet rank-based metrics: Kendall's Tau, Mean Reciprocal Rank, Normalized Discounted Cumulative Gain, and Expected Reciprocal Rank.
If you're into Machine Learning, you probably already know that Learning to Rank is the new black. RankEval is a set of Python modules that implement some of the metrics that came out of Yahoo's Learning to Rank competition. It was made for use in Machine Translation but works elsewhere, too. Since it's new, it doesn't have much (any) documentation. This post fills in the blanks.
This part lets Python know about the library. You should modify rank_eval_dir to point to wherever you downloaded it. You can skip this step by adding that directory to your PYTHONPATH instead.
import sys

# Point this at the directory where you unpacked RankEval.
rank_eval_dir = './rankeval-master/src/'
sys.path.append(rank_eval_dir)
The biggest part of Data Science (maybe of programming in general) is schlepping data. Don't let your maths professors fool you: the first step to getting the pretty graphs is mucking around with the numbers. And so it is here.
Basically, you pass these metrics an ordered list (that's the Rank part). Actually, you need two: the first is a prediction (the output of your model); the second is a reference (probably some simple transformation of the input). The metrics give you back a measure of how good your predicted ranking is.
RankEval comes with a Ranking class and some utility functions to help define rankings. You'll need to wrap your lists in Ranking instances before the metrics will accept them. Here's an example of how to use it.
from sentence.ranking import Ranking

# The reference is the "true" ordering; the prediction is your model's output.
reference_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
predicted_list = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]
reference_ranking = Ranking(reference_list)
predicted_ranking = Ranking(predicted_list)
There are two versions of each metric included in the library: an aggregate (set) version and an individual version. Here's how the individual version of Kendall's Tau works. It builds on the previous code block, where you create your Ranking objects.
from evaluation.ranking.segment import kendall_tau

result = kendall_tau(predicted_ranking, reference_ranking)
result.tau
# 0.8181818181818182
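That number checks out: with 12 items there are 66 pairs to compare, the prediction gets exactly 6 of them backwards (each swapped neighbour), and (60 - 6) / 66 ≈ 0.818.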
More individual metrics are available: import reciprocal_rank and ndgc_err from the same module.
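If you want to try them, here's a minimal sketch. I'm assuming they take the same (prediction, reference) arguments as kendall_tau; check the module source if the results look off.

from evaluation.ranking.segment import reciprocal_rank, ndgc_err

# Assumption: same call signature as kendall_tau (prediction first, reference second).
rr_result = reciprocal_rank(predicted_ranking, reference_ranking)
err_result = ndgc_err(predicted_ranking, reference_ranking)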
This is where it gets real. When you adapt this code to your own problems, this is probably where you'll start.
from evaluation.ranking.set import kendall_tau_set

# Make a second example prediction.
predicted_list_two = [12, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 1]
predicted_ranking_two = Ranking(predicted_list_two)

# The set versions take lists of Rankings. The same reference is reused
# for both predictions here, just to keep the example short.
predicted_set = [predicted_ranking, predicted_ranking_two]
reference_set = [reference_ranking, reference_ranking]
result_set = kendall_tau_set(predicted_set, reference_set)
result_set
# {'tau': 0.5909090909090909, 'tau_all_pairs': 132, 'tau_avg_seg': 0.59090909090909094,
#  'tau_avg_seg_prob': 4.2322067287064823e-27, 'tau_concordant': 105, 'tau_discordant': 27,
#  'tau_original_ties': 0, 'tau_predicted_ties': 0, 'tau_predicted_ties_per': 0.0,
#  'tau_prob': 9.1709457975947886e-24, 'tau_sentence_ties': 0, 'tau_sentence_ties_per': 0.0,
#  'tau_valid_pairs': 132}
More set-level metrics: import mrr and avg_ndgc_err from the same module.
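Again, a hedged sketch; I'm assuming these mirror kendall_tau_set's (predicted_set, reference_set) signature.

from evaluation.ranking.set import mrr, avg_ndgc_err

# Assumption: same call signature as kendall_tau_set.
mrr_result = mrr(predicted_set, reference_set)
avg_err_result = avg_ndgc_err(predicted_set, reference_set)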
This is a pretty simple example, but it should be enough to get you started. In real-world applications the entries in your reference set are likely to be unique (instead of duplicated, as in this example), and each one should line up with its corresponding prediction in the predicted set.
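In code, that alignment might look something like this; model_outputs and gold_orderings are hypothetical stand-ins for your real data, and everything else comes from the examples above.

# Hypothetical data: one (prediction, reference) pair per query.
model_outputs = [[2, 1, 3], [1, 3, 2], [3, 2, 1]]
gold_orderings = [[1, 2, 3], [3, 1, 2], [2, 3, 1]]

# Keep the two sets index-aligned: prediction i is scored against reference i.
predicted_set = [Ranking(p) for p in model_outputs]
reference_set = [Ranking(r) for r in gold_orderings]
result_set = kendall_tau_set(predicted_set, reference_set)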
If your reference does turn out to be some simple transformation of your input, you should probably use some kind of Cross-Validation.
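Here's one rough way to set that up, using scikit-learn's KFold. scikit-learn isn't part of RankEval and the pairs list is a made-up stand-in, so treat this as a sketch rather than gospel.

from sklearn.model_selection import KFold

# Hypothetical stand-in: one (predicted_list, reference_list) pair per item.
pairs = [([2, 1, 3], [1, 2, 3])] * 20

# Score each held-out fold separately instead of the whole set at once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for _, test_idx in kf.split(pairs):
    held_out = [pairs[i] for i in test_idx]
    predicted_set = [Ranking(p) for p, _ in held_out]
    reference_set = [Ranking(r) for _, r in held_out]
    print(kendall_tau_set(predicted_set, reference_set)['tau'])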
May your beards always be full and your frames thick.