Introduction to Non-Personalized Recommenders

The recommendation problem

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

  • Amazon
  • Netflix
  • Facebook
  • Last.fm

What exactly do they do?

Definitions from the literature

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992

In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

Notation

  • $U$ is the set of users in our domain. Its size is $|U|$.
  • $I$ is the set of items in our domain. Its size is $|I|$.
  • $I(u)$ is the set of items that user $u$ has rated.
  • $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet rated by user $u$.
  • $U(i)$ is the set of users that have rated item $i$.
  • $-U(i)$ is the complement of $U(i)$.

Goal of a recommendation system

$$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} \forall{u \in U},\; i^* = \argmax_{i \in -I(u)} [S(u,i)] $$

Problem statement

The recommendation problem in its most basic form is quite simple to define:

|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|

Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.
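
As a minimal sketch of the goal stated earlier, the toy matrix can be held in a pandas DataFrame with NaN for the unknown ratings. The scoring function used here, the mean observed rating of each item, is an assumption chosen for illustration (it is about the simplest non-personalized choice), and the names `R`, `score`, and `recommend` are hypothetical.

```python
import numpy as np
import pandas as pd

nan = np.nan
# The toy ratings matrix from the table above (rows: users, columns: movies).
R = pd.DataFrame(
    [[nan, nan, 4, nan, 1],
     [3, nan, nan, 2, 2],
     [3, nan, nan, nan, nan],
     [nan, 1, 2, 1, 1],
     [nan, nan, nan, nan, nan],
     [2, nan, 2, nan, nan],
     [nan, nan, nan, nan, nan],
     [3, 1, 5, nan, nan],
     [nan, nan, nan, nan, 2]],
    index=['u_%d' % k for k in range(1, 10)],
    columns=['m_%d' % k for k in range(1, 6)],
)

# Assumed scoring function S(u, i): the mean observed rating of item i.
# It ignores u entirely, which is what makes it non-personalized.
score = R.mean(axis=0)

def recommend(user):
    """Return the item in -I(u) (not yet rated by `user`) with the highest score."""
    unrated = R.columns[R.loc[user].isnull().values]
    return score.loc[unrated].idxmax()

print(recommend('u_1'))  # m_1: the best-scored movie among m_1, m_2, m_4
```

Because the score does not depend on the user, every user with the same set of unrated items receives the same recommendation.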

Challenges

Availability of item metadata

Content-based techniques are limited by the amount of metadata that is available to describe an item. There are domains in which feature extraction methods are expensive or time consuming, e.g., processing multimedia data such as graphics, audio/video streams. In the context of grocery items for example, it's often the case that item information is only partial or completely missing. Examples include:

  • Ingredients
  • Nutrition facts
  • Brand
  • Description
  • Country of origin

New user problem

A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.

New item problem

Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.

Data sparsity

When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings to the $|U| \times |I|$ possible ones. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
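
To make the ratio concrete, here is a small sketch that computes it for the toy matrix from the problem statement. The variable names are hypothetical, and storing the observed ratings as a dictionary keyed by (user, item) is just one convenient sparse representation.

```python
# Observed ratings from the toy matrix in the problem statement,
# stored sparsely as {(user, item): rating}.
toy_ratings = {
    ('u_1', 'm_3'): 4, ('u_1', 'm_5'): 1,
    ('u_2', 'm_1'): 3, ('u_2', 'm_4'): 2, ('u_2', 'm_5'): 2,
    ('u_3', 'm_1'): 3,
    ('u_4', 'm_2'): 1, ('u_4', 'm_3'): 2, ('u_4', 'm_4'): 1, ('u_4', 'm_5'): 1,
    ('u_6', 'm_1'): 2, ('u_6', 'm_3'): 2,
    ('u_8', 'm_1'): 3, ('u_8', 'm_2'): 1, ('u_8', 'm_3'): 5,
    ('u_9', 'm_5'): 2,
}
n_users, n_items = 9, 5  # |U| and |I|, counting users with no ratings at all

# density = existing ratings / (|U| * |I|); sparsity is its complement
density = len(toy_ratings) / float(n_users * n_items)
sparsity = 1.0 - density
print('density = %.3f, sparsity = %.3f' % (density, sparsity))
```

Here 16 of the 45 cells are filled, so roughly 64% of the matrix is empty, and two of the nine users (u_5 and u_7) have no ratings at all.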

Flow chart: the big picture

In [3]:
from IPython.display import Image
Image(filename='./imgs/recsys_arch.png')
Out[3]:

The CourseTalk dataset: loading and first look

Loading of the CourseTalk database.

The CourseTalk data is spread across three files. Using the pd.read_table method we load each file:

In [5]:
import pandas as pd

unames = ['user_id', 'username']
users = pd.read_table('./data/users_set.dat',
                      sep='|', header=None, names=unames)

rnames = ['user_id', 'course_id', 'rating']
ratings = pd.read_table('./data/ratings.dat',
                        sep='|', header=None, names=rnames)

mnames = ['course_id', 'title', 'avg_rating', 'workload', 'university', 'difficulty', 'provider']
courses = pd.read_table('./data/cursos.dat',
                       sep='|', header=None, names=mnames)

# show how one of them looks
ratings.head(10)
Out[5]:
user_id course_id rating
0 1 1 5
1 2 1 5
2 3 1 5
3 4 1 5
4 5 1 5
5 6 1 5
6 7 1 5
7 8 1 5
8 9 1 5
9 10 1 5
In [293]:
# show how one of them looks
users[:5]
Out[293]:
user_id username
0 1 patrickdijusto1
1 2 natalya_ivanova
2 3 justineittreim
3 4 ronmay
4 5 paulstock
In [254]:
courses[:5]
Out[254]:
   course_id  title                                              avg_rating  workload         university                  difficulty   provider
0  1          An Introduction to Interactive Programming in ...  4.9         7-10 hours/week  Rice University             Medium       coursera
1  2          Modern & Contemporary American Poetry              4.9         5-9 hours/week   University of Pennsylvania  Easy/medium  coursera
2  3          A Beginner's Guide to Irrational Behavior          4.9         7-10 hours/week  Duke University             Medium       coursera
3  4          Design: Creation of Artifacts in Society           4.9         5-10 hours/week  University of Pennsylvania  Medium       coursera
4  5          Greek and Roman Mythology                          4.9         8-10 hours/week  University of Pennsylvania  Medium       coursera
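
For readers without the data files at hand, the loading step can be tried on an in-memory sample. This is a sketch under assumptions: the two sample rows are made up, and `pd.read_csv` with `sep='|'` is used here as the equivalent of `pd.read_table` with the same separator.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same pipe-delimited, header-less layout.
sample = io.StringIO(u"1|1|5\n2|1|4\n")

# header=None says the file has no header row; names supplies the column labels.
toy_ratings = pd.read_csv(sample, sep='|', header=None,
                          names=['user_id', 'course_id', 'rating'])
print(toy_ratings)
```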

Using pd.merge we get it all into one big DataFrame.

In [6]:
coursetalk = pd.merge(pd.merge(ratings, courses), users)
coursetalk
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2773 entries, 0 to 2772
Data columns (total 10 columns):
user_id       2773  non-null values
course_id     2773  non-null values
rating        2773  non-null values
title         2773  non-null values
avg_rating    2773  non-null values
workload      2773  non-null values
university    2616  non-null values
difficulty    2773  non-null values
provider      2773  non-null values
username      2773  non-null values
dtypes: float64(1), int64(2), object(7)
In [295]:
coursetalk.loc[0]
Out[295]:
user_id                                                       1
course_id                                                     1
rating                                                        5
title         An Introduction to Interactive Programming in ...
avg_rating                                                  4.9
workload                                        7-10 hours/week
university                                      Rice University
difficulty                                               Medium
provider                                               coursera
username                                        patrickdijusto1
Name: 0, dtype: object
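
The nested merge relies on pd.merge's defaults, which are easy to check on a tiny example. The three frames below are hypothetical miniatures of the real tables, not part of the dataset.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
toy_ratings = pd.DataFrame({'user_id': [1, 2, 2],
                            'course_id': [10, 10, 20],
                            'rating': [5, 4, 3]})
toy_courses = pd.DataFrame({'course_id': [10, 20],
                            'title': ['Course A', 'Course B']})
toy_users = pd.DataFrame({'user_id': [1, 2],
                          'username': ['ann', 'bob']})

# With no `on=` argument, merge joins on the columns the two frames share:
# course_id for the inner merge, user_id for the outer one. The default is
# an inner join, so rows without a match on either side are dropped.
toy = pd.merge(pd.merge(toy_ratings, toy_courses), toy_users)
print(toy)
```

Each rating row ends up carrying the course and user attributes alongside it, which is the shape of the coursetalk frame above.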

Collaborative filtering: generalizations of the aggregation function

Non-personalized recommendations

Groupby

The idea of groupby is that of split-apply-combine:

  • split data in an object according to a given key;
  • apply a function to each subset;
  • combine results into a new object.
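
The three steps can be seen on a toy frame. This is a minimal sketch; the data and the name `by_provider` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'provider': ['a', 'a', 'b', 'b', 'b'],
                    'rating': [5, 4, 3, 4, 5]})

# split: one group per provider;
# apply: the mean of the ratings in each group;
# combine: a Series indexed by provider.
by_provider = toy.groupby('provider')['rating'].mean()
print(by_provider)
```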

To get mean course ratings grouped by the provider, we can use the pivot_table method:

In [284]:
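# index= replaces the older rows= keyword; ['rating'] selects the result as a Series
mean_ratings = coursetalk.pivot_table(values='rating', index='provider', aggfunc='mean')['rating']
mean_ratings.sort_values(ascending=False)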
Out[284]:
provider
None            4.562500
coursera        4.527835
edx             4.491620
codecademy      4.450000
udacity         4.241071
udemy           4.200000
open2study      4.083333
khanacademy     4.000000
novoed          3.281250
mruniversity    3.250000
Name: rating, dtype: float64

Now let's filter down to courses that received at least 20 ratings (a completely arbitrary number). To do this, I group the data by title and use size() to get a Series of group sizes for each title:

In [297]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title[:10]
Out[297]:
title
14.73x: The Challenges of Global Poverty                     2
2.01x: Elements of Structures                                2
3.091x: Introduction to Solid State Chemistry                3
6.002x: Circuits and Electronics                            10
6.00x: Introduction to Computer Science and Programming     21
7.00x: Introduction to Biology - The Secret of Life          3
8.02x: Electricity and Magnetism                             3
8.MReVx: Mechanics ReView                                    1
A Beginner&#39;s Guide to Irrational Behavior              147
A Crash Course on Creativity                                 5
dtype: int64
In [298]:
active_titles = ratings_by_title.index[ratings_by_title >= 20]
active_titles[:10]
Out[298]:
Index([u'6.00x: Introduction to Computer Science and Programming', u'A Beginner&#39;s Guide to Irrational Behavior', u'An Introduction to Interactive Programming in Python', u'An Introduction to Operations Management', u'CS-191x: Quantum Mechanics and Quantum Computation', u'CS188.1x Artificial Intelligence', u'Calculus: Single Variable', u'Computing for Data Analysis', u'Critical Thinking in Global Challenges', u'Cryptography I'], dtype=object)

The index of titles receiving at least 20 ratings can then be used to select rows from a table of mean ratings by title, which we compute first:

In [300]:
mean_ratings = coursetalk.pivot_table(values='rating', index='title', aggfunc='mean')['rating']
mean_ratings
Out[300]:
title
14.73x: The Challenges of Global Poverty                        4.250000
2.01x: Elements of Structures                                   4.750000
3.091x: Introduction to Solid State Chemistry                   4.166667
6.002x: Circuits and Electronics                                4.800000
6.00x: Introduction to Computer Science and Programming         4.166667
7.00x: Introduction to Biology - The Secret of Life             4.666667
8.02x: Electricity and Magnetism                                4.333333
8.MReVx: Mechanics ReView                                       5.000000
A Beginner&#39;s Guide to Irrational Behavior                   4.874150
A Crash Course on Creativity                                    3.500000
A History of the World since 1300                               4.318182
A Look at Nuclear Science and Technology                        3.000000
A New History for a New China, 1700-2000: New Data and New Methods, Part 1    0.500000
AIDS                                                            5.000000
Aboriginal Worldviews and Education                             4.333333
...
The Modern World: Global History since 1760        4.775862
The Modern and the Postmodern                      4.777778
The Science of Gastronomy                          4.000000
The Social Context of Mental Health and Illness    4.333333
Think Again: How to Reason and Argue               3.815789
Useful Genetics Part 1                             4.500000
VLSI CAD:  Logic to Layout                         4.500000
Vaccine Trials: Methods and Best Practices         5.000000
Vaccines                                           3.750000
Web Development                                    4.625000
Web Intelligence and Big Data                      3.802326
Women and the Civil Rights Movement                5.000000
Writing for the Web (WriteWeb)                     5.000000
Writing in the Sciences                            4.000000
jQuery                                             4.250000
Name: rating, Length: 211, dtype: float64

With the mean rating computed for each course, we keep only the active titles and order them with the highest rating first.

In [301]:
mean_ratings.loc[active_titles].sort_values(ascending=False)
Out[301]:
title
An Introduction to Interactive Programming in Python            4.915652
Modern &amp; Contemporary American Poetry                       4.901515
Design: Creation of Artifacts in Society                        4.879581
A Beginner&#39;s Guide to Irrational Behavior                   4.874150
Greek and Roman Mythology                                       4.864198
Calculus: Single Variable                                       4.854167
CS188.1x Artificial Intelligence                                4.833333
Machine Learning                                                4.830000
Functional Programming Principles in Scala                      4.822581
Gamification                                                    4.796296
An Introduction to Operations Management                        4.785714
The Modern World: Global History since 1760                     4.775862
Programming Languages                                           4.770833
CS-191x: Quantum Mechanics and Quantum Computation              4.727273
Cryptography I                                                  4.700000
Discrete Optimization                                           4.695652
Introduction to Computer Science                                4.687500
Learn to Program: Crafting Quality Code                         4.585714
Model Thinking                                                  4.578125
Internet History, Technology, and Security                      4.541667
Fantasy and Science Fiction: The Human Mind, Our Modern World    4.522727
Learn to Program: The Fundamentals                              4.303571
6.00x: Introduction to Computer Science and Programming         4.166667
Critical Thinking in Global Challenges                          3.961538
Web Intelligence and Big Data                                   3.802326
Computing for Data Analysis                                     3.187500
Introduction to Finance                                         3.086957
Introduction to Data Science                                    3.060000
Name: rating, dtype: float64
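
The same "mean rating, restricted to sufficiently rated titles" ranking can also be computed in a single groupby pass. The sketch below uses made-up data and a hypothetical `min_ratings` threshold; it is an alternative formulation, not the cells used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 5, 5, 2, 3],
})
min_ratings = 2  # arbitrary threshold, playing the role of the 20 used above

# One row per title with both aggregates side by side.
stats = toy.groupby('title')['rating'].agg(['mean', 'count'])

# Keep titles with enough ratings, then order by mean rating, best first.
top = stats[stats['count'] >= min_ratings].sort_values('mean', ascending=False)
print(top)
```

Title B has a perfect mean but only one rating, so the threshold removes it from the ranking.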

To see the top courses among Coursera students, we can sort by the 'coursera' column in descending order:

In [7]:
mean_ratings = coursetalk.pivot_table('rating', index='title', columns='provider', aggfunc='mean')
mean_ratings[:10]
Out[7]:
provider None codecademy coursera edx khanacademy mruniversity novoed open2study udacity udemy
title
14.73x: The Challenges of Global Poverty NaN NaN NaN 4.250000 NaN NaN NaN NaN NaN NaN
2.01x: Elements of Structures NaN NaN NaN 4.750000 NaN NaN NaN NaN NaN NaN
3.091x: Introduction to Solid State Chemistry NaN NaN NaN 4.166667 NaN NaN NaN NaN NaN NaN
6.002x: Circuits and Electronics NaN NaN NaN 4.800000 NaN NaN NaN NaN NaN NaN
6.00x: Introduction to Computer Science and Programming NaN NaN NaN 4.166667 NaN NaN NaN NaN NaN NaN
7.00x: Introduction to Biology - The Secret of Life NaN NaN NaN 4.666667 NaN NaN NaN NaN NaN NaN
8.02x: Electricity and Magnetism NaN NaN NaN 4.333333 NaN NaN NaN NaN NaN NaN
8.MReVx: Mechanics ReView NaN NaN NaN 5.000000 NaN NaN NaN NaN NaN NaN
A Beginner&#39;s Guide to Irrational Behavior NaN NaN 4.87415 NaN NaN NaN NaN NaN NaN NaN
A Crash Course on Creativity NaN NaN NaN NaN NaN NaN 3.5 NaN NaN NaN
In [303]:
mean_ratings['coursera'].loc[active_titles].sort_values(ascending=False)[:10]
Out[303]:
title
An Introduction to Interactive Programming in Python    4.915652
Modern &amp; Contemporary American Poetry               4.901515
Design: Creation of Artifacts in Society                4.879581
A Beginner&#39;s Guide to Irrational Behavior           4.874150
Greek and Roman Mythology                               4.864198
Calculus: Single Variable                               4.854167
Programming Languages                                   4.850000
Machine Learning                                        4.830000
Functional Programming Principles in Scala              4.822581
Gamification                                            4.796296
Name: coursera, dtype: float64

Now, let's go further! How about ranking the courses by the percentage of their ratings that are 4 or higher?

Let's start with a simple pivoting example that does not involve any aggregation. We can extract a ratings matrix as follows:

In [8]:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = coursetalk.pivot_table(values='rating',
                                        index='user_id',
                                        columns='title')
ratings_mtx_df.iloc[:15, :15]
Out[8]:
title 14.73x: The Challenges of Global Poverty 2.01x: Elements of Structures 3.091x: Introduction to Solid State Chemistry 6.002x: Circuits and Electronics 6.00x: Introduction to Computer Science and Programming 7.00x: Introduction to Biology - The Secret of Life 8.02x: Electricity and Magnetism 8.MReVx: Mechanics ReView A Beginner&#39;s Guide to Irrational Behavior A Crash Course on Creativity A History of the World since 1300 A Look at Nuclear Science and Technology A New History for a New China, 1700-2000: New Data and New Methods, Part 1 AIDS Aboriginal Worldviews and Education
user_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
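
One detail of this pivot is worth a small sketch: pivot_table fills missing (user, title) pairs with NaN and, when a pair occurs more than once, aggregates the duplicates with the mean by default. The toy data below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'title': ['A', 'B', 'A', 'A'],   # user 2 rated course A twice
    'rating': [5.0, 3.0, 4.0, 2.0],
})

# Users become rows and titles become columns; pairs with no rating are NaN.
# The two ratings of A by user 2 collapse into a single cell holding their mean.
mtx = toy.pivot_table(values='rating', index='user_id', columns='title')
print(mtx)
```

One consequence is that counts taken from such a matrix can be smaller than the raw row counts whenever a user has rated the same title more than once.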

Let's extract only the ratings that are 4 or higher.

In [19]:
ratings_gte_4 = ratings_mtx_df[ratings_mtx_df>=4.0]
# ratings below 4 are masked to NaN; show the top-left 15x15 corner

ratings_gte_4.iloc[:15, :15]
Out[19]:
title 14.73x: The Challenges of Global Poverty 2.01x: Elements of Structures 3.091x: Introduction to Solid State Chemistry 6.002x: Circuits and Electronics 6.00x: Introduction to Computer Science and Programming 7.00x: Introduction to Biology - The Secret of Life 8.02x: Electricity and Magnetism 8.MReVx: Mechanics ReView A Beginner&#39;s Guide to Irrational Behavior A Crash Course on Creativity A History of the World since 1300 A Look at Nuclear Science and Technology A New History for a New China, 1700-2000: New Data and New Methods, Part 1 AIDS Aboriginal Worldviews and Education
user_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Now, taking the total number of ratings for each course and the count of 4+ ratings, we can combine them into one DataFrame.

In [90]:
ratings_gte_4_pd = pd.DataFrame({'total': ratings_mtx_df.count(), 'gte_4': ratings_gte_4.count()})
ratings_gte_4_pd.head(10)
Out[90]:
gte_4 total
title
14.73x: The Challenges of Global Poverty 2 2
2.01x: Elements of Structures 2 2
3.091x: Introduction to Solid State Chemistry 2 3
6.002x: Circuits and Electronics 10 10
6.00x: Introduction to Computer Science and Programming 15 21
7.00x: Introduction to Biology - The Secret of Life 3 3
8.02x: Electricity and Magnetism 2 3
8.MReVx: Mechanics ReView 1 1
A Beginner&#39;s Guide to Irrational Behavior 146 147
A Crash Course on Creativity 2 5
In [92]:
ratings_gte_4_pd['gte_4_ratio'] = (ratings_gte_4_pd['gte_4'] * 1.0)/ ratings_gte_4_pd.total
ratings_gte_4_pd.head(10)
Out[92]:
gte_4 total gte_4_ratio
title
14.73x: The Challenges of Global Poverty 2 2 1.000000
2.01x: Elements of Structures 2 2 1.000000
3.091x: Introduction to Solid State Chemistry 2 3 0.666667
6.002x: Circuits and Electronics 10 10 1.000000
6.00x: Introduction to Computer Science and Programming 15 21 0.714286
7.00x: Introduction to Biology - The Secret of Life 3 3 1.000000
8.02x: Electricity and Magnetism 2 3 0.666667
8.MReVx: Mechanics ReView 1 1 1.000000
A Beginner&#39;s Guide to Irrational Behavior 146 147 0.993197
A Crash Course on Creativity 2 5 0.400000
In [86]:
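# itertuples() yields (index, *columns); select the columns explicitly to fix their order
cols = ['gte_4', 'total', 'gte_4_ratio']
ranking = [(title, gte_4, total, score) for title, gte_4, total, score in ratings_gte_4_pd[cols].itertuples()]

# rank by 4+ ratio, then by total ratings, then by number of 4+ ratings
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[3], x[2], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)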
Functional Programming Principles in Scala 31 31 1.0
Introduction to Computer Science 24 24 1.0
Programming Languages 24 24 1.0
Web Development 16 16 1.0
6.002x: Circuits and Electronics 10 10 1.0
Compilers 8 8 1.0
Archaeology&#39;s Dirty Little Secrets 7 7 1.0
How to Build a Startup 7 7 1.0
Introduction to Sociology 7 7 1.0
Stat2.1X: Introduction to Statistics: Descriptive Statistics 7 7 1.0
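
As a cross-check on the ratio, the share of 4+ ratings can also be computed directly from the long-format frame, since the mean of a boolean column is the fraction of True values. This is a sketch on made-up data; note that it counts raw rows, so it can differ slightly from the matrix-based counts when a user rated a title more than once.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'A', 'B', 'B'],
    'rating': [5, 4, 3, 5, 2, 4],
})

# True where the rating is 4 or higher, then the per-title mean of that
# boolean column gives the fraction of 4+ ratings for each title.
is_gte_4 = toy['rating'] >= 4
gte_4_ratio = is_gte_4.groupby(toy['title']).mean()
print(gte_4_ratio)
```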

Now for something simpler: let's count the number of ratings for each course and order by the most-rated first.

In [96]:
ratings_by_title = coursetalk.groupby('title').size()
ratings_by_title.sort_values(ascending=False)[:10]
Out[96]:
title
An Introduction to Interactive Programming in Python    575
Design: Creation of Artifacts in Society                191
A Beginner&#39;s Guide to Irrational Behavior           147
Modern &amp; Contemporary American Poetry               132
An Introduction to Operations Management                 98
Greek and Roman Mythology                                81
Critical Thinking in Global Challenges                   65
Gamification                                             54
Machine Learning                                         50
Web Intelligence and Big Data                            43
dtype: int64

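With this information we can rank the courses by number of ratings first, breaking ties by the percentage of 4+ ratings. Each output line shows the title, the number of 4+ ratings, the total number of ratings, and the ratio.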

In [97]:
# each tuple is (title, gte_4, total, score): most-rated first, then by 4+ ratio
for title, gte_4, total, score in sorted(ranking, key=lambda x: (x[2], x[3], x[1]), reverse=True)[:10]:
    print(title, gte_4, total, score)
An Introduction to Interactive Programming in Python 572 575 0.994782608696
Design: Creation of Artifacts in Society 190 191 0.994764397906
A Beginner&#39;s Guide to Irrational Behavior 146 147 0.993197278912
Modern &amp; Contemporary American Poetry 130 132 0.984848484848
An Introduction to Operations Management 96 98 0.979591836735
Greek and Roman Mythology 80 81 0.987654320988
Critical Thinking in Global Challenges 47 65 0.723076923077
Gamification 52 54 0.962962962963
Machine Learning 48 49 0.979591836735
Web Intelligence and Big Data 26 43 0.604651162791
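
The same multi-key ordering can be expressed without building tuples, by sorting the DataFrame on several columns. The sketch below uses a hypothetical three-course frame with the same column names.

```python
import pandas as pd

toy = pd.DataFrame(
    {'gte_4': [9, 10, 3], 'total': [10, 10, 4]},
    index=['A', 'B', 'C'],
)
toy['gte_4_ratio'] = toy['gte_4'] / toy['total'].astype(float)

# Most-rated first; ties on `total` are broken by the higher 4+ ratio.
ranked = toy.sort_values(['total', 'gte_4_ratio'], ascending=False)
print(ranked)
```

Courses A and B tie on the number of ratings, so B, with the higher ratio, comes first.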

Finally, using the formula we learned, let's find the courses that most often co-occur with the popular MOOC An Introduction to Interactive Programming in Python, using the "(x and y) / x" method. For each course, calculate the percentage of raters of the Python course who also rated that course. Order with the highest percentage first, and voilà, we have the top MOOCs.

In [102]:
course_users = coursetalk.pivot_table('rating', index='title', columns='user_id')
course_users.iloc[:15, :15]
Out[102]:
user_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
title
14.73x: The Challenges of Global Poverty NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2.01x: Elements of Structures NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3.091x: Introduction to Solid State Chemistry NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.002x: Circuits and Electronics NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.00x: Introduction to Computer Science and Programming NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7.00x: Introduction to Biology - The Secret of Life NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8.02x: Electricity and Magnetism NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8.MReVx: Mechanics ReView NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Beginner&#39;s Guide to Irrational Behavior NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Crash Course on Creativity NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A History of the World since 1300 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Look at Nuclear Science and Technology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AIDS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Aboriginal Worldviews and Education NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

First, let's keep only the ratings of the course An Introduction to Interactive Programming in Python, indexed by the users who gave them:

In [122]:
# set_index returns a new frame, which avoids modifying a slice of coursetalk in place
ratings_by_course = coursetalk[coursetalk.title == 'An Introduction to Interactive Programming in Python'].set_index('user_id')

Now, for all other courses, let's keep only the ratings from users who rated the Python course.

In [138]:
their_ids = ratings_by_course.index
their_ratings = course_users[their_ids]
their_ratings.iloc[:15, :15]
Out[138]:
user_id 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
title
14.73x: The Challenges of Global Poverty NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2.01x: Elements of Structures NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3.091x: Introduction to Solid State Chemistry NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.002x: Circuits and Electronics NaN NaN NaN NaN NaN 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN
6.00x: Introduction to Computer Science and Programming NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7.00x: Introduction to Biology - The Secret of Life NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8.02x: Electricity and Magnetism NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8.MReVx: Mechanics ReView NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Beginner&#39;s Guide to Irrational Behavior NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Crash Course on Creativity NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A History of the World since 1300 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A Look at Nuclear Science and Technology NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
A New History for a New China, 1700-2000: New Data and New Methods, Part 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
AIDS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Aboriginal Worldviews and Education NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Dividing the number of users who rated both the Python course and the given course by the total number of users who rated the Python course gives our percentage.

In [158]:
course_count = their_ratings.loc['An Introduction to Interactive Programming in Python'].count()
sims = their_ratings.apply(lambda profile: profile.count() / float(course_count), axis=1)

We order by score, highest first, skipping the first entry, which is the Python course itself.

In [162]:
sims.sort_values(ascending=False)[1:11]
Out[162]:
title
Machine Learning                           0.006957
Cryptography I                             0.006957
Web Development                            0.005217
Python                                     0.005217
Learn to Program: Crafting Quality Code    0.005217
Introduction to Computer Science           0.005217
Human-Computer Interaction                 0.005217
Gamification                               0.005217
Computational Investing, Part I            0.005217
CS-169.1x: Software as a Service           0.005217
dtype: float64
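
Stripped of the DataFrame machinery, the score is a ratio of set sizes: in the notation introduced earlier, $|U(x) \cap U(y)| / |U(x)|$. The sketch below illustrates it on hypothetical toy data; the course names, user ids, and the function name `cooccurrence` are made up for the example.

```python
# Hypothetical toy data: the set of users who rated each course, i.e. U(i).
raters = {
    'Python': {1, 2, 3, 4},
    'Machine Learning': {2, 3, 9},
    'Poetry': {4},
    'Mythology': {7, 8},
}

def cooccurrence(x, y):
    """Fraction of the raters of x who also rated y."""
    return len(raters[x] & raters[y]) / float(len(raters[x]))

# Score every other course against the reference course, highest first.
scores = {y: cooccurrence('Python', y) for y in raters if y != 'Python'}
for title, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print('%s %.2f' % (title, s))
```

The score is not symmetric: swapping x and y changes the denominator, so a very popular reference course tends to produce small values for every other course, as in the output above.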