Method or Reflections analysis on *.genius.com.¶

For genius.com by Max Klein¶

4 November 2014¶

What are we analysing? An introduction to the Method of Reflections¶

There are two tables below one concerning the origin of the user-"quality" and the other of article-"quality". Let us be clear, "quality" is determined by our exogenous metrics, in this case user-contribution totals, and article-contributor totals (over all *.genius subdomains). When we talk about the origin of the quality, we are discussing which variables we found most important in building a "Method of Reflections" ranking (loosely Google PageRank simulation) of the subdomain network that most closely matches the rankings given by the exogenous.

For instance for the users in lit.genius we find a Spearman $\rho$ correlation of about 0.6. That means we can predict user-contribution activity with a 60% correlation (and probably much higher with full data), just from knowing which users edit which texts (without knowing how much they edit each one).

As fantastical as that sounds, the result comes from an economics insight that the best economies produce the rarest products, and also ubiquitous ones - whereas worse economies produce only ubiquitous products, and hardly ever rare ones. The genius network is similar (see the matrix visualisations below), the top users edit the widest range of texts, and and casual users tend to edit texts that have already been annotated by others.

Going back to the lit.genius example, or method of reflection simulation is dependent on two inputs: $\alpha$, the importance of text-quality, and $\beta$ the importance of user-quality. Of the (25 $\alpha$ values * 25 $\beta$ values)=625 method of reflections simulations, we look for the one with the highest rank correlation, and then consider the $\alpha , \beta$ inputs that got us there (see grid search results below).

Now $\alpha, and \beta$ are negative exponents in our model see equation 13, so the lower or more negative they are the more important they are (think golf). Now if I say that user-quality increases close to linearly, $\beta = 0$, (or in some cases super-linearly $\beta< 0$) with regards to the quality of the editors editing the same texts as you, then we can consider that a "collaborative" subdomain.

Other analyses are available from this data - the importance of neighbouring text quality in predicting user quality, or the importance of neighbouring text quality in predicting text quality. However I argue that the $\beta$ measure for users, is the one that interprets best as collaborativeness.

Limitations in rank correlations:¶

The rank correlations are important here because they determine the underlying validity of the following results. In general they are lower than the correlations I have encountered on Wikipedia. However they range from about $(0.3, 0.6)$ compare to $(0.5, 0.9)$ on Wikipedia. The results presented here are statistically significant, so I would still consider this analysis valid and meaningful. The reasons this analysis could be lower are: (a) limited data supplied by genius (b) poorer exogenous metrics, IQ may be better or (c) genius users behave differently from other socio-technical websites.

(By the way, I encountered some errors that user_ids, were the same as text_ids. So in this analysis user ids have a prepended "u" and text ids a prepended "t".)

Results:¶

Most collaborative¶

Sports, X, and Law, show themselves as the top 3 most collaborative with $\beta$ scores of $-0.72, 0.24, and, 0.4$ respectively . Sports in specific has a negative $\beta$ measure which means that user-quality is predicted super-linearly with regards the neighbouring users. I would predict that there are a group of users that have somehow organised themselves there. The only Wikipedia case found with a negative $\beta$ was US. Military History, which has a dedicated mailing list. Notice also from the text-user matrices that this is one of the tallest matrices, so there are relatively many texts edited per user compared to the others.

X doing very well however is a surprising result to because one would imagine that the "no specific subject nature" might make it more jungle-like and thus less collaborative. The results is actually quite well explained by considering the humorous but true "Zeroeth law of Wikipedia: Wikipedia only works in practice in theory it can never work." Counter-intuitively, people collaborate a better with less constraints rather than more. As people are given more freedoms online they respond well due to unrealised incentives. From an Wikipedian's perspective this is makes a lot of sense, that a company can never make decisions for the community as well as the community.

Least collaborative¶

News, History, and Rock clock in as the least collaborative of the subdomains with $\beta$ measures of $1.84, 1.52, and, 1.52$. This doesn't mean that they are necessarily "uncollaborative" only less collaborative than the other subdomains. Essentially user-quality is a sub-linear function of the quality of the neighbouring users. Therefore users are not helping one another - as much - to become better users. A naïve explanation might be that since news-commenting and events-comment already have a precedence online for being unproductive pools of back-and-forth argumentation, users unwittingly transfer this behaviour onto the genius website. Even though genius has a different goal to the comment sections of a news websites, users are preconditioned by those sites to not necessarily want to build together an communal annotation.

Early Conclusions¶

The biggest surprise is that X doing well. This is evidence that an opening up of topics allows users to self-organise better than with a topic constraint. I would present this as push to trusting the user base in defining their own scopes.

Data Results and Visualisations¶

Calibration Data Frames
User-Text Matrices
Calibration Optimization Grid Searches
Example Method of Refections Ranking Convergence

In [271]:

user_df = pd.DataFrame.from_dict(user_calibrations, orient='index')
user_df

Out[271]:

	alpha	beta	rho
history	-1.84	1.52	0.481851
law	0.40	0.40	0.429294
lit	-0.08	0.88	0.594491
news	0.08	1.84	0.353811
pop	-1.68	0.56	0.271077
r-b	-1.52	0.56	0.297127
rap	-1.84	0.56	0.445590
rock	-1.68	1.52	0.476813
screen	-2.00	1.36	0.469442
sports	-2.00	-0.72	0.307567
tech	-2.00	0.88	0.473089
x	-2.00	0.24	0.369506

In [272]:

article_df = pd.DataFrame.from_dict(article_calibrations, orient='index')
article_df

Out[272]:

	alpha	beta	rho
history	0.24	-2.00	0.551461
law	0.40	-1.20	0.401671
lit	1.52	-0.08	0.342453
news	0.24	0.56	0.372917
pop	0.08	0.72	0.368729
r-b	0.08	0.72	0.735256
rap	0.40	0.72	0.262893
rock	0.24	-0.24	0.222322
screen	0.08	-0.88	0.434323
sports	0.08	-2.00	0.376457
tech	0.40	0.56	0.660880
x	0.24	-1.52	0.321988

In [171]:

for name, bipartite_dict in bipartite_dicts.iteritems():
    M, text_dict, user_dict = viz_bipartite(bipartite_dict, name)

    np.save('geniusdata/'+name+'/M.npy', M)
    json.dump(text_dict, open('geniusdata/'+name+'/text_dict.json', 'w'))
    json.dump(user_dict, open('geniusdata/'+name+'/user_dict.json', 'w'))

    text_exogenous_ranks = make_exogenous_ranks(text_dict, all_text_exogenous)
    user_exogenous_ranks = make_exogenous_ranks(user_dict, all_user_exogenous)

    json.dump(text_exogenous_ranks, open('geniusdata/'+name+'/text_exogenous_ranks.json', 'w'))
    json.dump(user_exogenous_ranks, open('geniusdata/'+name+'/user_exogenous_ranks.json', 'w'))
        

<matplotlib.figure.Figure at 0x7f87ef967650>

<matplotlib.figure.Figure at 0x7f87f10eb510>

<matplotlib.figure.Figure at 0x7f87ef3a0550>

<matplotlib.figure.Figure at 0x7f87efd6b2d0>

<matplotlib.figure.Figure at 0x7f87efb0a150>

<matplotlib.figure.Figure at 0x7f87eeae2190>

<matplotlib.figure.Figure at 0x7f87f0feaa90>

<matplotlib.figure.Figure at 0x7f87efae0cd0>

<matplotlib.figure.Figure at 0x7f87f10022d0>

<matplotlib.figure.Figure at 0x7f87eda85390>

<matplotlib.figure.Figure at 0x7f87ebbc7b90>

<matplotlib.figure.Figure at 0x7f87ef9cbd90>

<matplotlib.figure.Figure at 0x7f87ef8b2e90>

<matplotlib.figure.Figure at 0x7f87ecf7c110>

<matplotlib.figure.Figure at 0x7f87ed8ea4d0>

<matplotlib.figure.Figure at 0x7f87ef9ba8d0>

<matplotlib.figure.Figure at 0x7f87efb1b690>

<matplotlib.figure.Figure at 0x7f87ebbc7650>

<matplotlib.figure.Figure at 0x7f87ef3aa590>

<matplotlib.figure.Figure at 0x7f87ef8b2a10>

<matplotlib.figure.Figure at 0x7f87eec3b5d0>

<matplotlib.figure.Figure at 0x7f87ef3b4610>

<matplotlib.figure.Figure at 0x7f87f0d10310>

<matplotlib.figure.Figure at 0x7f87ed8ea590>

In [266]:

article_calibrations = dict()
for name, data in subdomain_data.iteritems():
    calibrations = calibrate(data, 'articles', name)
    print name, calibrations
    article_calibrations[name] = calibrations

r-b {'alpha': 0.08, 'beta': 0.72, 'rho': 0.73525631322687313}
screen {'alpha': 0.08, 'beta': -0.88, 'rho': 0.43432251032103325}
pop {'alpha': 0.08, 'beta': 0.72, 'rho': 0.3687289516535695}
sports {'alpha': 0.08, 'beta': -2.0, 'rho': 0.37645660335704706}
lit {'alpha': 1.52, 'beta': -0.08, 'rho': 0.34245319801714619}
tech {'alpha': 0.4, 'beta': 0.56, 'rho': 0.6608802421201827}
x {'alpha': 0.24, 'beta': -1.52, 'rho': 0.32198843556375556}
rap {'alpha': 0.4, 'beta': 0.72, 'rho': 0.26289283884165071}
rock {'alpha': 0.24, 'beta': -0.24, 'rho': 0.22232154708118684}
news {'alpha': 0.24, 'beta': 0.56, 'rho': 0.37291663312294737}
law {'alpha': 0.4, 'beta': -1.2, 'rho': 0.40167098257037015}
history {'alpha': 0.24, 'beta': -2.0, 'rho': 0.55146056262305387}

In [265]:

user_calibrations = dict()
for name, data in subdomain_data.iteritems():
    calibrations = calibrate(data, 'users', name)
    print name, calibrations
    user_calibrations[name] = calibrations

r-b {'alpha': -1.52, 'beta': 0.56, 'rho': 0.29712697513930469}
screen {'alpha': -2.0, 'beta': 1.36, 'rho': 0.46944201868643853}
pop {'alpha': -1.68, 'beta': 0.56, 'rho': 0.27107665067574438}
sports {'alpha': -2.0, 'beta': -0.72, 'rho': 0.30756706463723915}
lit {'alpha': -0.08, 'beta': 0.88, 'rho': 0.59449131632269803}
tech {'alpha': -2.0, 'beta': 0.88, 'rho': 0.4730886363885195}
x {'alpha': -2.0, 'beta': 0.24, 'rho': 0.36950582819547689}
rap {'alpha': -1.84, 'beta': 0.56, 'rho': 0.44559011704436363}
rock {'alpha': -1.68, 'beta': 1.52, 'rho': 0.47681330924442988}
news {'alpha': 0.08, 'beta': 1.84, 'rho': 0.3538108376542729}
law {'alpha': 0.4, 'beta': 0.4, 'rho': 0.42929372670471688}
history {'alpha': -1.84, 'beta': 1.52, 'rho': 0.48185092927104095}

Massaging Data¶

In [194]:

import json
import numpy as np
import networkx as nx
from collections import defaultdict
import operator
import pandas as pd
import scipy.stats as ss

Take a look at the attached files. They are:

tag_dictionary.json
text_dictionary.json - maps text ids to their tag id, their annotating user ids, and the total number of annotations
user_dictionary.json - maps user ids to a measure of total activity on the site```

In [160]:

!ls geniusdata/

history  lit   pop  r-b   screen  tag_dictionary.json  text_dictionary.json  x
law	 news  rap  rock  sports  tech		       user_dictionary.json

In [161]:

genius_data = {prefix : json.load(open('geniusdata/%s_dictionary.json' % prefix, 'r')) for prefix in ['tag','text','user']}

A quick verification that the data is consistent

In [162]:

users = set()
for text_id, text_dict in genius_data['text'].iteritems():
    for user in text_dict['annotating_users']:
        users.add(user)
        
print len(users) == len(genius_data['user'])

True

In [163]:

#need to reverse the tag dict
id_to_tag = dict()
for name, tag_id in genius_data['tag'].iteritems():
    id_to_tag[tag_id] = name

In [164]:

subdomains = defaultdict(dict)
for text_id, text_dict in genius_data['text'].iteritems():
    tag_id = text_dict['tag_id']
    tag_name = id_to_tag[tag_id]
    subdomains[tag_name][text_id] = text_dict
    

In [165]:

print {subdomain_name: len(subdomain_dict) for subdomain_name, subdomain_dict in subdomains.iteritems()}

{u'lit': 1000, u'screen': 718, u'pop': 1000, u'sports': 1000, u'r-b': 247, u'tech': 71, u'rap': 1000, u'rock': 1000, u'x': 961, u'news': 1000, u'law': 153, u'history': 507}

hmm looks like someone wrote some SQL query that limited something to 1000 items. I wonder if 'x' and 'history' actually have less than 1000 articles in them?

In [166]:

def make_bipartite_dict(subdomain_dict):
    return {'t'+text_id: ['u'+str(user) for user in  text_dict['annotating_users']] for text_id, text_dict in subdomain_dict.iteritems()}

In [167]:

bipartite_dicts = {subdomain_name: make_bipartite_dict(subdomain_dict) for subdomain_name, subdomain_dict in subdomains.iteritems()}

In [168]:

def viz_bipartite(bipartite, name):
    bipartite_G = nx.Graph()
    text_encountered=list()
    user_encountered=list()

    for text, user_list in bipartite.iteritems():
        text_encountered.append(text)
        for user in user_list:
            bipartite_G.add_edge(text, user)
            if user not in user_encountered:
                user_encountered.append(user)
    
            
    M = nx.algorithms.bipartite.basic.biadjacency_matrix(G=bipartite_G, row_order=text_encountered, column_order=user_encountered)

    #i wish numpy wasn't so procedural so I didn't have to do this, pandas does it right by return the object instead of doing it in place
    return_M = M.copy()
    
    text_dict = {text: text_encountered.index(text) for text in text_encountered}
    user_dict = {user: user_encountered.index(user) for user in user_encountered}

    fig = imshow(M, cmap=plt.cm.gray_r, interpolation='nearest')
    fig.axes.set_title(' "Raw" Text-User Matrix for %s' % name )
    fig.axes.set_xlabel('users')
    fig.axes.set_ylabel('texts')
    plt.figure(figsize(15,15))
    plt.show()
    
    
    M.sort(axis=0)
    M.sort(axis=1)
    
    fig = imshow(M, cmap=plt.cm.gray_r, interpolation='nearest')
    fig.axes.set_xlim(fig.axes.get_xlim()[::-1]) 
    fig.axes.set_title(' "Sorted" Text-User Matrix for %s' % name )
    fig.axes.set_xlabel('users')
    fig.axes.set_ylabel('texts')
    plt.figure(figsize(15,15))
    plt.show()
    return return_M, text_dict, user_dict

In [169]:

all_user_exogenous = {'u'+user_id: activity for user_id, activity in genius_data['user'].iteritems()}
all_text_exogenous = {'t'+text_id: text_dict['annotations_count'] for text_id, text_dict in genius_data['text'].iteritems()}

In [170]:

def make_exogenous_ranks(specific_dict, all_dict):
    exogenous_scores = [(user_id, score) for user_id, score in all_dict.iteritems() if user_id in specific_dict.keys()]
    exogenous_scores_order = sorted(exogenous_scores, key=lambda tup: tup[1])
    exogenous_ranks_order  = [tup[0] for tup in exogenous_scores_order]
    exogenous_ranks = zip(exogenous_ranks_order, range(len(exogenous_ranks_order)))
    return exogenous_ranks

In [172]:

'''for subdomain_name, bipartite_dict in bipartite_dicts.iteritems():
    print subdomain_name
    plt.figure(figsize=(10,10))
    bipartite_network = nx.Graph(data=bipartite_dict)
    #TODO colouring the nodetypes
    nx.draw_spring(bipartite_network)
    '''

Out[172]:

'for subdomain_name, bipartite_dict in bipartite_dicts.iteritems():\n    print subdomain_name\n    plt.figure(figsize=(10,10))\n    bipartite_network = nx.Graph(data=bipartite_dict)\n    #TODO colouring the nodetypes\n    nx.draw_spring(bipartite_network)\n    '

Starting Analysis¶

In [173]:

def load_files(folder):
    M = np.load(folder+'M.npy')
    user_dict = json.load(open(folder+'user_dict.json', 'r'))
    article_dict = json.load(open(folder+'text_dict.json', 'r')) 
    user_exogenous_ranks = json.load(open(folder+'user_exogenous_ranks.json', 'r'))
    article_exogenous_ranks = json.load(open(folder+'text_exogenous_ranks.json', 'r'))
    return {'M':M,
            'user_dict':user_dict,
            'article_dict':article_dict,
            'user_exogenous_ranks':user_exogenous_ranks, 
            'article_exogenous_ranks':article_exogenous_ranks}

In [174]:

rap_data = load_files('geniusdata/rap/')

In [175]:

def Gcp_denominateur(M, p, k_c, beta):
    M_p = M[:,p]
    k_c_beta = k_c ** (-1 * beta)
    return np.dot(M_p, k_c_beta)

def Gpc_denominateur(M, c, k_p, alpha):
    M_c = M[c,:]
    k_p_alpha = k_p ** (-1 * alpha)
    return np.dot(M_c, k_p_alpha)


def make_G_hat(M, alpha=1, beta=1):
    '''G hat is Markov chain of length 2
    Gcp is a matrix to go from  contries to product and then 
    Gpc is a matrix to go from products to ccountries'''
    
    k_c  = M.sum(axis=1) #aka k_c summing over the rows
    k_p = M.sum(axis=0) #aka k_p summering over the columns
    
    G_cp = np.zeros(shape=M.shape)
    #Gcp_beta
    for [c, p], val in np.ndenumerate(M):
        numerateur = (M[c,p]) * (k_c[c] ** ((-1) * beta))
        denominateur = Gcp_denominateur(M, p, k_c, beta)
        G_cp[c,p] = numerateur / float(denominateur)
    
    
    G_pc = np.zeros(shape=M.T.shape)
    #Gpc_alpha
    for [p, c], val in np.ndenumerate(M.T):
        numerateur = (M.T[p,c]) * (k_p[p] ** ((-1) * alpha))
        denominateur = Gpc_denominateur(M, c, k_p, alpha)
        G_pc[p,c] = numerateur / float(denominateur)
    
    
    return {'G_cp': G_cp, "G_pc" : G_pc}

def w_generator(M, alpha, beta):
    #this cannot return the zeroeth iteration
    
    G_hat = make_G_hat(M, alpha, beta)
    G_cp = G_hat['G_cp']
    G_pc = G_hat['G_pc']
    #

    fitness_0  = np.sum(M,1)
    ubiquity_0 = np.sum(M,0)
    
    fitness_next = fitness_0
    ubiquity_next = ubiquity_0
    i = 0
    
    while True:
        
        fitness_prev = fitness_next
        ubiquity_prev = ubiquity_next
        i += 1
        
        fitness_next = np.sum( G_cp*ubiquity_prev, axis=1 )
        ubiquity_next = np.sum( G_pc* fitness_prev, axis=1 )
        
        yield {'iteration':i, 'fitness': fitness_next, 'ubiquity': ubiquity_next}
        


def w_stream(M, i, alpha, beta):
    """gets the i'th iteration of reflections of M, 
    but in a memory safe way so we can calculate many generations"""
    if i < 0:
        raise ValueError
    for j in w_generator(M, alpha, beta):
        if j[0] == i:
            return {'fitness': j[1], 'ubiquity': j[2]}
            break
            
def find_convergence(M, alpha, beta, fit_or_ubiq, do_plot=False,):
    '''finds the convergence point (or gives up after 1000 iterations)'''
    if fit_or_ubiq == 'fitness':
        Mshape = M.shape[0]
    elif fit_or_ubiq == 'ubiquity':
        Mshape = M.shape[1]
    
    rankings = list()
    scores = list()
    
    prev_rankdata = np.zeros(Mshape)
    iteration = 0

    for stream_data in w_generator(M, alpha, beta):
        iteration = stream_data['iteration']
        
        data = stream_data[fit_or_ubiq]
        rankdata = data.argsort().argsort()
        
        #test for convergence
        if np.equal(rankdata,prev_rankdata).all():
            break
        if iteration == 1000:
            break
        else:
            rankings.append(rankdata)
            scores.append(data)
            prev_rankdata = rankdata
            
    if do_plot:
        plt.figure(figsize=(iteration/10, Mshape / 20))
        plt.xlabel('Iteration')
        plt.ylabel('Rank, higher is better')
        plt.title('Rank Evolution')
        p = semilogx(range(1,iteration), rankings, '-,', alpha=0.5)
    return {fit_or_ubiq:scores[-1], 'iteration':iteration}

In [176]:

def w_star_analytic(M, alpha, beta, w_star_type):
    k_c  = M.sum(axis=1) #aka k_c summing over the rows
    k_p = M.sum(axis=0) #aka k_p summering over the columns
    
    A = 1
    B = 1
    
    def Gcp_denominateur(M, p, k_c, beta):
        M_p = M[:,p]
        k_c_beta = k_c ** (-1 * beta)
        return np.dot(M_p, k_c_beta)
    
    def Gpc_denominateur(M, c, k_p, alpha):
        M_c = M[c,:]
        k_p_alpha = k_p ** (-1 * alpha)
        return np.dot(M_c, k_p_alpha)
    
    if w_star_type == 'w_star_c':
        w_star_c = np.zeros(shape=M.shape[0])

        for c in range(M.shape[0]):
            summand = Gpc_denominateur(M, c, k_p, alpha)
            k_beta = (k_c[c] ** (-1 * beta))
            w_star_c[c] = A * summand * k_beta

        return w_star_c
    
    elif w_star_type == 'w_star_p':
        w_star_p = np.zeros(shape=M.shape[1])
    
        for p in range(M.shape[1]):
            summand = Gcp_denominateur(M, p, k_c, beta)
            k_alpha = (k_p[p] ** (-1 * alpha))
            w_star_p[p] = B * summand * k_alpha
    
        return w_star_p

In [184]:

#purer python
#score
w_scores = w_star_analytic(M=rap_data['M'], alpha=0.5, beta=0.5, w_star_type='w_star_p')
#identify
w_ranks = {name: w_scores[pos] for name, pos in rap_data['user_dict'].iteritems() }
#sort
w_ranks_sorted = sorted(w_ranks.iteritems(), key=operator.itemgetter(1))

#or use pandas
w_scores_df = pd.DataFrame.from_dict(w_ranks, orient='index')
w_scores_df.columns = ['w_score']
w_scores_df.sort(columns=['w_score'], ascending=False).head()

Out[184]:

	w_score
u1066901	4.648528
u112706	4.008362
u1200573	3.915877
u743272	3.797359
u1105397	3.509412

In [186]:

convergence = find_convergence(M=rap_data['M'], alpha=0.5, beta=0.5, fit_or_ubiq='fitness', do_plot=True)

In [245]:

'''I'm sure this can be done much more elegantly
but this was sort-of drink-a-lot-of-coffee-one-afternoon-and-get-it-done
cleaning this up is an exercise for the reader'''

def rank_comparison(a_ranks_sorted, b_ranks_sorted, do_plot=False):
    a_list = list()
    b_list = list()
    for atup in a_ranks_sorted:
        aiden = atup[0]
        apos = atup[1]
        #find this in our other list
        for btup in b_ranks_sorted:
            biden = btup[0]
            bpos = btup[1]
            if aiden == biden:
                a_list.append(apos)
                b_list.append(bpos)
    if do_plot:    
        plt.figure(figsize=(10,20))
        plot([1,2], [a_list, b_list], '-o')
        plt.show()
    
    return ss.spearmanr(a_list, b_list)

def calibrate_analytic(M, ua, exogenous_ranks_sorted, user_or_art_dict, index_function, title, do_plot=False):
    
    if ua == 'users':
        w_star_type = 'w_star_p'
    elif ua == 'articles':
        w_star_type = 'w_star_c'
    
    squarelen = range(0,25)
    
    alpha_range = map(index_function,squarelen)
    beta_range = map(index_function,squarelen)
    landscape = np.zeros(shape=(len(list(alpha_range)),len(list(beta_range))))

    top_spearman = {'spearman':None,'alpha':None, 'beta':None, 'ua':ua}

    for alpha_index, alpha in enumerate(alpha_range):
        for beta_index, beta in enumerate(beta_range):
            
            w_converged = w_star_analytic(M, alpha, beta, w_star_type)
            
            w_ranks = {name: w_converged[pos] for name, pos in user_or_art_dict.iteritems() }
            w_ranks_sorted = sorted(w_ranks.iteritems(), key=operator.itemgetter(1))
            
            spearman = rank_comparison(w_ranks_sorted, exogenous_ranks_sorted)

            if spearman[1] < 0.05:
                landscape[alpha_index][beta_index] = spearman[0]
                
                if (not top_spearman['spearman']) or (spearman[0] > top_spearman['spearman']):
                    top_spearman['spearman'] = spearman[0]
                    top_spearman['alpha'] = alpha
                    top_spearman['beta'] = beta
            else:
                landscape[alpha_index][beta_index] = np.nan

    if do_plot:
        plt.figure(figsize=(10,10))
        heatmap = imshow(landscape, interpolation='nearest', vmin=-1, vmax=1)
        #heatmap = plt.pcolor(landscape)
        colorbar = plt.colorbar(heatmap)
        plt.xlabel(r'$ \beta $')
        plt.xticks(squarelen, beta_range, rotation=90)
        plt.ylabel(r'$ \alpha $')
        plt.yticks(squarelen, alpha_range)
        plt.title(title)
        
        landscape_file = open(title+'_landscape.npy', 'w')
        np.save(landscape_file, landscape)
        plt.savefig(title+'_landscape.eps')

    return top_spearman

In [262]:

def calibrate(data, ua, name):
    if ua == 'users':
        exogenous_ranks_sorted, user_or_art_dict = data['user_exogenous_ranks'], data['user_dict']
    else:
        exogenous_ranks_sorted, user_or_art_dict = data['article_exogenous_ranks'], data['article_dict']

    spearman = calibrate_analytic(M=data['M'],
                                       ua=ua,
                                       exogenous_ranks_sorted=exogenous_ranks_sorted,
                                       user_or_art_dict=user_or_art_dict,
                                       index_function=lambda x: (x-12.5)/6.25, 
                                       title='Grid Search for %s correlation of %s' % (ua, name),
                                       do_plot=True)

    return {'rho':spearman['spearman'], 'alpha':spearman['alpha'], 'beta':spearman['beta']}

Analyse Users¶

In [247]:

subdomain_names = !ls geniusdata/

In [248]:

subdomain_data = {name: load_files('geniusdata/%s/' % name) for name in subdomain_names }