There are two tables below one concerning the origin of the user-"quality" and the other of article-"quality". Let us be clear, "quality" is determined by our exogenous metrics, in this case user-contribution totals, and article-contributor totals (over all *.genius subdomains). When we talk about the origin of the quality, we are discussing which variables we found most important in building a "Method of Reflections" ranking (loosely Google PageRank simulation) of the subdomain network that most closely matches the rankings given by the exogenous.
For instance for the users in lit.genius we find a Spearman $\rho$ correlation of about 0.6. That means we can predict user-contribution activity with a 60% correlation (and probably much higher with full data), just from knowing which users edit which texts (without knowing how much they edit each one).
As fantastical as that sounds, the result comes from an economics insight that the best economies produce the rarest products, and also ubiquitous ones - whereas worse economies produce only ubiquitous products, and hardly ever rare ones. The genius network is similar (see the matrix visualisations below), the top users edit the widest range of texts, and and casual users tend to edit texts that have already been annotated by others.
Going back to the lit.genius example, or method of reflection simulation is dependent on two inputs: $\alpha$, the importance of text-quality, and $\beta$ the importance of user-quality. Of the (25 $\alpha$ values * 25 $\beta$ values)=625 method of reflections simulations, we look for the one with the highest rank correlation, and then consider the $\alpha , \beta$ inputs that got us there (see grid search results below).
Now $\alpha, and \beta$ are negative exponents in our model see equation 13, so the lower or more negative they are the more important they are (think golf). Now if I say that user-quality increases close to linearly, $\beta = 0$, (or in some cases super-linearly $\beta< 0$) with regards to the quality of the editors editing the same texts as you, then we can consider that a "collaborative" subdomain.
Other analyses are available from this data - the importance of neighbouring text quality in predicting user quality, or the importance of neighbouring text quality in predicting text quality. However I argue that the $\beta$ measure for users, is the one that interprets best as collaborativeness.
The rank correlations are important here because they determine the underlying validity of the following results. In general they are lower than the correlations I have encountered on Wikipedia. However they range from about $(0.3, 0.6)$ compare to $(0.5, 0.9)$ on Wikipedia. The results presented here are statistically significant, so I would still consider this analysis valid and meaningful. The reasons this analysis could be lower are: (a) limited data supplied by genius (b) poorer exogenous metrics, IQ may be better or (c) genius users behave differently from other socio-technical websites.
(By the way, I encountered some errors that user_ids, were the same as text_ids. So in this analysis user ids have a prepended "u" and text ids a prepended "t".)
Sports, X, and Law, show themselves as the top 3 most collaborative with $\beta$ scores of $-0.72, 0.24, and, 0.4$ respectively . Sports in specific has a negative $\beta$ measure which means that user-quality is predicted super-linearly with regards the neighbouring users. I would predict that there are a group of users that have somehow organised themselves there. The only Wikipedia case found with a negative $\beta$ was US. Military History, which has a dedicated mailing list. Notice also from the text-user matrices that this is one of the tallest matrices, so there are relatively many texts edited per user compared to the others.
X doing very well however is a surprising result to because one would imagine that the "no specific subject nature" might make it more jungle-like and thus less collaborative. The results is actually quite well explained by considering the humorous but true "Zeroeth law of Wikipedia: Wikipedia only works in practice in theory it can never work." Counter-intuitively, people collaborate a better with less constraints rather than more. As people are given more freedoms online they respond well due to unrealised incentives. From an Wikipedian's perspective this is makes a lot of sense, that a company can never make decisions for the community as well as the community.
News, History, and Rock clock in as the least collaborative of the subdomains with $\beta$ measures of $1.84, 1.52, and, 1.52$. This doesn't mean that they are necessarily "uncollaborative" only less collaborative than the other subdomains. Essentially user-quality is a sub-linear function of the quality of the neighbouring users. Therefore users are not helping one another - as much - to become better users. A naïve explanation might be that since news-commenting and events-comment already have a precedence online for being unproductive pools of back-and-forth argumentation, users unwittingly transfer this behaviour onto the genius website. Even though genius has a different goal to the comment sections of a news websites, users are preconditioned by those sites to not necessarily want to build together an communal annotation.
The biggest surprise is that X doing well. This is evidence that an opening up of topics allows users to self-organise better than with a topic constraint. I would present this as push to trusting the user base in defining their own scopes.
user_df = pd.DataFrame.from_dict(user_calibrations, orient='index')
user_df
alpha | beta | rho | |
---|---|---|---|
history | -1.84 | 1.52 | 0.481851 |
law | 0.40 | 0.40 | 0.429294 |
lit | -0.08 | 0.88 | 0.594491 |
news | 0.08 | 1.84 | 0.353811 |
pop | -1.68 | 0.56 | 0.271077 |
r-b | -1.52 | 0.56 | 0.297127 |
rap | -1.84 | 0.56 | 0.445590 |
rock | -1.68 | 1.52 | 0.476813 |
screen | -2.00 | 1.36 | 0.469442 |
sports | -2.00 | -0.72 | 0.307567 |
tech | -2.00 | 0.88 | 0.473089 |
x | -2.00 | 0.24 | 0.369506 |
article_df = pd.DataFrame.from_dict(article_calibrations, orient='index')
article_df
alpha | beta | rho | |
---|---|---|---|
history | 0.24 | -2.00 | 0.551461 |
law | 0.40 | -1.20 | 0.401671 |
lit | 1.52 | -0.08 | 0.342453 |
news | 0.24 | 0.56 | 0.372917 |
pop | 0.08 | 0.72 | 0.368729 |
r-b | 0.08 | 0.72 | 0.735256 |
rap | 0.40 | 0.72 | 0.262893 |
rock | 0.24 | -0.24 | 0.222322 |
screen | 0.08 | -0.88 | 0.434323 |
sports | 0.08 | -2.00 | 0.376457 |
tech | 0.40 | 0.56 | 0.660880 |
x | 0.24 | -1.52 | 0.321988 |
for name, bipartite_dict in bipartite_dicts.iteritems():
M, text_dict, user_dict = viz_bipartite(bipartite_dict, name)
np.save('geniusdata/'+name+'/M.npy', M)
json.dump(text_dict, open('geniusdata/'+name+'/text_dict.json', 'w'))
json.dump(user_dict, open('geniusdata/'+name+'/user_dict.json', 'w'))
text_exogenous_ranks = make_exogenous_ranks(text_dict, all_text_exogenous)
user_exogenous_ranks = make_exogenous_ranks(user_dict, all_user_exogenous)
json.dump(text_exogenous_ranks, open('geniusdata/'+name+'/text_exogenous_ranks.json', 'w'))
json.dump(user_exogenous_ranks, open('geniusdata/'+name+'/user_exogenous_ranks.json', 'w'))
<matplotlib.figure.Figure at 0x7f87ef967650>
<matplotlib.figure.Figure at 0x7f87f10eb510>
<matplotlib.figure.Figure at 0x7f87ef3a0550>
<matplotlib.figure.Figure at 0x7f87efd6b2d0>
<matplotlib.figure.Figure at 0x7f87efb0a150>
<matplotlib.figure.Figure at 0x7f87eeae2190>
<matplotlib.figure.Figure at 0x7f87f0feaa90>
<matplotlib.figure.Figure at 0x7f87efae0cd0>
<matplotlib.figure.Figure at 0x7f87f10022d0>
<matplotlib.figure.Figure at 0x7f87eda85390>
<matplotlib.figure.Figure at 0x7f87ebbc7b90>
<matplotlib.figure.Figure at 0x7f87ef9cbd90>
<matplotlib.figure.Figure at 0x7f87ef8b2e90>
<matplotlib.figure.Figure at 0x7f87ecf7c110>
<matplotlib.figure.Figure at 0x7f87ed8ea4d0>
<matplotlib.figure.Figure at 0x7f87ef9ba8d0>
<matplotlib.figure.Figure at 0x7f87efb1b690>
<matplotlib.figure.Figure at 0x7f87ebbc7650>
<matplotlib.figure.Figure at 0x7f87ef3aa590>
<matplotlib.figure.Figure at 0x7f87ef8b2a10>
<matplotlib.figure.Figure at 0x7f87eec3b5d0>
<matplotlib.figure.Figure at 0x7f87ef3b4610>
<matplotlib.figure.Figure at 0x7f87f0d10310>
<matplotlib.figure.Figure at 0x7f87ed8ea590>
article_calibrations = dict()
for name, data in subdomain_data.iteritems():
calibrations = calibrate(data, 'articles', name)
print name, calibrations
article_calibrations[name] = calibrations
r-b {'alpha': 0.08, 'beta': 0.72, 'rho': 0.73525631322687313} screen {'alpha': 0.08, 'beta': -0.88, 'rho': 0.43432251032103325} pop {'alpha': 0.08, 'beta': 0.72, 'rho': 0.3687289516535695} sports {'alpha': 0.08, 'beta': -2.0, 'rho': 0.37645660335704706} lit {'alpha': 1.52, 'beta': -0.08, 'rho': 0.34245319801714619} tech {'alpha': 0.4, 'beta': 0.56, 'rho': 0.6608802421201827} x {'alpha': 0.24, 'beta': -1.52, 'rho': 0.32198843556375556} rap {'alpha': 0.4, 'beta': 0.72, 'rho': 0.26289283884165071} rock {'alpha': 0.24, 'beta': -0.24, 'rho': 0.22232154708118684} news {'alpha': 0.24, 'beta': 0.56, 'rho': 0.37291663312294737} law {'alpha': 0.4, 'beta': -1.2, 'rho': 0.40167098257037015} history {'alpha': 0.24, 'beta': -2.0, 'rho': 0.55146056262305387}
user_calibrations = dict()
for name, data in subdomain_data.iteritems():
calibrations = calibrate(data, 'users', name)
print name, calibrations
user_calibrations[name] = calibrations
r-b {'alpha': -1.52, 'beta': 0.56, 'rho': 0.29712697513930469} screen {'alpha': -2.0, 'beta': 1.36, 'rho': 0.46944201868643853} pop {'alpha': -1.68, 'beta': 0.56, 'rho': 0.27107665067574438} sports {'alpha': -2.0, 'beta': -0.72, 'rho': 0.30756706463723915} lit {'alpha': -0.08, 'beta': 0.88, 'rho': 0.59449131632269803} tech {'alpha': -2.0, 'beta': 0.88, 'rho': 0.4730886363885195} x {'alpha': -2.0, 'beta': 0.24, 'rho': 0.36950582819547689} rap {'alpha': -1.84, 'beta': 0.56, 'rho': 0.44559011704436363} rock {'alpha': -1.68, 'beta': 1.52, 'rho': 0.47681330924442988} news {'alpha': 0.08, 'beta': 1.84, 'rho': 0.3538108376542729} law {'alpha': 0.4, 'beta': 0.4, 'rho': 0.42929372670471688} history {'alpha': -1.84, 'beta': 1.52, 'rho': 0.48185092927104095}
import json
import numpy as np
import networkx as nx
from collections import defaultdict
import operator
import pandas as pd
import scipy.stats as ss
Take a look at the attached files. They are:
tag_dictionary.json
text_dictionary.json - maps text ids to their tag id, their annotating user ids, and the total number of annotations
user_dictionary.json - maps user ids to a measure of total activity on the site```
!ls geniusdata/
history lit pop r-b screen tag_dictionary.json text_dictionary.json x law news rap rock sports tech user_dictionary.json
genius_data = {prefix : json.load(open('geniusdata/%s_dictionary.json' % prefix, 'r')) for prefix in ['tag','text','user']}
A quick verification that the data is consistent
users = set()
for text_id, text_dict in genius_data['text'].iteritems():
for user in text_dict['annotating_users']:
users.add(user)
print len(users) == len(genius_data['user'])
True
#need to reverse the tag dict
id_to_tag = dict()
for name, tag_id in genius_data['tag'].iteritems():
id_to_tag[tag_id] = name
subdomains = defaultdict(dict)
for text_id, text_dict in genius_data['text'].iteritems():
tag_id = text_dict['tag_id']
tag_name = id_to_tag[tag_id]
subdomains[tag_name][text_id] = text_dict
print {subdomain_name: len(subdomain_dict) for subdomain_name, subdomain_dict in subdomains.iteritems()}
{u'lit': 1000, u'screen': 718, u'pop': 1000, u'sports': 1000, u'r-b': 247, u'tech': 71, u'rap': 1000, u'rock': 1000, u'x': 961, u'news': 1000, u'law': 153, u'history': 507}
hmm looks like someone wrote some SQL query that limited something to 1000 items. I wonder if 'x' and 'history' actually have less than 1000 articles in them?
def make_bipartite_dict(subdomain_dict):
return {'t'+text_id: ['u'+str(user) for user in text_dict['annotating_users']] for text_id, text_dict in subdomain_dict.iteritems()}
bipartite_dicts = {subdomain_name: make_bipartite_dict(subdomain_dict) for subdomain_name, subdomain_dict in subdomains.iteritems()}
def viz_bipartite(bipartite, name):
bipartite_G = nx.Graph()
text_encountered=list()
user_encountered=list()
for text, user_list in bipartite.iteritems():
text_encountered.append(text)
for user in user_list:
bipartite_G.add_edge(text, user)
if user not in user_encountered:
user_encountered.append(user)
M = nx.algorithms.bipartite.basic.biadjacency_matrix(G=bipartite_G, row_order=text_encountered, column_order=user_encountered)
#i wish numpy wasn't so procedural so I didn't have to do this, pandas does it right by return the object instead of doing it in place
return_M = M.copy()
text_dict = {text: text_encountered.index(text) for text in text_encountered}
user_dict = {user: user_encountered.index(user) for user in user_encountered}
fig = imshow(M, cmap=plt.cm.gray_r, interpolation='nearest')
fig.axes.set_title(' "Raw" Text-User Matrix for %s' % name )
fig.axes.set_xlabel('users')
fig.axes.set_ylabel('texts')
plt.figure(figsize(15,15))
plt.show()
M.sort(axis=0)
M.sort(axis=1)
fig = imshow(M, cmap=plt.cm.gray_r, interpolation='nearest')
fig.axes.set_xlim(fig.axes.get_xlim()[::-1])
fig.axes.set_title(' "Sorted" Text-User Matrix for %s' % name )
fig.axes.set_xlabel('users')
fig.axes.set_ylabel('texts')
plt.figure(figsize(15,15))
plt.show()
return return_M, text_dict, user_dict
all_user_exogenous = {'u'+user_id: activity for user_id, activity in genius_data['user'].iteritems()}
all_text_exogenous = {'t'+text_id: text_dict['annotations_count'] for text_id, text_dict in genius_data['text'].iteritems()}
def make_exogenous_ranks(specific_dict, all_dict):
exogenous_scores = [(user_id, score) for user_id, score in all_dict.iteritems() if user_id in specific_dict.keys()]
exogenous_scores_order = sorted(exogenous_scores, key=lambda tup: tup[1])
exogenous_ranks_order = [tup[0] for tup in exogenous_scores_order]
exogenous_ranks = zip(exogenous_ranks_order, range(len(exogenous_ranks_order)))
return exogenous_ranks
'''for subdomain_name, bipartite_dict in bipartite_dicts.iteritems():
print subdomain_name
plt.figure(figsize=(10,10))
bipartite_network = nx.Graph(data=bipartite_dict)
#TODO colouring the nodetypes
nx.draw_spring(bipartite_network)
'''
'for subdomain_name, bipartite_dict in bipartite_dicts.iteritems():\n print subdomain_name\n plt.figure(figsize=(10,10))\n bipartite_network = nx.Graph(data=bipartite_dict)\n #TODO colouring the nodetypes\n nx.draw_spring(bipartite_network)\n '
def load_files(folder):
M = np.load(folder+'M.npy')
user_dict = json.load(open(folder+'user_dict.json', 'r'))
article_dict = json.load(open(folder+'text_dict.json', 'r'))
user_exogenous_ranks = json.load(open(folder+'user_exogenous_ranks.json', 'r'))
article_exogenous_ranks = json.load(open(folder+'text_exogenous_ranks.json', 'r'))
return {'M':M,
'user_dict':user_dict,
'article_dict':article_dict,
'user_exogenous_ranks':user_exogenous_ranks,
'article_exogenous_ranks':article_exogenous_ranks}
rap_data = load_files('geniusdata/rap/')
def Gcp_denominateur(M, p, k_c, beta):
M_p = M[:,p]
k_c_beta = k_c ** (-1 * beta)
return np.dot(M_p, k_c_beta)
def Gpc_denominateur(M, c, k_p, alpha):
M_c = M[c,:]
k_p_alpha = k_p ** (-1 * alpha)
return np.dot(M_c, k_p_alpha)
def make_G_hat(M, alpha=1, beta=1):
'''G hat is Markov chain of length 2
Gcp is a matrix to go from contries to product and then
Gpc is a matrix to go from products to ccountries'''
k_c = M.sum(axis=1) #aka k_c summing over the rows
k_p = M.sum(axis=0) #aka k_p summering over the columns
G_cp = np.zeros(shape=M.shape)
#Gcp_beta
for [c, p], val in np.ndenumerate(M):
numerateur = (M[c,p]) * (k_c[c] ** ((-1) * beta))
denominateur = Gcp_denominateur(M, p, k_c, beta)
G_cp[c,p] = numerateur / float(denominateur)
G_pc = np.zeros(shape=M.T.shape)
#Gpc_alpha
for [p, c], val in np.ndenumerate(M.T):
numerateur = (M.T[p,c]) * (k_p[p] ** ((-1) * alpha))
denominateur = Gpc_denominateur(M, c, k_p, alpha)
G_pc[p,c] = numerateur / float(denominateur)
return {'G_cp': G_cp, "G_pc" : G_pc}
def w_generator(M, alpha, beta):
#this cannot return the zeroeth iteration
G_hat = make_G_hat(M, alpha, beta)
G_cp = G_hat['G_cp']
G_pc = G_hat['G_pc']
#
fitness_0 = np.sum(M,1)
ubiquity_0 = np.sum(M,0)
fitness_next = fitness_0
ubiquity_next = ubiquity_0
i = 0
while True:
fitness_prev = fitness_next
ubiquity_prev = ubiquity_next
i += 1
fitness_next = np.sum( G_cp*ubiquity_prev, axis=1 )
ubiquity_next = np.sum( G_pc* fitness_prev, axis=1 )
yield {'iteration':i, 'fitness': fitness_next, 'ubiquity': ubiquity_next}
def w_stream(M, i, alpha, beta):
"""gets the i'th iteration of reflections of M,
but in a memory safe way so we can calculate many generations"""
if i < 0:
raise ValueError
for j in w_generator(M, alpha, beta):
if j[0] == i:
return {'fitness': j[1], 'ubiquity': j[2]}
break
def find_convergence(M, alpha, beta, fit_or_ubiq, do_plot=False,):
'''finds the convergence point (or gives up after 1000 iterations)'''
if fit_or_ubiq == 'fitness':
Mshape = M.shape[0]
elif fit_or_ubiq == 'ubiquity':
Mshape = M.shape[1]
rankings = list()
scores = list()
prev_rankdata = np.zeros(Mshape)
iteration = 0
for stream_data in w_generator(M, alpha, beta):
iteration = stream_data['iteration']
data = stream_data[fit_or_ubiq]
rankdata = data.argsort().argsort()
#test for convergence
if np.equal(rankdata,prev_rankdata).all():
break
if iteration == 1000:
break
else:
rankings.append(rankdata)
scores.append(data)
prev_rankdata = rankdata
if do_plot:
plt.figure(figsize=(iteration/10, Mshape / 20))
plt.xlabel('Iteration')
plt.ylabel('Rank, higher is better')
plt.title('Rank Evolution')
p = semilogx(range(1,iteration), rankings, '-,', alpha=0.5)
return {fit_or_ubiq:scores[-1], 'iteration':iteration}
def w_star_analytic(M, alpha, beta, w_star_type):
k_c = M.sum(axis=1) #aka k_c summing over the rows
k_p = M.sum(axis=0) #aka k_p summering over the columns
A = 1
B = 1
def Gcp_denominateur(M, p, k_c, beta):
M_p = M[:,p]
k_c_beta = k_c ** (-1 * beta)
return np.dot(M_p, k_c_beta)
def Gpc_denominateur(M, c, k_p, alpha):
M_c = M[c,:]
k_p_alpha = k_p ** (-1 * alpha)
return np.dot(M_c, k_p_alpha)
if w_star_type == 'w_star_c':
w_star_c = np.zeros(shape=M.shape[0])
for c in range(M.shape[0]):
summand = Gpc_denominateur(M, c, k_p, alpha)
k_beta = (k_c[c] ** (-1 * beta))
w_star_c[c] = A * summand * k_beta
return w_star_c
elif w_star_type == 'w_star_p':
w_star_p = np.zeros(shape=M.shape[1])
for p in range(M.shape[1]):
summand = Gcp_denominateur(M, p, k_c, beta)
k_alpha = (k_p[p] ** (-1 * alpha))
w_star_p[p] = B * summand * k_alpha
return w_star_p
#purer python
#score
w_scores = w_star_analytic(M=rap_data['M'], alpha=0.5, beta=0.5, w_star_type='w_star_p')
#identify
w_ranks = {name: w_scores[pos] for name, pos in rap_data['user_dict'].iteritems() }
#sort
w_ranks_sorted = sorted(w_ranks.iteritems(), key=operator.itemgetter(1))
#or use pandas
w_scores_df = pd.DataFrame.from_dict(w_ranks, orient='index')
w_scores_df.columns = ['w_score']
w_scores_df.sort(columns=['w_score'], ascending=False).head()
w_score | |
---|---|
u1066901 | 4.648528 |
u112706 | 4.008362 |
u1200573 | 3.915877 |
u743272 | 3.797359 |
u1105397 | 3.509412 |
convergence = find_convergence(M=rap_data['M'], alpha=0.5, beta=0.5, fit_or_ubiq='fitness', do_plot=True)