Drawing the graph of pages linked when they share editors yields something very messy, and its structure is not pertinent to our goal. On the other hand, the hyperlink graph is not very pertinent either: it captures semantic and logical relationships between content rather than the after-effect of intelligence shaping a knowledge landscape.
When reading a knowledge map for learning, it is important to keep the number of paths/edges low. To that end, we propose here a strategy that reduces the number of edges while preserving a socio-semantic logic and limiting the influence of strong attractors like "Pi" or "Mathematics".
%run "libraries.ipynb"
import networkx as nx
from IPython.display import display, HTML
To achieve this, we start from the initial bipartite graph composed of pages and editors.
# list of pages
pages = map(lambda x: x.strip(), codecs.open("data/pagenames.txt","r", "utf-8-sig").readlines())
# pages-editors bi-partite graph
pages_editors_graph = nx.read_gexf("data/pages-editors.gexf")
# pages graph projection from pages-editors bi-partite graph
pages_graph = nx.read_gexf("data/pages-linked-by-coeditors.gexf")
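As a quick sanity check (a minimal sketch, assuming the data files above are present and page node ids are prefixed with "p:"), we can inspect the sizes of what we just loaded:
# quick sanity check of the loaded data
print "pages listed: %s" % (len(pages))
print "bi-partite graph: %s nodes / %s edges" % (len(pages_editors_graph.nodes()),
                                                 len(pages_editors_graph.edges()))
print "projected pages graph: %s nodes / %s edges" % (len(pages_graph.nodes()),
                                                      len(pages_graph.edges()))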
The main strategy is to rely on in-degrees rather than out-degrees. Big pages like "Pi" or "Mathematics" tend to attract more co-editors; being too central, they would overpower the other pages and break the logic of the paths we propose.
The main hypothesis we propose is that, when recommending the 3 next pages to read after a given page p_i, the best candidates are the 3 neighbouring pages that rank p_i highest in their own co-editor rankings.
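A toy illustration of this reversal (hypothetical page names and co-editor counts, not from our dataset): from the point of view of "Radius", the strong attractor "Pi" ranks it low, while the modest page "Diameter" ranks it first, and it is the latter signal we keep.
# toy illustration with made-up pages and symmetric co-editor counts
toy = {
    "Pi":       {"Circle": 40, "Radius": 10, "Diameter": 2},
    "Circle":   {"Pi": 40, "Radius": 25},
    "Radius":   {"Circle": 25, "Pi": 10, "Diameter": 8},
    "Diameter": {"Radius": 8, "Pi": 2},
}
# where does each neighbour of "Radius" rank it in its own ordering?
for name in toy["Radius"]:
    ordering = sorted(toy[name], key=lambda k: -toy[name][k])
    print "%s ranks Radius at position %s" % (name, ordering.index("Radius") + 1)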
The implementation below is pretty straightforward:
g = nx.DiGraph()
for p in pages:
    # rank the neighbours of p by number of co-editors (descending)
    nb = sorted(pages_graph["p:%s" % (p)].items(),
                key=lambda (k, x): -int(x["coeditors"]))
    for name, info in nb:
        # rank the neighbour's own neighbourhood the same way
        nb_mirror = sorted(pages_graph[name].items(),
                           key=lambda (k, x): -int(x["coeditors"]))
        nb_mirror = [x[0] for x in nb_mirror]
        editors = pages_editors_graph[name]
        info["editors"] = len(editors)
        info["exclusive editors"] = len([n for n in editors
                                         if len(pages_editors_graph[n]) == 1])
        # position of p in the neighbour's own ranking (1-based)
        info["ranking"] = nb_mirror.index("p:%s" % (p)) + 1
    # keep the 3 neighbours that rank the current page highest
    nb2 = sorted(nb, key=lambda x: x[1]["ranking"])
    for name, info in nb2[0:3]:
        g.add_edge(p, name.split(":")[1])
print "reduced graph"
print "============="
print "nodes: %s" % (len(g.nodes()))
print "edges: %s" % (len(g.edges()))
print "reduction: %s/%s (%s)" % (len(g.edges()),
len(pages_editors_graph.edges()),
float(len(g.edges()))/float(len(pages_editors_graph.edges())))
reduced graph
=============
nodes: 303
edges: 909
reduction: 909/39927 (0.0227665489518)
With only 303 nodes, edges are the main source of noise. Our method shows that while keeping every node connected, we can drastically reduce the number of edges, retaining only about 2% of them.
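The connectivity claim is easy to verify with networkx's built-in checks; a minimal sketch, assuming g and pages are still in scope:
# check that every page made it into the reduced graph and that it
# still forms a single (weakly) connected component
print "all pages present: %s" % (len(g.nodes()) == len(pages))
print "weakly connected components: %s" % (nx.number_weakly_connected_components(g))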
In the following section, we use the Louvain partitioning method to check whether the reduced graph structure also keeps important information and therefore gives us more readability on our data.
All the heavy lifting is done by Thomas Aynaud's Python implementation of the Louvain method.
import community
partitions = community.best_partition(g.to_undirected())
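Since Louvain optimizes modularity, we can also quantify the quality of the partition with the modularity score exposed by the same library (a small sketch, not part of the original analysis):
# modularity of the detected partition (bounded by 1.0, higher is better)
print "communities found: %s" % (len(set(partitions.values())))
print "modularity: %s" % (community.modularity(partitions, g.to_undirected()))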
def print_groups(communities):
    html = "<table>"
    for c, ps in communities.iteritems():
        html += "<tr><td style=\"width: 100px; text-align: center; \"><h3>group %s</h3></td><td>" % (c)
        html += ", ".join(map(lambda x: u"<a href=\"http://en.wikipedia.org/wiki/{0}\" target=\"_blank\">{0}</a>".format(x), ps))
        html += "</td></tr>"
    html += "</table>"
    display(HTML(html))
communities = {}
for k, v in partitions.iteritems():
    communities.setdefault(v, []).append(k)
print_groups(communities)
The partition given by the Louvain method is not that helpful in itself, but it does isolate the big generalist/abstract nodes into group 7.
nx.write_gexf(g, "data/reading_maps/pages-coedited-reduced-3.gexf")
We now plot the adjacency matrix combined with the partitions in order to visually check the pertinence of the underlying structure. It previews the possibility of drawing a map where a colouring differentiates the groups of nodes and a spatialization separates them geometrically.
from matplotlib import patches

# order nodes by community so each community forms a block on the diagonal
ordered_nodes = [n for c, ps in communities.iteritems() for n in ps]
adjacency_matrix = nx.to_numpy_matrix(g, nodelist=ordered_nodes)

fig = plt.figure(figsize=(5, 5))  # in inches
plt.imshow(adjacency_matrix, cmap="Greys", interpolation="none")

# draw a red square around each community block
ax = plt.gca()
idx = 0
for i, c in communities.iteritems():
    m = len(c)
    ax.add_patch(patches.Rectangle((idx, idx),
                                   m,  # width
                                   m,  # height
                                   facecolor="none",
                                   edgecolor="red",
                                   linewidth=0.5))
    idx += m
The interactive rendering below is not working yet, but you can check the PDF rendering made with Gephi.
%%html
<div id="graph_container"></div>
%%javascript
require.config({paths: { 'sigma': "http://cdnjs.cloudflare.com/ajax/libs/sigma.js/1.0.3/sigma.require.min"}});
//require.config({paths: {sigmagexf: "https://raw.githubusercontent.com/jacomyal/sigma.js/master/plugins/sigma.parsers.gexf/sigma.parsers.gexf.js"}});
require(["sigma"], function(sigma){
console.log(sigma);
});