Sometimes it's helpful to identify the seminal papers in a field: the ones that introduced revolutionary new concepts or applied them in interesting ways. One way to do this is with bibliographic coupling. If you know one of the archetypal papers in a field, looking at who cites that paper, and what else they cite, should tell you which other works are important.
This notebook implements that in a simple way. It doesn't do any of the data collection, because I haven't worked out how to automate that part. (Maybe Web of Science has an API?)
To get started:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import re
We'll load the data into a pandas dataframe and plot a histogram of the citations by year, both to get a sense of when the citations were made and to make sure that our dataset is reasonably representative of the period the field has been active.
refs = pd.read_csv('Wiki_MCMC.txt', sep='\t', index_col=False, encoding='utf-8')
plt.hist(refs['PY']);
We can pull data for each of the papers citing the seed from their columns in the data table. The other citations need to be parsed from a single text field in the data table. To do this, we'll first split the string into an array of references to other papers (they are delimited with a semicolon ';') and then apply a regular expression to each reference.
We create nodes for each paper and directional links between them representing citations.
There will be some 'bad' nodes, cases in which the DOI was not listed with the citation, and we'll want to throw these out.
A list of the keys is available here: http://tools.medialab.sciences-po.fr/sciencescape/wok_utils.php
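Before running the full loop, it may help to see what the regular expression pulls out of a single reference string in Web of Science 'CR' format. A minimal sketch, using the same pattern as below on a hand-written example citation:

```python
import re

# Same pattern as in the main loop; every group after Author is optional.
pattern = (r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?'
           r'(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;')

# A hand-written example in the 'CR' field format.
cite = ' HASTINGS WK, 1970, BIOMETRIKA, V57, P97, DOI 10.2307/2334940'
m = next(re.finditer(pattern, cite + ';'))
print(m.groupdict())
# → {'Author': 'HASTINGS WK', 'Year': '1970', 'Journal': 'BIOMETRIKA',
#    'Volume': '57', 'Page': '97', 'DOI': '10.2307/2334940'}
```

Appending the trailing ';' is what lets the non-greedy DOI group know where to stop.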
CG = nx.DiGraph()
pattern = r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;'
for index, ref in refs.iterrows():
    node1_attr = {'Author': ref['AU'], 'Year': ref['PY'], 'Journal': ref['SO'],
                  'Volume': ref['VL'], 'Page': ref['BP'], 'DOI': ref['DI']}
    CG.add_node(node1_attr['DOI'], attr_dict=node1_attr)
    try:
        for cite in ref['CR'].split(';'):
            matches = re.finditer(pattern, cite + ';')
            for m in matches:
                node2_attr = m.groupdict()
                CG.add_node(node2_attr['DOI'], attr_dict=node2_attr)
                CG.add_edge(node1_attr['DOI'], node2_attr['DOI'])
    except AttributeError:  # rows with no 'CR' field come through as NaN
        pass

# Citations without a DOI end up keyed on NaN; drop those nodes.
for node in [n for n in CG.nodes() if pd.isnull(n)]:
    CG.remove_node(node)
print CG.order(), 'nodes added'
6166 nodes added
What we really care about here is in-degree, the number of times a paper has been cited - not out-degree, the number of times a paper has cited someone else. We can sort the in-degree list and print the nodes of the top ten papers.
The first should clearly be the seed paper - by definition, every row in our dataset referenced this paper. The second and subsequent entries represent the papers most important to the papers that cited the seed. (The language here gets complicated quickly...)
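As a quick sanity check on the convention (edges run from the citing paper to the cited paper), a toy graph with three hypothetical papers confirms that in-degree counts citations received:

```python
import networkx as nx

# Toy graph: an edge u -> v means paper u cites paper v.
T = nx.DiGraph([('A', 'seed'), ('B', 'seed'), ('B', 'A')])

print(T.in_degree('seed'))  # → 2, 'seed' is cited by both A and B
print(T.out_degree('B'))    # → 2, B cites two papers
```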
in_degree_list = pd.Series(CG.in_degree())
in_degree_list.sort(ascending=False)
for citation, degree in in_degree_list[:10].iteritems():
    print degree, CG.node[citation]
227 {'DOI': u'10.1023/A:1020281327116', 'Author': u'Andrieu C', 'Journal': u'MACH LEARN', 'Volume': u'50', 'Year': u'2003', 'Page': u'5'}
32 {'DOI': u'10.2307/2334940', 'Author': u'HASTINGS WK', 'Journal': u'BIOMETRIKA', 'Volume': u'57', 'Year': u'1970', 'Page': u'97'}
29 {'DOI': u'10.1063/1.1699114', 'Author': u'METROPOLIS N', 'Journal': u'J CHEM PHYS', 'Volume': u'21', 'Year': u'1953', 'Page': u'1087'}
22 {'DOI': u'10.1162/jmlr.2003.3.4-5.993', 'Author': u'Blei DM', 'Journal': u'J MACH LEARN RES', 'Volume': u'3', 'Year': u'2003', 'Page': u'993'}
21 {'DOI': u'10.1093/biomet/82.4.711', 'Author': u'Green PJ', 'Journal': u'BIOMETRIKA', 'Volume': u'82', 'Year': u'1995', 'Page': u'711'}
13 {'DOI': u'10.1073/pnas.0307752101', 'Author': u'Griffiths TL', 'Journal': u'P NATL ACAD SCI USA', 'Volume': u'101', 'Year': u'2004', 'Page': u'5228'}
13 {'DOI': u'10.2307/2684568', 'Author': u'CHIB S', 'Journal': u'AM STAT', 'Volume': u'49', 'Year': u'1995', 'Page': u'327'}
11 {'DOI': u'10.1109/78.978374', 'Author': u'Arulampalam MS', 'Journal': u'IEEE T SIGNAL PROCES', 'Volume': u'50', 'Year': u'2002', 'Page': u'174'}
10 {'DOI': u'10.2307/2289776', 'Author': u'GELFAND AE', 'Journal': u'J AM STAT ASSOC', 'Volume': u'85', 'Year': u'1990', 'Page': u'398'}
10 {'DOI': u'10.1214/aos/1176325750', 'Author': u'TIERNEY L', 'Journal': u'ANN STAT', 'Volume': u'22', 'Year': u'1994', 'Page': u'1701'}
Just for kicks, let's put together a network diagram.
There are too many nodes to plot outright, and many of those nodes don't share any citations with other nodes, so they just take up space.
To create a smaller diagram, we'll first take the most-cited nodes (skipping the seed itself), gather the nodes they share edges with, and prune everything outside that neighborhood.
This diagram isn't that helpful, because it's hard to interact with, but it does tell us that a lot of these papers interact with each other, and not just the seed and the top result. There seem to be some clusters that it would be interesting to explore down the road.
Someday, I'll try to make an interactive graph that tells you more about the actual nodes involved and lets us do that sort of exploration.
#Get the first and second connected nodes/edges
first_node_list = in_degree_list[1:30].index
first_edge_list = CG.to_undirected().edges(nbunch=first_node_list)
second_node_list = [edge[1] for edge in first_edge_list]
second_edge_list = CG.to_undirected().edges(nbunch=second_node_list)
#Create a network graph
small = nx.DiGraph(second_edge_list)
print small.order(), 'possible nodes pruned to',
small.remove_nodes_from(set(small.nodes())-set(second_node_list).union(set(first_node_list)))
print small.order()
3451 possible nodes pruned to 187
#plot the subgraph
plt.figure(figsize=(20,20))
nx.draw(small.to_undirected())
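For completeness: the bibliographic coupling strength from the introduction is just the number of references two citing papers share, and it can be read straight off a citation graph like CG. A minimal sketch on a toy graph with hypothetical papers (since CG itself depends on the data file):

```python
import networkx as nx
from itertools import combinations

# Toy citation graph: an edge u -> v means paper u cites paper v.
G = nx.DiGraph()
G.add_edges_from([('A', 'seed'), ('A', 'X'), ('A', 'Y'),
                  ('B', 'seed'), ('B', 'X'),
                  ('C', 'seed'), ('C', 'Y')])

def coupling_strength(G, u, v):
    """Number of references papers u and v have in common."""
    return len(set(G.successors(u)) & set(G.successors(v)))

for u, v in combinations(['A', 'B', 'C'], 2):
    print(u, v, coupling_strength(G, u, v))
# → A B 2
#   A C 2
#   B C 1
```

Ranking pairs of citing papers by this count would surface clusters of papers built on the same foundations, which is another way to chase down the clusters visible in the network diagram.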