Sometimes it's helpful to identify the seminal papers in a field: the ones that introduced revolutionary new concepts or applied them in interesting ways. One way to do this is with bibliographic coupling. If you know one of the archetypal papers in a field, looking at who cites that paper, and what else they cite, should tell you which other works are important.
This notebook implements that in a simple way. It doesn't do any of the data collection, because I haven't worked out how to automate that part. (Maybe Web of Science has an API?)
To get started:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import re
We'll load the data into a pandas dataframe and plot a histogram of the citations by year, both to get a sense of when the citations were made and to make sure that our dataset is reasonably representative of the period the field has been active.
refs = pd.read_csv('Wiki_MCMC.txt', sep='\t', index_col=False, encoding='utf-8')
plt.hist(refs['PY']);
We can pull data for each of the papers citing the seed from their columns in the data table. The other citations need to be parsed from a single text field in the data table. To do this, we'll first split the string into an array of references to other papers (they are delimited with a semicolon ';') and then apply a regular expression to each reference.
We create nodes for each paper and directional links between them representing citations.
There will be some 'bad' nodes, cases in which the DOI was not listed with the citation, and we'll want to throw these out.
A list of the keys is available here: http://tools.medialab.sciences-po.fr/sciencescape/wok_utils.php
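Before running the full loop, it may help to see what the regular expression pulls out of a single reference string in Web of Science 'CR' format. A minimal sketch, using the same pattern as below on a hand-written example citation:

```python
import re

# Same pattern as in the main loop; every group after Author is optional.
pattern = (r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?'
           r'(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;')

# A hand-written example in the 'CR' field format.
cite = ' HASTINGS WK, 1970, BIOMETRIKA, V57, P97, DOI 10.2307/2334940'
m = next(re.finditer(pattern, cite + ';'))
print(m.groupdict())
# → {'Author': 'HASTINGS WK', 'Year': '1970', 'Journal': 'BIOMETRIKA',
#    'Volume': '57', 'Page': '97', 'DOI': '10.2307/2334940'}
```

Appending the trailing ';' is what lets the non-greedy DOI group know where to stop.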
CG = nx.DiGraph()
pattern = r'\s+(?P<Author>.*?),(\s+(?P<Year>\d{4}),)?(\s+(?P<Journal>.*?),)?(\s+V(?P<Volume>\d*?),)?(\s+P(?P<Page>\d*?),)?(\s+DOI\s+(?P<DOI>.*?))?;'
for index, ref in refs.iterrows():
    node1_attr = {'Author': ref['AU'], 'Year': ref['PY'], 'Journal': ref['SO'],
                  'Volume': ref['VL'], 'Page': ref['BP'], 'DOI': ref['DI']}
    CG.add_node(node1_attr['DOI'], attr_dict=node1_attr)
    try:
        for cite in ref['CR'].split(';'):
            matches = re.finditer(pattern, cite + ';')
            for m in matches:
                node2_attr = m.groupdict()
                CG.add_node(node2_attr['DOI'], attr_dict=node2_attr)
                CG.add_edge(node1_attr['DOI'], node2_attr['DOI'])
    except AttributeError:  # rows with no 'CR' field come through as NaN
        pass

# Citations without a DOI end up keyed on NaN; drop those nodes.
for node in [n for n in CG.nodes() if pd.isnull(n)]:
    CG.remove_node(node)
print CG.order(), 'nodes added'
6166 nodes added
What we really care about here is in-degree, the number of times a paper has been cited - not out-degree, the number of times a paper has cited someone else. We can sort the in-degree list and print the nodes of the top ten papers.
The first should clearly be the seed paper - by definition, every row in our dataset referenced this paper. The second and subsequent entries represent the papers most important to the papers that cited the seed. (The language here gets complicated quickly...)
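As a quick sanity check on the convention (edges run from the citing paper to the cited paper), a toy graph with three hypothetical papers confirms that in-degree counts citations received:

```python
import networkx as nx

# Toy graph: an edge u -> v means paper u cites paper v.
T = nx.DiGraph([('A', 'seed'), ('B', 'seed'), ('B', 'A')])

print(T.in_degree('seed'))  # → 2, 'seed' is cited by both A and B
print(T.out_degree('B'))    # → 2, B cites two papers
```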
in_degree_list = pd.Series(CG.in_degree())
in_degree_list.sort(ascending=False)
for citation, degree in in_degree_list[:10].iteritems():
    print degree, CG.node[citation]
227 {'DOI': u'10.1023/A:1020281327116', 'Author': u'Andrieu C', 'Journal': u'MACH LEARN', 'Volume': u'50', 'Year': u'2003', 'Page': u'5'}
32 {'DOI': u'10.2307/2334940', 'Author': u'HASTINGS WK', 'Journal': u'BIOMETRIKA', 'Volume': u'57', 'Year': u'1970', 'Page': u'97'}
29 {'DOI': u'10.1063/1.1699114', 'Author': u'METROPOLIS N', 'Journal': u'J CHEM PHYS', 'Volume': u'21', 'Year': u'1953', 'Page': u'1087'}
22 {'DOI': u'10.1162/jmlr.2003.3.4-5.993', 'Author': u'Blei DM', 'Journal': u'J MACH LEARN RES', 'Volume': u'3', 'Year': u'2003', 'Page': u'993'}
21 {'DOI': u'10.1093/biomet/82.4.711', 'Author': u'Green PJ', 'Journal': u'BIOMETRIKA', 'Volume': u'82', 'Year': u'1995', 'Page': u'711'}
13 {'DOI': u'10.1073/pnas.0307752101', 'Author': u'Griffiths TL', 'Journal': u'P NATL ACAD SCI USA', 'Volume': u'101', 'Year': u'2004', 'Page': u'5228'}
13 {'DOI': u'10.2307/2684568', 'Author': u'CHIB S', 'Journal': u'AM STAT', 'Volume': u'49', 'Year': u'1995', 'Page': u'327'}
11 {'DOI': u'10.1109/78.978374', 'Author': u'Arulampalam MS', 'Journal': u'IEEE T SIGNAL PROCES', 'Volume': u'50', 'Year': u'2002', 'Page': u'174'}
10 {'DOI': u'10.2307/2289776', 'Author': u'GELFAND AE', 'Journal': u'J AM STAT ASSOC', 'Volume': u'85', 'Year': u'1990', 'Page': u'398'}
10 {'DOI': u'10.1214/aos/1176325750', 'Author': u'TIERNEY L', 'Journal': u'ANN STAT', 'Volume': u'22', 'Year': u'1994', 'Page': u'1701'}
Just for kicks, let's put together a network diagram.
There are too many nodes to plot outright, and many of those nodes don't share any citations with other nodes, so they just take up space.
To create a smaller diagram, we'll first take the most-cited nodes (skipping the seed itself), gather the nodes they share edges with, and prune everything outside that neighborhood.
This diagram isn't that helpful, because it's hard to interact with, but it does tell us that a lot of these papers interact with each other, and not just the seed and the top result. There seem to be some clusters that it would be interesting to explore down the road.
Someday, I'll try to make an interactive graph that tells you more about the actual nodes involved and lets us do that sort of exploration.
#Get the first and second connected nodes/edges
first_node_list = in_degree_list[1:30].index
first_edge_list = CG.to_undirected().edges(nbunch=first_node_list)
second_node_list = [edge[1] for edge in first_edge_list]
second_edge_list = CG.to_undirected().edges(nbunch=second_node_list)
#Create a network graph
small = nx.DiGraph(second_edge_list)
print small.order(), 'possible nodes pruned to',
small.remove_nodes_from(set(small.nodes())-set(second_node_list).union(set(first_node_list)))
print small.order()
3451 possible nodes pruned to 187
#plot the subgraph
plt.figure(figsize=(20,20))
nx.draw(small.to_undirected())
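For completeness: the bibliographic coupling strength from the introduction is just the number of references two citing papers share, and it can be read straight off a citation graph like CG. A minimal sketch on a toy graph with hypothetical papers (since CG itself depends on the data file):

```python
import networkx as nx
from itertools import combinations

# Toy citation graph: an edge u -> v means paper u cites paper v.
G = nx.DiGraph()
G.add_edges_from([('A', 'seed'), ('A', 'X'), ('A', 'Y'),
                  ('B', 'seed'), ('B', 'X'),
                  ('C', 'seed'), ('C', 'Y')])

def coupling_strength(G, u, v):
    """Number of references papers u and v have in common."""
    return len(set(G.successors(u)) & set(G.successors(v)))

for u, v in combinations(['A', 'B', 'C'], 2):
    print(u, v, coupling_strength(G, u, v))
# → A B 2
#   A C 2
#   B C 1
```

Ranking pairs of citing papers by this count would surface clusters of papers built on the same foundations, which is another way to chase down the clusters visible in the network diagram.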