This code accompanies this blog post.
Source data is available at Journal Metrics and at Directory of Open Access Journals.
Please note that I have not been able to sort the open access and closed access journals perfectly (see the author's note below). There are likely entries that appear in both databases. If you find the bug, please let me know so I can update the results. This bug will affect the results, but I don't expect the influence to be dramatic.
If you reuse any code or figures, please credit Caitlin Rivers and include a link to my work. I'd also love it if you let me know at @cmyeaton.
Open access plots are in green, closed access plots are in blue/purple, and magenta plots show some combination or comparison.
from __future__ import division
import pandas as pd
pd.set_printoptions(max_rows=100, max_columns=10)
from scipy import stats
import matplotlib.pyplot as plt
impact = pd.read_csv('SNIP_SJR_complete_1999_2011new_SNIP_and_SJR_v1_Oct_2012.csv')
open_access = pd.read_csv('open_access_journals.csv')
The ISSN data are somewhat messy. The open_access ISSN numbers have a - in the middle. Some of the impact ISSNs have trailing spaces.
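import re
import numpy as np

def rm_issn_punc(x):
    # Replace punctuation (including the hyphen inside the ISSN) with spaces, then drop the spaces
    punc = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x)
    space = ''.join(punc.split(" "))
    return space

def strip_space(x):
    # Return the ISSN as a string without leading or trailing spaces.
    # It stays a string so that it can still match open_access['issn'] in the merge below.
    return str(x).strip(' ')

def membership(x):
    # True when the ISSN is NOT in the open access list, i.e. the journal is treated as closed
    blean = True
    open_lst = np.array(open_access.issn_)
    if x in open_lst:
        blean = False
    return blean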
open_access['issn'] = open_access['ISSN'].map(rm_issn_punc)
open_access['issn_'] = open_access['issn'].map(strip_space)
impact['issn'] = impact['Print ISSN'].map(strip_space)
impact['closed'] = impact['issn'].map(membership)
closed = impact[impact.closed == True]
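Since the ISSN format is the suspected source of trouble, here is a minimal sketch of a stricter cleaner, written for Python 3 rather than the Python 2 used in this notebook. The function names `normalize_issn` and `has_valid_check_digit` are hypothetical and not part of the original analysis. The sketch assumes that an ISSN is eight characters (seven digits plus a check character that may be X), and that leading zeros may have been lost if the CSV reader parsed a column as numbers.

```python
import re


def normalize_issn(value):
    """Return an ISSN as an 8-character uppercase string, or None if it cannot be one."""
    if value is None:
        return None
    if isinstance(value, float):
        if value != value:  # NaN is the only value not equal to itself
            return None
        if value.is_integer():
            value = int(value)  # 3785955.0 -> 3785955, so no ".0" leaks into the digits
    # Keep only digits and the letter X (the only non-digit allowed, as the check character).
    cleaned = re.sub(r"[^0-9Xx]", "", str(value)).upper()
    if not cleaned or len(cleaned) > 8:
        return None
    # Restore leading zeros that are lost when the ISSN is read as a number.
    return cleaned.zfill(8)


def has_valid_check_digit(issn):
    """Check the last character of a normalized ISSN against the mod-11 check digit rule."""
    if issn is None or not re.fullmatch(r"[0-9]{7}[0-9X]", issn):
        return False
    # Weights 8, 7, ..., 2 apply to the first seven digits.
    total = sum(int(digit) * weight for digit, weight in zip(issn[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return issn[7] == expected
```

Applying the same function to both `impact['Print ISSN']` and `open_access['ISSN']` would give one canonical key for the membership test and the merge, and rows that fail the check digit test would be candidates for manual inspection.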
Merge the open access list and the impact database on ISSN rather than on title, to avoid character encoding differences (e.g. see row 18 below).
matches = pd.merge(impact, open_access, left_on=impact['issn'], right_on=['issn']).drop_duplicates()
matches = matches.dropna(how='all')
matches = matches.drop_duplicates(cols='Title')
matches[['Source Title', 'Title']].head(20)
| | Source Title | Title |
---|---|---|
0 | Revista de Microbiologia | Revista de Microbiologia |
1 | Anais da Academia Brasileira de Ciencias | Anais da Academia Brasileira de Ciencias. |
2 | Acta Adriatica | Acta Adriatica |
3 | Acta Biochimica Polonica | Acta Biochimica Polonica |
4 | Acta Biologica Cracoviensia Series Botanica | Acta Biologica Cracoviensia Series Botanica |
5 | Acta Dermato-Venereologica | Acta Dermato-Venereologica |
6 | Acta Geologica Polonica | Acta Geologica Polonica |
7 | Acta Medica Nagasakiensia | Acta Medica Nagasakiensia |
8 | Acta Palaeobotanica | Acta Palaeobotanica |
9 | Acta Societatis Botanicorum Poloniae | Acta Societatis Botanicorum Poloniae |
10 | Acta Stomatologica Croatica | Acta Stomatologica Croatica |
11 | Acta Veterinaria Brno | Acta Veterinaria Brno |
12 | Acta Zoologica Sinica | Acta Zoologica Sinica |
13 | Afrika Spectrum | Africa Spectrum |
14 | American Journal of Pharmaceutical Education | American Journal of Pharmaceutical Education |
15 | Notices of the American Mathematical Society | Notices of the American Mathematical Society. |
16 | American Museum Novitates | American Museum Novitates |
17 | Bulletin of the American Museum of Natural Histor | Bulletin of the American Museum of Natural Histor |
18 | Analise Social | Análise Social |
19 | Angle Orthodontist | Angle Orthodontist |
The unique ISSNs in closed plus the unique ISSNs in open_access should add up to the unique ISSNs in impact, but they don't (see the counts below). There is clearly a bug somewhere, but I can't find it even after extensive searching. I suspect there is an underlying problem with the format of the ISSNs. Feel free to take a crack at it.
There are likely some open access journals in the closed category, which may make the difference between the two categories less stark than it would otherwise be.
print 'Closed: {0}, Open: {1}, Full list: {2}'.format(len(closed.issn.unique()), len(open_access.issn_.unique()), len(impact.issn.unique()))
Closed: 28235, Open: 8597, Full list: 30774
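One way to narrow down the mismatch is to compare the ISSNs as sets instead of comparing counts. The sketch below uses small hypothetical stand-in tables (`impact_demo`, `open_demo`) and is written for Python 3. It separates three groups: ranked journals not in the open access list, open access journals that have a ranking, and open access journals that are absent from the impact data altogether. Taking the counts printed in this notebook at face value, 8,597 open access journals minus the 6,059 reported later as unranked leaves 2,538 matched journals, and 28,235 + 2,538 = 30,773, which is one short of the 30,774 unique ISSNs in the full list. That suggests, without proving it, that most of the gap comes from open access journals outside the impact data rather than from misformatted ISSNs.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the impact and open_access tables.
impact_demo = pd.DataFrame({"issn": ["00000001", "00000002", "00000003", "00000004"]})
open_demo = pd.DataFrame({"issn_": ["00000002", "00000004", "00000009"]})

open_set = set(open_demo["issn_"])
full_set = set(impact_demo["issn"])

closed_set = full_set - open_set     # ranked journals not found in the open access list
matched_open = full_set & open_set   # open access journals that also have a ranking
unranked_open = open_set - full_set  # open access journals absent from the impact data

# closed_set and matched_open partition the full list; unranked_open sits outside it.
print(len(closed_set), len(matched_open), len(unranked_open), len(full_set))
```

With the real tables, any ISSN that shows up in an unexpected group can be inspected by hand.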
Definitions are from http://www.journalmetrics.com/faq.php
BY SNIP
SNIP, or Source-Normalized Impact per Paper, measures a source’s contextual citation impact. It takes into account characteristics of the source's subject field, especially the frequency at which authors cite other papers in their reference lists, the speed at which citation impact matures, and the extent to which the database used in the assessment covers the field’s literature. SNIP is the ratio of a source's average citation count per paper, and the ‘citation potential’ of its subject field. It aims to allow direct comparison of sources in different subject fields.
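The ratio in that definition can be illustrated with a tiny sketch. The function `snip` and the numbers are hypothetical, chosen only to show how the same raw citation rate yields different scores in fields with different citation potential. The real metric involves further steps that are not reproduced here.

```python
def snip(citations_per_paper, citation_potential):
    """Ratio of a source's average citations per paper to its field's citation potential."""
    if citation_potential <= 0:
        raise ValueError("citation_potential must be positive")
    return citations_per_paper / citation_potential


# Hypothetical numbers: two journals with the same raw citation rate in different fields.
low_citing_field = snip(citations_per_paper=2.0, citation_potential=0.5)
high_citing_field = snip(citations_per_paper=2.0, citation_potential=2.0)
print(low_citing_field, high_citing_field)  # 4.0 1.0
```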
top_impact_all = impact[['Source Title', '2011 SNIP2']].copy()
top_impact_all = pd.DataFrame(top_impact_all.sort('2011 SNIP2', ascending=False).dropna(), columns=['Source Title', '2011 SNIP2'])
top_impact_all['2011 SJR2'] = impact['2011 SJR2']
top_impact_all['Difference'] = top_impact_all['2011 SNIP2'] - top_impact_all['2011 SJR2']
top_impact_all.head(15)
| | Source Title | 2011 SNIP2 | 2011 SJR2 | Difference |
---|---|---|---|---|
4874 | CA - A Cancer Journal for Clinicians | 41.082 | 24.976 | 16.106 |
10365 | Foundations and Trends in Information Retrieval | 32.028 | 10.411 | 21.617 |
26384 | Reviews of Modern Physics | 22.129 | 36.194 | -14.065 |
110 | ACM Computing Surveys | 17.848 | 9.926 | 7.922 |
16434 | Journal of Engineering Education | 16.072 | 1.358 | 14.714 |
22052 | New England Journal of Medicine | 14.971 | 9.740 | 5.231 |
25177 | Progress in Materials Science | 13.535 | 10.127 | 3.408 |
2280 | Annual Review of Psychology | 12.013 | 8.137 | 3.876 |
1004 | Advances in Physics | 11.750 | 26.216 | -14.466 |
12309 | IEEE Communications Surveys and Tutorials | 11.584 | 6.315 | 5.269 |
23931 | Physics Reports | 11.363 | 10.761 | 0.602 |
5538 | Chemical Reviews | 11.350 | 15.866 | -4.516 |
16321 | Journal of Economic Literature | 10.738 | 13.121 | -2.383 |
2257 | Annual Review of Immunology | 10.680 | 31.166 | -20.486 |
25160 | Progress in Energy and Combustion Science | 10.579 | 6.430 | 4.149 |
BY SJR
SJR, or SCImago Journal Rank, is a measure of the scientific prestige of scholarly sources.
SJR assigns relative scores to all of the sources in a citation network. Its methodology is inspired by the Google PageRank algorithm, in that not all citations are equal. A source transfers its own 'prestige', or status, to another source through the act of citing it. A citation from a source with a relatively high SJR is worth more than a citation from a source with a lower SJR.
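To make the prestige transfer idea concrete, here is a minimal sketch of a plain PageRank-style power iteration on a toy citation matrix. It is not the actual SJR algorithm, which has additional normalization steps. The function `prestige_scores`, the damping value, and the four-source network are all assumptions for illustration.

```python
import numpy as np


def prestige_scores(citations, damping=0.85, iterations=100):
    """PageRank-style scores; citations[i, j] is the number of citations from source i to j."""
    citations = np.asarray(citations, dtype=float)
    n = citations.shape[0]
    out_totals = citations.sum(axis=1, keepdims=True)
    safe_totals = np.where(out_totals > 0, out_totals, 1.0)
    # Each source splits its prestige across the sources it cites;
    # a source that cites nothing spreads its prestige evenly.
    transfer = np.where(out_totals > 0, citations / safe_totals, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (scores @ transfer)
    return scores


# Hypothetical network: sources 1 and 2 each receive exactly one citation,
# source 1 from the well-cited source 0 and source 2 from the uncited source 3.
toy = [
    [0, 1, 0, 0],  # source 0 cites source 1
    [1, 0, 0, 0],  # source 1 cites source 0
    [1, 0, 0, 0],  # source 2 cites source 0
    [0, 0, 1, 0],  # source 3 cites source 2
]
scores = prestige_scores(toy)
print(scores.round(3))
```

In this toy network, source 1 ends up with a much higher score than source 2 even though both are cited once, because its single citation comes from a more prestigious source.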
top_impact_all.sort('2011 SJR2', ascending=False).dropna().head(15)
| | Source Title | 2011 SNIP2 | 2011 SJR2 | Difference |
---|---|---|---|---|
26384 | Reviews of Modern Physics | 22.129 | 36.194 | -14.065 |
2257 | Annual Review of Immunology | 10.680 | 31.166 | -20.486 |
1004 | Advances in Physics | 11.750 | 26.216 | -14.466 |
4874 | CA - A Cancer Journal for Clinicians | 41.082 | 24.976 | 16.106 |
2232 | Annual Review of Biochemistry | 8.276 | 23.856 | -15.580 |
21729 | Nature Genetics | 7.211 | 19.919 | -12.708 |
5311 | Cell | 6.579 | 19.779 | -13.200 |
2265 | Annual Review of Neuroscience | 7.609 | 17.014 | -9.405 |
2254 | Annual Review of Genetics | 5.113 | 16.628 | -11.515 |
25646 | Quarterly Journal of Economics | 6.621 | 16.230 | -9.609 |
5538 | Chemical Reviews | 11.350 | 15.866 | -4.516 |
21732 | Nature Materials | 7.960 | 15.413 | -7.453 |
2276 | Annual Review of Plant Biology | 10.257 | 14.740 | -4.483 |
21709 | Nature | 8.647 | 14.548 | -5.901 |
21731 | Nature Immunology | 3.912 | 14.286 | -10.374 |
open_lang = open_access.Language.value_counts().head(10)/len(open_access)*100
open_lang
Language | Share (%) |
---|---|
English | 55.810166 |
Spanish | 6.036990 |
Portuguese | 3.315110 |
Spanish, English | 3.222054 |
English, French | 1.395836 |
Portuguese, English | 1.291148 |
Portuguese, Spanish | 1.116669 |
English, Spanish | 1.058509 |
French | 1.046877 |
English | 0.849133 |
open_lang.plot(kind='bar', title='Most common languages, open access journals (%)', color='green', alpha=.3);
open_keywords = open_access.Keyword.value_counts().head(15)/len(open_access)*100
open_keywords
Keyword | Share (%) |
---|---|
health sciences | 0.988717 |
medicine | 0.523438 |
education | 0.442015 |
mathematics | 0.418751 |
psychology | 0.348959 |
medical sciences | 0.348959 |
human sciences | 0.325695 |
philosophy | 0.279167 |
biological sciences | 0.255903 |
social sciences | 0.232639 |
chemistry | 0.209375 |
agricultural sciences | 0.209375 |
public health | 0.162848 |
law | 0.162848 |
physics | 0.151216 |
open_keywords.plot(kind='bar', title='Most common keywords, open access journals (%)', color='green', alpha=.3);
plt.figure()
open_access['Start Year'].hist(range=(1980, 2012), bins=30, color='green', alpha=.3)
plt.title('Histogram of start year, open access journals');
timeline = open_access.sort('Start Year')
timeline[['Title', 'Start Year', 'End Year']].head(10)
| | Title | Start Year | End Year |
---|---|---|---|
914 | Bijdragen Tot de Taal-, Land- en Volkenkunde van Ne | 1853 | 1948 |
6652 | Psyche : A Journal of Entomology | 1874 | NaN |
2616 | Fishery Bulletin | 1881 | NaN |
1222 | Bulletin of the American Museum of Natural Histor | 1881 | NaN |
7868 | South African Medical Journal | 1884 | NaN |
1225 | Bulletin of the Geological Society of Denmark | 1894 | NaN |
5544 | Memórias do Instituto Oswaldo Cruz. | 1909 | NaN |
4561 | Journal of Genetics | 1910 | NaN |
1232 | Bulletin of the Medical Library Association | 1911 | 2001 |
5773 | Nieuwe West-Indische Gids | 1919 | 1991 |
fee = open_access['Publication fee'].value_counts()/len(open_access)*100
fee
Publication fee | Share (%) |
---|---|
No | 66.418518 |
Yes | 27.939979 |
Conditional | 2.954519 |
Information missing | 2.512504 |
fee.plot(kind='bar', title='Histogram of fee required (%)', color='green', alpha=.3);
There are 6,059 open access journals that do not have a ranking, and will therefore not be included in the analysis.
len(open_access) - len(matches)
6059
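The unranked journals can also be listed rather than only counted. The sketch below assumes a current pandas, where `DataFrame.merge` accepts `indicator=True`, and uses hypothetical miniature tables whose column names mirror the ones in this notebook.

```python
import pandas as pd

# Hypothetical miniature tables; column names mirror the ones used above.
open_demo = pd.DataFrame({"issn": ["00000002", "00000004", "00000009"],
                          "Title": ["B", "D", "Z"]})
impact_demo = pd.DataFrame({"issn": ["00000001", "00000002", "00000004"],
                            "Source Title": ["A", "B", "D"]})

# how="left" keeps every open access journal; indicator=True adds a "_merge" column
# recording whether a ranking row was found for it.
flagged = open_demo.merge(impact_demo, on="issn", how="left", indicator=True)
unranked = flagged[flagged["_merge"] == "left_only"]
print(len(unranked))
print(unranked["Title"].tolist())
```

Looking at the titles of the unmatched rows is a quick way to spot journals that should have matched but did not, for example because only their electronic ISSN appears in one of the lists.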
closed_field = closed[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(closed)*100
open_field = matches[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(matches)*100
closed_field.plot(color='green', kind='bar', alpha=.3);
open_field.plot(color='blue', kind='bar', alpha=.4, title='Comparison of discipline (%)\nGreen=closed, purple=OA');
countries = pd.DataFrame(closed.Country.value_counts(), columns=['closed'])
countries['open'] = open_access.Country.value_counts()
countries['proportion_oa'] = countries['open']/countries['closed']
countries_sorted = countries.sort('proportion_oa', ascending=False)
countries_sorted[countries_sorted.proportion_oa >= 0][:10]
| | closed | open | proportion_oa |
---|---|---|---|
Colombia | 6 | 208 | 34.666667 |
Costa Rica | 1 | 26 | 26.000000 |
Egypt | 20 | 351 | 17.550000 |
Chile | 17 | 142 | 8.352941 |
Indonesia | 6 | 45 | 7.500000 |
Brazil | 124 | 806 | 6.500000 |
Peru | 5 | 29 | 5.800000 |
Cuba | 9 | 50 | 5.555556 |
Venezuela | 16 | 85 | 5.312500 |
Tunisia | 2 | 10 | 5.000000 |
countries.closed.head(10).plot(color='green', kind='bar', alpha=.3)
countries.open.head(10).plot(color='blue', kind='bar', alpha=.4, title='Comparison of top journal producers (number of journals)\nGreen=closed, purple=OA');
countries_sorted
<class 'pandas.core.frame.DataFrame'>
Index: 115 entries, Mali to Netherlands
Data columns:
closed 115 non-null values
open 91 non-null values
proportion_oa 91 non-null values
dtypes: float64(2), int64(1)
countries_sorted.proportion_oa[countries_sorted.proportion_oa > 0].head(10).plot(kind='bar', color ='m', rot=30,
    title ='Countries with highest ratio of OA to closed journals');
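Note that `proportion_oa` as computed above is the number of open journals per closed journal, so it can exceed 1. If a bounded share is preferred, a minimal sketch of the alternative follows. The counts are copied from three rows of the table above, and the column names `ratio_open_to_closed` and `share_oa_pct` are hypothetical.

```python
import pandas as pd

# Counts copied from three rows of the table above.
counts = pd.DataFrame({"closed": [6, 20, 124], "open": [208, 351, 806]},
                      index=["Colombia", "Egypt", "Brazil"])

# Ratio used above: open journals per closed journal (unbounded).
counts["ratio_open_to_closed"] = counts["open"] / counts["closed"]
# Share of a country's journals that are open access (always between 0 and 100).
counts["share_oa_pct"] = 100 * counts["open"] / (counts["open"] + counts["closed"])
print(counts.round(2))
```

The ranking of countries can differ between the two measures when closed counts are very small, so the choice depends on which question the plot is meant to answer.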
snip_dist = pd.DataFrame(closed['2011 SNIP2'], columns=['2011 closed SNIP'])
snip_dist['2011 open SNIP'] = matches['2011 SNIP2']
All SNIP scores (outliers clipped)
snip_dist[snip_dist['2011 closed SNIP'] <15].boxplot(sym='m+');
snip_dist.describe()
| | 2011 closed SNIP | 2011 open SNIP |
---|---|---|
count | 16853.000000 | 1945.000000 |
mean | 0.826404 | 0.567863 |
std | 0.978531 | 0.516188 |
min | 0.000000 | 0.000000 |
25% | 0.259000 | 0.197000 |
50% | 0.668000 | 0.476000 |
75% | 1.115000 | 0.811000 |
max | 41.082000 | 4.814000 |
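The open count in this table (1,945) is lower than the 2,134 non-null 2011 SNIP values reported for `matches` further down. One possible explanation is index alignment: assigning a column from `matches` into a frame built from `closed` matches rows by index label, so open values whose labels do not exist in `closed` would be dropped. The sketch below reproduces that effect with hypothetical toy data and shows one way to keep every value. It assumes a current pandas and Python 3.

```python
import pandas as pd

# Hypothetical scores. closed_scores keeps labels from a larger table,
# while open_scores has a fresh 0..n-1 index, as a merge result would.
closed_scores = pd.Series([1.2, 0.8, 0.5], index=[0, 5, 7], name="closed")
open_scores = pd.Series([0.4, 0.6, 0.9], index=[0, 1, 2], name="open")

# Column assignment aligns on index labels, so only label 0 keeps an open value.
aligned = pd.DataFrame({"closed": closed_scores})
aligned["open"] = open_scores
print(aligned["open"].count())

# Resetting both indexes first keeps every value in both columns.
side_by_side = pd.concat([closed_scores.reset_index(drop=True),
                          open_scores.reset_index(drop=True)], axis=1)
print(side_by_side.describe().loc["count"].tolist())
```

Calling `describe()` on each series separately would avoid the issue as well. The same pattern is used for the SJR table below, so the same check applies there.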
sjr_dist = pd.DataFrame(closed['2011 SJR2'], columns=['2011 closed SJR'])
sjr_dist['2011 open SJR'] = matches['2011 SJR2']
Outliers clipped
sjr_dist[sjr_dist['2011 closed SJR']<15].boxplot(sym='m+');
sjr_dist.describe()
| | 2011 closed SJR | 2011 open SJR |
---|---|---|
count | 17131.000000 | 2005.000000 |
mean | 0.636056 | 0.380854 |
std | 1.137223 | 0.543216 |
min | 0.000000 | 0.000000 |
25% | 0.139000 | 0.128000 |
50% | 0.315000 | 0.217000 |
75% | 0.739000 | 0.414000 |
max | 36.194000 | 7.581000 |
open_years = matches[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
'2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]
closed_years = closed[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
'2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]
open_years.mean().plot(style='g');
closed_years.mean().plot(style='--', title='Mean SNIP score\nGreen=open journals\nBlue dashed=closed journals', rot=30);
clean = pd.DataFrame(open_years.mean(), columns=['open'])
clean['closed'] = closed_years.mean()
clean['diff'] = clean.closed - clean.open
clean
| | open | closed | diff |
---|---|---|---|
1999 SNIP2 | 0.445412 | 0.805719 | 0.360307 |
2000 SNIP2 | 0.468380 | 0.790728 | 0.322347 |
2001 SNIP2 | 0.383813 | 0.721381 | 0.337568 |
2002 SNIP2 | 0.400653 | 0.702516 | 0.301862 |
2003 SNIP2 | 0.456816 | 0.741005 | 0.284189 |
2004 SNIP2 | 0.502747 | 0.769588 | 0.266842 |
2005 SNIP2 | 0.526888 | 0.791959 | 0.265072 |
2006 SNIP2 | 0.553230 | 0.777087 | 0.223857 |
2007 SNIP2 | 0.502577 | 0.767240 | 0.264663 |
2008 SNIP2 | 0.512368 | 0.770808 | 0.258440 |
2009 SNIP2 | 0.516095 | 0.785189 | 0.269094 |
2010 SNIP2 | 0.509438 | 0.792834 | 0.283396 |
2011 SNIP2 | 0.564373 | 0.826404 | 0.262031 |
clean['diff'].plot(title='Closed mean - open mean', rot=30, style='m');
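The year-by-year column lists used above can be generated rather than typed out, which avoids typos when the range changes. This is a small sketch with hypothetical variable names (`snip_cols`, `sjr_cols`). It assumes the column naming pattern seen in this notebook, a four-digit year followed by `SNIP2` or `SJR2`.

```python
# Build the yearly column names instead of typing them out.
years = range(1999, 2012)  # 1999 through 2011 inclusive
snip_cols = ["{0} SNIP2".format(year) for year in years]
sjr_cols = ["{0} SJR2".format(year) for year in years]

print(snip_cols[0], snip_cols[-1], len(snip_cols))
```

The lists can then be used directly for selection, for example `matches[snip_cols].mean()`.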
matches[['1999 SNIP2', '2011 SNIP2']].describe()
| | 1999 SNIP2 | 2011 SNIP2 |
---|---|---|
count | 459.000000 | 2134.000000 |
mean | 0.445412 | 0.564373 |
std | 0.484183 | 0.509639 |
min | 0.000000 | 0.000000 |
25% | 0.147000 | 0.201250 |
50% | 0.334000 | 0.474500 |
75% | 0.592500 | 0.807000 |
max | 3.797000 | 4.814000 |
open_years_sjr = matches[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
'2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]
closed_years_sjr = closed[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
'2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]
open_years_sjr.mean().plot(style='g');
closed_years_sjr.mean().plot(style='--', title='Mean SJR score\nGreen=open journals\nBlue dashed=closed journals', rot=30);
sjr_diff = closed_years_sjr.mean() - open_years_sjr.mean()
sjr_diff
Year | Closed mean - open mean |
---|---|
1999 SJR2 | 0.281074 |
2000 SJR2 | 0.284787 |
2001 SJR2 | 0.285144 |
2002 SJR2 | 0.283726 |
2003 SJR2 | 0.290419 |
2004 SJR2 | 0.290213 |
2005 SJR2 | 0.290333 |
2006 SJR2 | 0.302838 |
2007 SJR2 | 0.308960 |
2008 SJR2 | 0.302925 |
2009 SJR2 | 0.294260 |
2010 SJR2 | 0.278882 |
2011 SJR2 | 0.257213 |
sjr_diff.plot(title='Closed journal SJR advantage over open journal', rot=30, ylim=(.0, .75), style='m');
matches[['1999 SJR2', '2011 SJR2']].describe()
| | 1999 SJR2 | 2011 SJR2 |
---|---|---|
count | 2201.000000 | 2201.000000 |
mean | 0.077542 | 0.378843 |
std | 0.300900 | 0.542297 |
min | 0.000000 | 0.000000 |
25% | 0.000000 | 0.128000 |
50% | 0.000000 | 0.216000 |
75% | 0.101000 | 0.408000 |
max | 6.916000 | 7.581000 |