This code accompanies this blog post.
Source data is available at Journal Metrics and at Directory of Open Access Journals.
Please note that I have not been able to sort the open access and closed access journals perfectly (see the author's note below). There are likely entries that appear in both databases. If you find the bug, please let me know so I can update the results. This bug will affect the results, but I don't expect the influence to be dramatic.
If you reuse any code or figures, please credit Caitlin Rivers and include a link to my work. I'd also love it if you let me know at @cmyeaton.
Open access plots are in green, closed access plots are in blue/purple, and magenta plots show some combination or comparison of the two, unless a plot's title states otherwise.

In [119]:

from __future__ import division
import pandas as pd
pd.set_printoptions(max_rows=100, max_columns=10)
from scipy import stats
import matplotlib.pyplot as plt

In [2]:

impact = pd.read_csv('SNIP_SJR_complete_1999_2011new_SNIP_and_SJR_v1_Oct_2012.csv')
open_access = pd.read_csv('open_access_journals.csv')

The ISSN data are somewhat messy. The open_access ISSNs have a hyphen in the middle, and some of the impact ISSNs have trailing spaces.

In [3]:

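def rm_issn_punc(x):
    import re
    # Replace @handles, URLs and any character that is not a letter, digit,
    # space or tab with a space (this turns the hyphen in '1234-5678' into a space).
    punc = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x)
    # Drop the spaces so that only the ISSN characters remain.
    space = ''.join(punc.split(" "))
    return space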


def strip_space(x):
    # ISSNs are kept as strings: converting to int would drop leading zeros
    # and fail on a trailing 'X' check character.
    return str(x).strip(' ')


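def membership(x):
    # True when the ISSN is absent from the open access list (journal treated as closed).
    # The set is rebuilt on every call, which is simple but slow for large inputs.
    return x not in set(open_access.issn_)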


open_access['issn'] = open_access['ISSN'].map(rm_issn_punc)
open_access['issn_'] = open_access['issn'].map(strip_space)
impact['issn'] = impact['Print ISSN'].map(strip_space)
impact['closed'] = impact['issn'].map(membership)
closed = impact[impact.closed == True]
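
The cleaning steps above could also be folded into a single helper, which makes it easier to test individual ISSNs by hand. The sketch below is only an illustration: `normalize_issn` is a hypothetical name, and the zero-padding step assumes that short all-digit values lost their leading zeros when a CSV column was parsed as numbers, which may not be what happened in these files.

```python
import re


def normalize_issn(raw):
    """Return an ISSN as upper-case characters with no hyphen or whitespace."""
    text = str(raw).strip().upper()
    # Keep only digits and the check character X; this drops hyphens and spaces.
    text = re.sub(r"[^0-9X]", "", text)
    # Assumption: a short value lost its leading zeros during parsing,
    # so pad it back to the 8 characters of a full ISSN.
    if text and len(text) < 8:
        text = text.zfill(8)
    return text


# Hand-made examples, not taken from the source data.
examples = ["0001-3714", "00013714  ", 13714, "1234-567x"]
cleaned = [normalize_issn(value) for value in examples]
```

Applying the same function to both data sets before comparing them would rule out formatting differences as the cause of a mismatch, though it would not catch journals that are listed under different ISSNs in the two sources.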

The open access list and the impact database are merged on ISSN rather than on title, to avoid character encoding differences between the two sources (e.g. see row 18 below).

In [4]:

matches = pd.merge(impact, open_access, left_on=impact['issn'], right_on=['issn']).drop_duplicates()
matches = matches.dropna(how='all')
matches = matches.drop_duplicates(cols='Title')
matches[['Source Title', 'Title']].head(20)

Out[4]:

	Source Title	Title
0	Revista de Microbiologia	Revista de Microbiologia
1	Anais da Academia Brasileira de Ciencias	Anais da Academia Brasileira de Ciencias.
2	Acta Adriatica	Acta Adriatica
3	Acta Biochimica Polonica	Acta Biochimica Polonica
4	Acta Biologica Cracoviensia Series Botanica	Acta Biologica Cracoviensia Series Botanica
5	Acta Dermato-Venereologica	Acta Dermato-Venereologica
6	Acta Geologica Polonica	Acta Geologica Polonica
7	Acta Medica Nagasakiensia	Acta Medica Nagasakiensia
8	Acta Palaeobotanica	Acta Palaeobotanica
9	Acta Societatis Botanicorum Poloniae	Acta Societatis Botanicorum Poloniae
10	Acta Stomatologica Croatica	Acta Stomatologica Croatica
11	Acta Veterinaria Brno	Acta Veterinaria Brno
12	Acta Zoologica Sinica	Acta Zoologica Sinica
13	Afrika Spectrum	Africa Spectrum
14	American Journal of Pharmaceutical Education	American Journal of Pharmaceutical Education
15	Notices of the American Mathematical Society	Notices of the American Mathematical Society.
16	American Museum Novitates	American Museum Novitates
17	Bulletin of the American Museum of Natural Histor	Bulletin of the American Museum of Natural Histor
18	Analise Social	AnÃ¡lise Social
19	Angle Orthodontist	Angle Orthodontist

Author's note:

The number of unique ISSNs in closed plus the number in open_access should equal the number in impact, but it doesn't (see the counts below). There is clearly a bug somewhere, but I can't find it even after extensive searching. I suspect there is an underlying problem with the format of the ISSNs. Feel free to take a crack at it.

There are likely some open access journals in the closed category, which may make the differences between the two categories less stark than they would otherwise be.

In [5]:

print 'Closed: {0}, Open: {1}, Full list: {2}'.format(len(closed.issn.unique()), len(open_access.issn_.unique()), len(impact.issn.unique()))

Closed: 28235, Open: 8597, Full list: 30774
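
One possible reading of these counts, offered as an assumption rather than a diagnosis: the open access list also contains journals that have no entry in the impact data, so the open count includes ISSNs that can never appear in the full list. A check that should hold regardless is that the closed ISSNs plus the open ISSNs that are actually present in the impact data equal the full list. The sketch below uses made-up ISSNs and a hypothetical `reconcile` helper.

```python
def reconcile(impact_issns, open_issns):
    """Split ISSNs into ranked-open, closed and unranked-open groups."""
    impact_set = set(impact_issns)
    open_set = set(open_issns)
    return {
        "full": len(impact_set),
        # Open access journals that also appear in the impact data.
        "ranked_open": len(impact_set & open_set),
        # Journals in the impact data that are not on the open access list.
        "closed": len(impact_set - open_set),
        # Open access journals with no entry in the impact data.
        "unranked_open": len(open_set - impact_set),
    }


# Toy values, not taken from the source files.
impact_issns = ["00013714", "00014001", "1234567X", "00029999"]
open_issns = ["00013714", "1234567X", "99990000"]
report = reconcile(impact_issns, open_issns)
```

If `closed + ranked_open` matches `full` on the real data, the partition itself is consistent and any remaining misclassification would come from ISSNs that fail to match, not from the counting.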

Top impact journals overall

Definitions from http://www.journalmetrics.com/faq.php

BY SNIP

SNIP, or Source-Normalized Impact per Paper, measures a source’s contextual citation impact. It takes into account characteristics of the source's subject field, especially the frequency at which authors cite other papers in their reference lists, the speed at which citation impact matures, and the extent to which the database used in the assessment covers the field’s literature. SNIP is the ratio of a source's average citation count per paper, and the ‘citation potential’ of its subject field. It aims to allow direct comparison of sources in different subject fields.
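
To make the ratio in this definition concrete, here is a minimal sketch. The function name `snip_ratio` and all numbers are invented for illustration, and the real SNIP calculation involves more steps than a single division.

```python
def snip_ratio(citations_per_paper, citation_potential):
    """Ratio of a source's citations per paper to its field's citation potential."""
    return citations_per_paper / citation_potential


# Hypothetical journals: the second field cites four times as heavily,
# so four times as many raw citations yield the same normalized score.
low_citation_field = snip_ratio(citations_per_paper=1.5, citation_potential=0.5)
high_citation_field = snip_ratio(citations_per_paper=6.0, citation_potential=2.0)
```

Both calls return the same value, which is the sense in which the metric aims to allow comparison across subject fields.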

In [6]:

top_impact_all = impact[['Source Title', '2011 SNIP2']].copy()
top_impact_all = pd.DataFrame(top_impact_all.sort('2011 SNIP2', ascending=False).dropna(), columns=['Source Title', '2011 SNIP2'])
top_impact_all['2011 SJR2'] = impact['2011 SJR2']
top_impact_all['Difference'] = top_impact_all['2011 SNIP2'] - top_impact_all['2011 SJR2']

In [7]:

top_impact_all.head(15)

Out[7]:

	Source Title	2011 SNIP2	2011 SJR2	Difference
4874	CA - A Cancer Journal for Clinicians	41.082	24.976	16.106
10365	Foundations and Trends in Information Retrieval	32.028	10.411	21.617
26384	Reviews of Modern Physics	22.129	36.194	-14.065
110	ACM Computing Surveys	17.848	9.926	7.922
16434	Journal of Engineering Education	16.072	1.358	14.714
22052	New England Journal of Medicine	14.971	9.740	5.231
25177	Progress in Materials Science	13.535	10.127	3.408
2280	Annual Review of Psychology	12.013	8.137	3.876
1004	Advances in Physics	11.750	26.216	-14.466
12309	IEEE Communications Surveys and Tutorials	11.584	6.315	5.269
23931	Physics Reports	11.363	10.761	0.602
5538	Chemical Reviews	11.350	15.866	-4.516
16321	Journal of Economic Literature	10.738	13.121	-2.383
2257	Annual Review of Immunology	10.680	31.166	-20.486
25160	Progress in Energy and Combustion Science	10.579	6.430	4.149

BY SJR

SJR, or SCImago Journal Rank, is a measure of the scientific prestige of scholarly sources.

SJR assigns relative scores to all of the sources in a citation network. Its methodology is inspired by the Google PageRank algorithm, in that not all citations are equal. A source transfers its own 'prestige', or status, to another source through the act of citing it. A citation from a source with a relatively high SJR is worth more than a citation from a source with a lower SJR.
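
The idea that a citation is worth more when it comes from a prestigious source can be sketched with a plain PageRank-style iteration. This is not the SJR algorithm itself, only the general technique the definition says it is inspired by; the function name, the damping value and the toy citation matrix are all assumptions.

```python
def prestige_scores(citations, damping=0.85, iterations=100):
    """PageRank-style scores; citations[i][j] counts citations from source i to j."""
    n = len(citations)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Every source receives a small baseline share.
        new_scores = [(1.0 - damping) / n] * n
        for i, row in enumerate(citations):
            total = sum(row)
            if total == 0:
                # A source that cites nothing spreads its score evenly.
                for j in range(n):
                    new_scores[j] += damping * scores[i] / n
            else:
                # A source passes its score on in proportion to its citations.
                for j, count in enumerate(row):
                    new_scores[j] += damping * scores[i] * count / total
        scores = new_scores
    return scores


# Toy network: A cites B and C, B cites C, C cites A.
citations = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = prestige_scores(citations)
```

In this toy network C ends up with the highest score because it is cited by both other sources, and A ranks above B because its single citation comes from the highest-scoring source.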

In [8]:

top_impact_all.sort('2011 SJR2', ascending=False).dropna().head(15)

Out[8]:

	Source Title	2011 SNIP2	2011 SJR2	Difference
26384	Reviews of Modern Physics	22.129	36.194	-14.065
2257	Annual Review of Immunology	10.680	31.166	-20.486
1004	Advances in Physics	11.750	26.216	-14.466
4874	CA - A Cancer Journal for Clinicians	41.082	24.976	16.106
2232	Annual Review of Biochemistry	8.276	23.856	-15.580
21729	Nature Genetics	7.211	19.919	-12.708
5311	Cell	6.579	19.779	-13.200
2265	Annual Review of Neuroscience	7.609	17.014	-9.405
2254	Annual Review of Genetics	5.113	16.628	-11.515
25646	Quarterly Journal of Economics	6.621	16.230	-9.609
5538	Chemical Reviews	11.350	15.866	-4.516
21732	Nature Materials	7.960	15.413	-7.453
2276	Annual Review of Plant Biology	10.257	14.740	-4.483
21709	Nature	8.647	14.548	-5.901
21731	Nature Immunology	3.912	14.286	-10.374

Language of open access journals

In [60]:

open_lang = open_access.Language.value_counts().head(10)/len(open_access)*100
open_lang

Out[60]:

English                55.810166
Spanish                 6.036990
Portuguese              3.315110
Spanish, English        3.222054
English, French         1.395836
Portuguese, English     1.291148
Portuguese, Spanish     1.116669
English, Spanish        1.058509
French                  1.046877
English                 0.849133

In [61]:

open_lang.plot(kind='bar', title='Most common languages, open access journals (%)', color='green', alpha=.3);

Most common keywords in open access journals

In [68]:

open_lang = open_access.Keyword.value_counts().head(15)/len(open_access)*100
open_lang

Out[68]:

health sciences          0.988717
medicine                 0.523438
education                0.442015
mathematics              0.418751
psychology               0.348959
medical sciences         0.348959
human sciences           0.325695
philosophy               0.279167
biological sciences      0.255903
social sciences          0.232639
chemistry                0.209375
agricultural sciences    0.209375
public health            0.162848
law                      0.162848
physics                  0.151216

In [63]:

open_lang.plot(kind='bar', title='Most common keywords, open access journals (%)', color='green', alpha=.3);

Breakdown by start year

In [38]:

plt.figure()
open_access['Start Year'].hist(range=(1980, 2012), bins=30, color='green', alpha=.3)
plt.title('Histogram of start year, open access journals');

Oldest OA journals

In [13]:

timeline = open_access.sort('Start Year')
timeline[['Title', 'Start Year', 'End Year']].head(10)

Out[13]:

	Title	Start Year	End Year
914	Bijdragen Tot de Taal-, Land- en Volkenkunde van Ne	1853	1948
6652	Psyche : A Journal of Entomology	1874	NaN
2616	Fishery Bulletin	1881	NaN
1222	Bulletin of the American Museum of Natural Histor	1881	NaN
7868	South African Medical Journal	1884	NaN
1225	Bulletin of the Geological Society of Denmark	1894	NaN
5544	MemÃ³rias do Instituto Oswaldo Cruz.	1909	NaN
4561	Journal of Genetics	1910	NaN
1232	Bulletin of the Medical Library Association	1911	2001
5773	Nieuwe West-Indische Gids	1919	1991

Fees

In [14]:

fee = open_access['Publication fee'].value_counts()/len(open_access)*100
fee

Out[14]:

No                     66.418518
Yes                    27.939979
Conditional             2.954519
Information missing     2.512504

In [37]:

fee.plot(kind='bar', title='Histogram of fee required (%)', color='green', alpha=.3);

Comparing open and closed access

There are 6,059 open access journals that do not have a ranking; they are therefore excluded from the analysis.

In [16]:

len(open_access) - len(matches)

Out[16]:

By discipline

In [17]:

closed_field = closed[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(closed)*100
open_field = matches[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(matches)*100

In [58]:

closed_field.plot(color='green', kind='bar', alpha=.3);
open_field.plot(color='blue', kind='bar', alpha=.4, title='Comparison of discipline (%)\nGreen=closed, purple=OA');

By country of origin

In [82]:

countries = pd.DataFrame(closed.Country.value_counts(), columns=['closed'])
countries['open'] = open_access.Country.value_counts()
countries['proportion_oa'] = countries['open']/countries['closed']

In [115]:

countries_sorted = countries.sort('proportion_oa', ascending=False)
countries_sorted[countries_sorted.proportion_oa >= 0][:10]

Out[115]:

	closed	open	proportion_oa
Colombia	6	208	34.666667
Costa Rica	1	26	26.000000
Egypt	20	351	17.550000
Chile	17	142	8.352941
Indonesia	6	45	7.500000
Brazil	124	806	6.500000
Peru	5	29	5.800000
Cuba	9	50	5.555556
Venezuela	16	85	5.312500
Tunisia	2	10	5.000000
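
Note that `proportion_oa` as computed above is the ratio of open to closed journals, so it can exceed 1. If a share of all journals is wanted instead, it could be computed as in the sketch below. The function names are hypothetical and the counts are copied from three rows of the table above. The caveat is an assumption worth checking: the closed counts come from the impact data while the open counts come from the whole open access list, so the two columns may not describe the same population.

```python
def oa_ratio(closed_count, open_count):
    """Open access journals per closed journal (the quantity in the table above)."""
    return open_count / float(closed_count)


def oa_share(closed_count, open_count):
    """Open access journals as a percentage of all journals for a country."""
    return 100.0 * open_count / (closed_count + open_count)


# (closed, open) counts taken from the table above.
counts = {"Colombia": (6, 208), "Brazil": (124, 806), "Tunisia": (2, 10)}
ratios = {name: oa_ratio(c, o) for name, (c, o) in counts.items()}
shares = {name: oa_share(c, o) for name, (c, o) in counts.items()}
```

For Tunisia the ratio is 5.0 while the share is about 83%, which shows how differently the two measures read for the same counts.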

In [95]:

countries.closed.head(10).plot(color='green', kind='bar', alpha=.3)
countries.open.head(10).plot(color='blue', kind='bar', alpha=.4, title='Comparison of top journal producers (number of journals)\nGreen=closed, purple=OA');

In [102]:

countries_sorted

Out[102]:

<class 'pandas.core.frame.DataFrame'>
Index: 115 entries, Mali to Netherlands
Data columns:
closed           115  non-null values
open             91  non-null values
proportion_oa    91  non-null values
dtypes: float64(2), int64(1)

In [110]:

countries_sorted.proportion_oa[countries_sorted.proportion_oa > 0].head(10).plot(kind='bar', color ='m', rot=30, 
title ='Countries with highest ratio of OA to closed journals');

Comparing SNIP distribution

In [21]:

snip_dist = pd.DataFrame(closed['2011 SNIP2'], columns=['2011 closed SNIP'])
snip_dist['2011 open SNIP'] = matches['2011 SNIP2']

All SNIP scores (outliers clipped)

In [54]:

snip_dist[snip_dist['2011 closed SNIP'] <15].boxplot(sym='m+');

In [23]:

snip_dist.describe()

Out[23]:

	2011 closed SNIP	2011 open SNIP
count	16853.000000	1945.000000
mean	0.826404	0.567863
std	0.978531	0.516188
min	0.000000	0.000000
25%	0.259000	0.197000
50%	0.668000	0.476000
75%	1.115000	0.811000
max	41.082000	4.814000
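
The open SNIP count here (1945) is lower than the count reported for the same column of `matches` further down (2134). One plausible explanation, stated as an assumption: assigning a column to a DataFrame aligns on index labels, and the merged `matches` frame does not share index labels with `closed`, so some open values may be dropped. The sketch below reproduces the effect on toy data and shows one way to place two columns side by side by position instead.

```python
import pandas as pd

# Toy scores with deliberately different index labels.
closed_scores = pd.Series([0.9, 1.2, 0.4], index=[10, 11, 12], name="closed")
open_scores = pd.Series([0.5, 0.7], index=[0, 1], name="open")

# Column assignment aligns on index labels: none of the labels match here,
# so every open value becomes NaN.
aligned = pd.DataFrame({"closed": closed_scores})
aligned["open"] = open_scores

# Resetting both indexes lines the values up by position instead.
side_by_side = pd.concat(
    [closed_scores.reset_index(drop=True), open_scores.reset_index(drop=True)],
    axis=1,
)
```

Calling `describe()` on `side_by_side` would then summarize every value in both columns, since missing entries only pad the shorter column.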

Comparing SJR distributions

In [24]:

sjr_dist = pd.DataFrame(closed['2011 SJR2'], columns=['2011 closed SJR'])
sjr_dist['2011 open SJR'] = matches['2011 SJR2']

Outliers clipped

In [53]:

sjr_dist[sjr_dist['2011 closed SJR']<15].boxplot(sym='m+');

In [26]:

sjr_dist.describe()

Out[26]:

	2011 closed SJR	2011 open SJR
count	17131.000000	2005.000000
mean	0.636056	0.380854
std	1.137223	0.543216
min	0.000000	0.000000
25%	0.139000	0.128000
50%	0.315000	0.217000
75%	0.739000	0.414000
max	36.194000	7.581000
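
The notebook imports `scipy.stats` but does not use it. If a formal comparison of the two skewed distributions were wanted, a rank-based test such as Mann-Whitney U is one option. The sketch below runs on synthetic log-normal samples, not on the journal data, and the sample sizes and parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed samples standing in for closed and open scores.
rng = np.random.RandomState(0)
closed_sample = rng.lognormal(mean=-0.5, sigma=0.9, size=500)
open_sample = rng.lognormal(mean=-0.9, sigma=0.9, size=200)

# Two-sided test of whether one sample tends to have larger values than the other.
statistic, p_value = stats.mannwhitneyu(
    closed_sample, open_sample, alternative="two-sided"
)
```

On the real data the inputs would be the non-null closed and open score columns. Given the classification problem described in the author's note, any p-value should be read with caution.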

Open access impact over time

In [27]:

open_years = matches[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
                      '2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]
closed_years = closed[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
                      '2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]

In [42]:

open_years.mean().plot(style='g');
closed_years.mean().plot(style='--', title='Mean SNIP score\nGreen=open journals\nDashed=closed journals', rot=30);

In [142]:

clean = pd.DataFrame(open_years.mean(), columns=['open'])
clean['closed'] = closed_years.mean()
clean['diff'] = clean.closed - clean.open
clean

Out[142]:

	open	closed	diff
1999 SNIP2	0.445412	0.805719	0.360307
2000 SNIP2	0.468380	0.790728	0.322347
2001 SNIP2	0.383813	0.721381	0.337568
2002 SNIP2	0.400653	0.702516	0.301862
2003 SNIP2	0.456816	0.741005	0.284189
2004 SNIP2	0.502747	0.769588	0.266842
2005 SNIP2	0.526888	0.791959	0.265072
2006 SNIP2	0.553230	0.777087	0.223857
2007 SNIP2	0.502577	0.767240	0.264663
2008 SNIP2	0.512368	0.770808	0.258440
2009 SNIP2	0.516095	0.785189	0.269094
2010 SNIP2	0.509438	0.792834	0.283396
2011 SNIP2	0.564373	0.826404	0.262031

In [140]:

clean['diff'].plot(title='Closed mean - open mean', rot=30, style='m');

In [30]:

matches[['1999 SNIP2', '2011 SNIP2']].describe()

Out[30]:

	1999 SNIP2	2011 SNIP2
count	459.000000	2134.000000
mean	0.445412	0.564373
std	0.484183	0.509639
min	0.000000	0.000000
25%	0.147000	0.201250
50%	0.334000	0.474500
75%	0.592500	0.807000
max	3.797000	4.814000

In [31]:

open_years_sjr = matches[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
                      '2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]
closed_years_sjr = closed[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
                      '2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]

In [49]:

open_years_sjr.mean().plot(style='green');
closed_years_sjr.mean().plot(style='--', title='Mean SJR score\nGreen=open journals\nDashed=closed journals', rot=30);

In [145]:

sjr_diff = closed_years_sjr.mean() - open_years_sjr.mean()
sjr_diff

Out[145]:

1999 SJR2    0.281074
2000 SJR2    0.284787
2001 SJR2    0.285144
2002 SJR2    0.283726
2003 SJR2    0.290419
2004 SJR2    0.290213
2005 SJR2    0.290333
2006 SJR2    0.302838
2007 SJR2    0.308960
2008 SJR2    0.302925
2009 SJR2    0.294260
2010 SJR2    0.278882
2011 SJR2    0.257213

In [146]:

sjr_diff.plot(title='Closed journal SJR advantage over open journal', rot=30, ylim=(.0, .75), style='m');

In [34]:

matches[['1999 SJR2', '2011 SJR2']].describe()

Out[34]:

	1999 SJR2	2011 SJR2
count	2201.000000	2201.000000
mean	0.077542	0.378843
std	0.300900	0.542297
min	0.000000	0.000000
25%	0.000000	0.128000
50%	0.000000	0.216000
75%	0.101000	0.408000
max	6.916000	7.581000