Code notes¶

This code accompanies this blog post.
Source data is available from Journal Metrics and from the Directory of Open Access Journals.
If you reuse any code or figures, please credit Caitlin Rivers and include a link to my work. I'd also love it if you let me know @cmyeaton.
Plot colors:
- open access: green
- closed access: blue/purple
- combination or comparison: magenta

In [89]:

from __future__ import division

import pandas as pd
#pd.set_printoptions(max_rows=100, max_columns=10)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 10)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In [90]:

impact = pd.read_csv('../../SNIP_SJR_complete_1999_2011new_SNIP_and_SJR_v1_Oct_2012.csv')
open_access = pd.read_csv('../data/open_access_journals.csv')

The ISSN data are somewhat messy: the open_access ISSNs have a hyphen in the middle, and some of the impact ISSNs have trailing spaces.

In [91]:

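def rm_issn_punc(x):
    # Replace punctuation (such as the hyphen in 1234-5678) with a space, then remove all spaces
    import re
    punc = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x)
    space = ''.join(punc.split(" "))
    return space


def strip_space(x):
    # Use an integer ISSN where possible; otherwise keep the string without surrounding spaces
    try:
        new = int(x)
    except Exception as e:
        new = str(x).strip(' ')

    return new


def membership(x):
    # True when the ISSN is not on the open access list, i.e. the journal is treated as closed access
    open_lst = np.array(open_access.issn)
    return x not in open_lst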


open_access['issn_'] = open_access['ISSN'].map(rm_issn_punc)
open_access['issn'] = open_access['issn_'].map(strip_space)

impact['issn'] = impact['Print ISSN'].map(strip_space)

impact.issn = impact.issn.replace('nan', np.nan)
open_access.issn = open_access.issn.replace('nan', np.nan)

impact = impact.drop_duplicates(cols='issn')
open_access = open_access.drop_duplicates(cols='issn')

impact['closed'] = impact['issn'].map(membership)
closed = impact[impact.closed != False]
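As a compact alternative to the helper functions above, the sketch below normalizes every ISSN to the same eight-character string and uses a set for the membership test. The names `normalize_issn` and `is_closed` and the sample ISSNs are illustrative assumptions, not part of the original analysis. Unlike `strip_space`, this version keeps ISSNs as strings, so a trailing `X` check digit and leading zeros are handled the same way on both sides.

```python
import math
import re


def normalize_issn(raw):
    """Return an ISSN as an 8-character string such as '0317847X', or None if missing."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    if isinstance(raw, float):
        # A numeric column read as float (e.g. 3178471.0) is turned back into an integer.
        raw = int(raw)
    # Drop everything except digits and the X check digit (hyphens, spaces, tabs).
    cleaned = re.sub(r"[^0-9Xx]", "", str(raw)).upper()
    # Integers read from a CSV lose their leading zeros, so pad back to 8 characters.
    return cleaned.zfill(8) if cleaned else None


# Hypothetical sample values, formatted the way the two sources differ.
open_issns = {normalize_issn(v) for v in ["0317-8471", "2049-363X"]}


def is_closed(raw):
    """True when the ISSN is not on the open access list."""
    return normalize_issn(raw) not in open_issns


print(is_closed("03178471 "))   # False: same journal despite the trailing space
print(is_closed(3178471))       # False: an integer ISSN that lost its leading zero
print(is_closed("1234-5679"))   # True: not on the open access list
```

Building the set once also avoids rebuilding an array of every open access ISSN on each call, which is what `membership` does.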

In [92]:

impact.closed.value_counts()

Out[92]:

True     28235
False     2539

The open access list and the impact db are merged on ISSN rather than on title, to avoid character encoding differences between the two sources (e.g. see row 18 below).

In [93]:

matches = pd.merge(impact, open_access, left_on=impact['issn'], right_on=['issn']).drop_duplicates()
matches = matches.dropna(how='all')
matches = matches.drop_duplicates(cols='Title')
matches['Country'] = matches['Country_x']
matches[['Source Title', 'Title']].head(20)

Out[93]:

	Source Title	Title
0	AACL Bioflux	Aquaculture, Aquarium, Conservation & Legislation
1	Abstract and Applied Analysis	Abstract and Applied Analysis
2	Academia	Academia : Revista Latinoamericana de Administ...
3	ACIMED	ACIMED
4	ACME	ACME : An International e-Journal for Critical...
5	Acoustical Science and Technology	Acoustical Science and Technology
6	Acta Adriatica	Acta Adriatica
7	Acta Agriculturae Slovenica	Acta Agriculturae Slovenica
8	Acta Amazonica	Acta Amazonica
9	Acta Biochimica Polonica	Acta Biochimica Polonica
10	Acta Bioethica	Acta Bioethica
11	Acta Biologica Colombiana	Acta BiolÃ³gica Colombiana
12	Acta Biologica Cracoviensia Series Botanica	Acta Biologica Cracoviensia Series Botanica
13	Acta Bioquimica Clinica Latinoamericana	Acta BioquÃmica ClÃnica Latinoamericana
14	Acta Botanica Brasilica	Acta Botanica Brasilica
15	Acta Botanica Croatica	Acta Botanica Croatica
16	Acta Botanica Malacitana	Acta Botanica Malacitana
17	Acta Botanica Mexicana	Acta BotÃ¡nica Mexicana
18	Acta Botanica Venezuelica	Acta BotÃ¡nica VenezuÃ©lica
19	Acta Chimica Slovenica	Acta Chimica Slovenica
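Row 18 above shows why titles make an unreliable join key: "Acta Botánica Venezuélica" appears as "Acta BotÃ¡nica VenezuÃ©lica", which is what UTF-8 bytes look like when they are decoded as Latin-1. The sketch below reproduces that effect. That this particular mix-up is what happened to the open access file is an assumption inferred from the output, not something confirmed against the data.

```python
title = "Acta Botánica Venezuélica"

# Encode as UTF-8, then (wrongly) decode the bytes as Latin-1.
garbled = title.encode("utf-8").decode("latin-1")
print(garbled)            # Acta BotÃ¡nica VenezuÃ©lica

# The titles no longer compare equal, so a merge on title would miss this journal.
print(garbled == title)   # False

# Reversing the two steps recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == title)  # True
```

An ISSN contains only digits, a hyphen and possibly an `X`, so it is unaffected by this kind of encoding mismatch.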

In [94]:

print('Closed: {0}, Open: {1}, Full list: {2}'.format(len(closed.issn.dropna().unique()), len(open_access.issn.dropna().unique()), len(impact.issn.dropna().unique())))

Closed: 28234, Open: 8597, Full list: 30773

Top impact journals overall¶

Definitions are from http://www.journalmetrics.com/faq.php

BY SNIP

SNIP, or Source-Normalized Impact per Paper, measures a source’s contextual citation impact. It takes into account characteristics of the source's subject field, especially the frequency at which authors cite other papers in their reference lists, the speed at which citation impact matures, and the extent to which the database used in the assessment covers the field’s literature. SNIP is the ratio of a source's average citation count per paper, and the ‘citation potential’ of its subject field. It aims to allow direct comparison of sources in different subject fields.
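The ratio in that definition can be illustrated with a toy calculation. This is only a sketch of the idea, not the official SNIP calculation; the function name `snip` and all numbers are invented for illustration.

```python
def snip(citations_per_paper, citation_potential):
    """Toy version of the ratio: average citations per paper over the field's citation potential."""
    return citations_per_paper / citation_potential


# Hypothetical journals in two fields with very different citation habits.
math_journal = snip(citations_per_paper=1.5, citation_potential=0.5)
biology_journal = snip(citations_per_paper=4.0, citation_potential=2.0)

print(math_journal)     # 3.0
print(biology_journal)  # 2.0
```

The biology journal collects more citations per paper, but relative to what is typical in its field the mathematics journal scores higher, which is the kind of cross-field comparison the definition describes.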

In [95]:

top_impact_all = impact[['Source Title', '2011 SNIP2']].copy()
top_impact_all = pd.DataFrame(top_impact_all.sort('2011 SNIP2', ascending=False).dropna(), columns=['Source Title', '2011 SNIP2'])
top_impact_all['2011 SJR2'] = impact['2011 SJR2']
top_impact_all['Difference'] = top_impact_all['2011 SNIP2'] - top_impact_all['2011 SJR2']

In [96]:

top_impact_all.head(15)

Out[96]:

	Source Title	2011 SNIP2	2011 SJR2	Difference
4874	CA - A Cancer Journal for Clinicians	41.082	24.976	16.106
10365	Foundations and Trends in Information Retrieval	32.028	10.411	21.617
26384	Reviews of Modern Physics	22.129	36.194	-14.065
110	ACM Computing Surveys	17.848	9.926	7.922
16434	Journal of Engineering Education	16.072	1.358	14.714
22052	New England Journal of Medicine	14.971	9.740	5.231
25177	Progress in Materials Science	13.535	10.127	3.408
2280	Annual Review of Psychology	12.013	8.137	3.876
1004	Advances in Physics	11.750	26.216	-14.466
12309	IEEE Communications Surveys and Tutorials	11.584	6.315	5.269
23931	Physics Reports	11.363	10.761	0.602
5538	Chemical Reviews	11.350	15.866	-4.516
16321	Journal of Economic Literature	10.738	13.121	-2.383
2257	Annual Review of Immunology	10.680	31.166	-20.486
25160	Progress in Energy and Combustion Science	10.579	6.430	4.149

BY SJR

SJR, or SCImago Journal Rank, is a measure of the scientific prestige of scholarly sources.

SJR assigns relative scores to all of the sources in a citation network. Its methodology is inspired by the Google PageRank algorithm, in that not all citations are equal. A source transfers its own 'prestige', or status, to another source through the act of citing it. A citation from a source with a relatively high SJR is worth more than a citation from a source with a lower SJR.
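The idea that a citation is worth more when it comes from a prestigious source can be sketched with a plain PageRank-style iteration. This is not the SJR algorithm itself, only a minimal illustration of the principle; the function `prestige_scores`, the damping value and the citation network are all made up.

```python
def prestige_scores(citations, damping=0.85, iterations=100):
    """PageRank-style scores for a dict of {source: {target: citation count}}."""
    journals = sorted(citations)
    n = len(journals)
    scores = {j: 1.0 / n for j in journals}
    for _ in range(iterations):
        # Every journal starts each round with the same small baseline.
        new_scores = {j: (1.0 - damping) / n for j in journals}
        for source in journals:
            outgoing = citations[source]
            total = sum(outgoing.values())
            if total == 0:
                # A journal that cites nobody spreads its prestige evenly.
                for target in journals:
                    new_scores[target] += damping * scores[source] / n
            else:
                # Prestige is passed on in proportion to the citations given.
                for target, count in outgoing.items():
                    new_scores[target] += damping * scores[source] * count / total
        scores = new_scores
    return scores


# Hypothetical network: X and Y each receive exactly one citation,
# but X is cited by the well-cited Hub and Y by the uncited Z.
citations = {
    'Hub': {'X': 1},
    'P1': {'Hub': 1},
    'P2': {'Hub': 1},
    'Z': {'Y': 1},
    'X': {},
    'Y': {},
}
scores = prestige_scores(citations)
print(scores['X'] > scores['Y'])  # True
```

X and Y have the same raw citation count, yet X ends up with the higher score because its single citation comes from a source that is itself well cited.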

In [97]:

top_impact_all.sort('2011 SJR2', ascending=False).dropna().head(15)

Out[97]:

	Source Title	2011 SNIP2	2011 SJR2	Difference
26384	Reviews of Modern Physics	22.129	36.194	-14.065
2257	Annual Review of Immunology	10.680	31.166	-20.486
1004	Advances in Physics	11.750	26.216	-14.466
4874	CA - A Cancer Journal for Clinicians	41.082	24.976	16.106
2232	Annual Review of Biochemistry	8.276	23.856	-15.580
21729	Nature Genetics	7.211	19.919	-12.708
5311	Cell	6.579	19.779	-13.200
2265	Annual Review of Neuroscience	7.609	17.014	-9.405
2254	Annual Review of Genetics	5.113	16.628	-11.515
25646	Quarterly Journal of Economics	6.621	16.230	-9.609
5538	Chemical Reviews	11.350	15.866	-4.516
21732	Nature Materials	7.960	15.413	-7.453
2276	Annual Review of Plant Biology	10.257	14.740	-4.483
21709	Nature	8.647	14.548	-5.901
21731	Nature Immunology	3.912	14.286	-10.374

Language of open access journals¶

In [98]:

open_lang = open_access.Language.value_counts().head(10)/len(open_access)*100
open_lang

Out[98]:

English                55.810166
Spanish                 6.036990
Portuguese              3.315110
Spanish, English        3.222054
English, French         1.395836
Portuguese, English     1.291148
Portuguese, Spanish     1.116669
English, Spanish        1.058509
French                  1.046877
English                 0.849133
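"English" appears twice in the output above (55.8% and 0.8%), which suggests that the two labels differ in something invisible, such as trailing whitespace. That is a guess from the output, not something checked against the data. A sketch of normalizing labels before counting, using made-up values:

```python
from collections import Counter

# Hypothetical language labels; two of them carry stray whitespace.
languages = ["English", "English ", "Spanish", "English", " Spanish"]

raw_counts = Counter(languages)
clean_counts = Counter(label.strip() for label in languages)

print(raw_counts["English"])    # 2
print(clean_counts["English"])  # 3
```

In pandas the same idea is to call `.str.strip()` on the column before `value_counts()`.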

In [99]:

plt.figure()
open_lang.plot(kind='bar', title='Most common languages, open access journals (%)', color='green', alpha=.3);
plt.show()

Most common keywords in open access journals¶

In [100]:

open_lang = open_access.Keyword.value_counts().head(15)/len(open_access)*100
open_lang

Out[100]:

health sciences          0.988717
medicine                 0.523438
education                0.442015
mathematics              0.418751
medical sciences         0.348959
psychology               0.348959
human sciences           0.325695
philosophy               0.279167
biological sciences      0.255903
social sciences          0.232639
chemistry                0.209375
agricultural sciences    0.209375
law                      0.162848
public health            0.162848
engineering              0.151216

In [101]:

open_lang.plot(kind='bar', title='Most common keywords, open access journals (%)', color='green', alpha=.3);

In [102]:

open_access.to_csv('open_access.csv')

Breakdown by start year¶

In [103]:

plt.figure()
open_access['Start Year'].hist(range=(1980, 2012), bins=30, color='green', alpha=.3)
plt.title('Histogram of start year, open access journals');

Oldest OA journals¶

In [104]:

timeline = open_access.sort('Start Year')
timeline[['Title', 'Start Year', 'End Year']].head(10)

Out[104]:

	Title	Start Year	End Year
914	Bijdragen Tot de Taal-, Land- en Volkenkunde v...	1853	1948
6652	Psyche : A Journal of Entomology	1874	NaN
2616	Fishery Bulletin	1881	NaN
1222	Bulletin of the American Museum of Natural His...	1881	NaN
7868	South African Medical Journal	1884	NaN
1225	Bulletin of the Geological Society of Denmark	1894	NaN
5544	MemÃ³rias do Instituto Oswaldo Cruz.	1909	NaN
4561	Journal of Genetics	1910	NaN
1232	Bulletin of the Medical Library Association	1911	2001
5773	Nieuwe West-Indische Gids	1919	1991

Fees¶

In [105]:

fee = open_access['Publication fee'].value_counts()/len(open_access)*100
fee

Out[105]:

No                     66.418518
Yes                    27.939979
Conditional             2.954519
Information missing     2.512504

In [106]:

fee.plot(kind='bar', title='Histogram of fee required (%)', color='green', alpha=.3, rot=0);

Comparing open and closed access¶

There are 6,059 open access journals that do not have a ranking and are therefore excluded from the comparison.

In [107]:

len(open_access) - len(matches)

Out[107]:

By discipline¶

In [108]:

closed_field = closed[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(closed)
open_field = matches[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(matches)

In [109]:

closed_field.plot(color='red', kind='bar', alpha=.7);

In [110]:

tmp = pd.DataFrame(open_field, columns=['Open access'])
tmp['Closed access'] = closed_field

In [111]:

plt.figure()
tmp.plot(kind='bar', color=['green', 'red'], alpha=.5, title='Comparison of discipline', rot=0);

By country of origin¶

In [112]:

countries = pd.DataFrame(closed.Country.value_counts(), columns=['closed'])
countries['open'] = open_access.Country.value_counts()
countries['proportion_oa'] = countries['open']/countries['closed']

In [113]:

countries_sorted = countries.sort('proportion_oa', ascending=False)
countries_sorted[countries_sorted.proportion_oa >= 0][:10]

Out[113]:

	closed	open	proportion_oa
Colombia	6	208	34.666667
Costa Rica	1	26	26.000000
Egypt	20	351	17.550000
Chile	17	142	8.352941
Indonesia	6	45	7.500000
Peru	4	29	7.250000
Cuba	7	50	7.142857
Brazil	121	806	6.661157
Venezuela	16	85	5.312500
Tunisia	2	10	5.000000
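Note that `proportion_oa` is a ratio of open to closed journals, not a share of all journals, which is why it can exceed 1. The sketch below works through the Colombia row from the table above; the variable names are illustrative only.

```python
closed_count = 6    # Colombia, "closed" column above
open_count = 208    # Colombia, "open" column above

# What the notebook computes: open journals per closed journal.
ratio_open_to_closed = open_count / closed_count

# A share between 0 and 1: open journals as a fraction of open plus closed.
share_open = open_count / (open_count + closed_count)

print(round(ratio_open_to_closed, 6))  # 34.666667
print(round(share_open * 100, 1))      # 97.2
```

Either number should be read with care: the closed counts come from the ranked journals in the impact table, while the open counts come from the full open access list, which includes unranked journals.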

In [114]:

plt.figure()
countries.closed.head(10).plot(color='red', kind='bar', alpha=.5)
countries.open.head(10).plot(color='green', kind='bar', alpha=.6, title='Comparison of top journal producers (number of journals)\nGreen=Open access, red=Closed access', rot=30);

In [115]:

countries_sorted

Out[115]:

<class 'pandas.core.frame.DataFrame'>
Index: 114 entries, Moldova, Republic of to Netherlands
Data columns:
closed           114  non-null values
open             91  non-null values
proportion_oa    91  non-null values
dtypes: float64(2), int64(1)

In [116]:

countries_sorted.proportion_oa[countries_sorted.proportion_oa > 0].head(10).plot(kind='bar', color ='g', rot=30, alpha=.5,
title ='Countries with highest ratio of OA to closed journals');

Comparing SNIP distribution¶

In [117]:

snip_dist = pd.DataFrame(closed['2011 SNIP2'], columns=['2011 closed SNIP'])
snip_dist['2011 open SNIP'] = matches['2011 SNIP2']

All SNIP scores (outliers clipped)

In [118]:

snip_dist[snip_dist['2011 closed SNIP'] <15].boxplot(sym='m+');

In [119]:

snip_dist.describe()

Out[119]:

	2011 closed SNIP	2011 open SNIP
count	16617.000000	1906.000000
mean	0.831351	0.570842
std	0.980973	0.520289
min	0.000000	0.000000
25%	0.266000	0.203000
50%	0.675000	0.480500
75%	1.118000	0.806750
max	41.082000	4.814000
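The open SNIP column above has 1,906 values, while `matches` holds 2,134 non-missing 2011 SNIP scores (see the `describe()` output for `matches` further down). One likely reason, offered as an assumption rather than a verified diagnosis, is that assigning a column to a DataFrame aligns on index labels, so open access rows whose labels do not occur in the closed table are dropped. The toy example below shows the effect and one way around it, using invented scores.

```python
import pandas as pd

# Hypothetical scores; the index labels mimic rows kept from two different tables.
closed_scores = pd.Series([0.9, 1.2, 0.4], index=[10, 11, 12])
open_scores = pd.Series([0.5, 0.7], index=[0, 1])

# Column assignment aligns on index labels, so nothing matches here.
aligned = pd.DataFrame({'closed': closed_scores})
aligned['open'] = open_scores
print(aligned['open'].count())        # 0

# Resetting both indexes places the two samples side by side without losing values.
side_by_side = pd.concat(
    [closed_scores.reset_index(drop=True), open_scores.reset_index(drop=True)],
    axis=1, keys=['closed', 'open'])
print(side_by_side['open'].count())   # 2
```

For the same reason, filtering rows on the closed column (as the box plot cell does with `< 15`) also discards any open score that sits in a row where the closed value is missing.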

Comparing SJR distributions¶

In [120]:

sjr_dist = pd.DataFrame(closed['2011 SJR2'], columns=['2011 closed SJR'])
sjr_dist['2011 open SJR'] = matches['2011 SJR2']

Outliers clipped

In [121]:

sjr_dist[sjr_dist['2011 closed SJR']<15].boxplot(sym='m+');

In [122]:

sjr_dist.describe()

Out[122]:

	2011 closed SJR	2011 open SJR
count	16881.000000	1970.000000
mean	0.639512	0.383745
std	1.142131	0.558075
min	0.000000	0.000000
25%	0.140000	0.128000
50%	0.319000	0.217000
75%	0.743000	0.408750
max	36.194000	7.581000

Open access impact over time¶

In [123]:

def find_snip(db):
    snip_out = db[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
                      '2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]
    
    return snip_out


def select_snip_by_era(country_db):
    snip = find_snip(country_db)
    
    era1 = snip[country_db['Start Year'] < 1996]
    era2 = snip[(country_db['Start Year'] >= 1996) & (country_db['Start Year'] <= 2001)]
    era3 = snip[country_db['Start Year'] > 2001]
    
    return snip, era1, era2, era3


def find_sjr(db):
    sjr_out = db[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
                      '2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]
    return sjr_out


def select_sjr_by_era(country_db):
    sjr = find_sjr(country_db)

    era1 = sjr[country_db['Start Year'] < 1996]
    era2 = sjr[(country_db['Start Year'] >= 1996) & (country_db['Start Year'] <= 2001)]
    era3 = sjr[country_db['Start Year'] > 2001]
    
    return sjr, era1, era2, era3
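The two pairs of functions above differ only in the metric name, so the column lists and the era boundaries can be generated rather than typed out. A minimal sketch follows; `metric_columns` and `era_of` are hypothetical helper names, and the era labels are my own wording for the three boundaries used above.

```python
YEARS = range(1999, 2012)  # 1999 through 2011 inclusive


def metric_columns(metric):
    """Column names such as '1999 SNIP2' ... '2011 SNIP2' for a given metric."""
    return ['{0} {1}'.format(year, metric) for year in YEARS]


def era_of(start_year):
    """Label a journal's start year using the same boundaries as the functions above."""
    if start_year < 1996:
        return 'before 1996'
    if start_year <= 2001:
        return '1996-2001'
    return 'after 2001'


print(metric_columns('SJR2')[:2])  # ['1999 SJR2', '2000 SJR2']
print(era_of(1995), era_of(1996), era_of(2002))
```

With a list like this, `db[metric_columns('SNIP2')]` selects the same columns as `find_snip(db)`.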

In [124]:

open_years = find_snip(matches)
closed_years = find_snip(closed)

In [125]:

plt.figure()
open_years.median().plot(style='g');
closed_years.median().plot(style='--', title='Median SNIP score\nGreen = Open access\nRed = Closed access', rot=30, color='r');

In [126]:

clean = pd.DataFrame(open_years.median(), columns=['open'])
clean['closed'] = closed_years.median()
clean['median_difference'] = clean.closed - clean.open

In [127]:

plt.figure()
clean.median_difference.plot(title='Median difference (closed median - open median)', rot=30, style='m');

In [128]:

clean

Out[128]:

	open	closed	median_difference
1999 SNIP2	0.3340	0.6810	0.3470
2000 SNIP2	0.3640	0.6750	0.3110
2001 SNIP2	0.2895	0.6090	0.3195
2002 SNIP2	0.3040	0.5890	0.2850
2003 SNIP2	0.3310	0.6010	0.2700
2004 SNIP2	0.3890	0.6170	0.2280
2005 SNIP2	0.4330	0.6500	0.2170
2006 SNIP2	0.4600	0.6310	0.1710
2007 SNIP2	0.3970	0.6130	0.2160
2008 SNIP2	0.3890	0.6170	0.2280
2009 SNIP2	0.3990	0.6285	0.2295
2010 SNIP2	0.4040	0.6400	0.2360
2011 SNIP2	0.4745	0.6750	0.2005
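The absolute gap can also be expressed relative to the closed median, which makes the years easier to compare. The sketch below uses three rows copied from the table above; the variable names are illustrative.

```python
# (open median, closed median) copied from the table above.
medians = {
    '1999': (0.3340, 0.6810),
    '2006': (0.4600, 0.6310),
    '2011': (0.4745, 0.6750),
}

relative_gap = {}
for year, (open_median, closed_median) in sorted(medians.items()):
    gap = closed_median - open_median
    # Share of the closed median by which the open median falls short.
    relative_gap[year] = gap / closed_median
    print(year, round(relative_gap[year] * 100, 1))
```

By this measure the open median was about 51% below the closed median in 1999 and about 30% below it in 2011. The set of ranked open access journals also grew over the period (459 with a 1999 SNIP versus 2,134 with a 2011 SNIP), so the change reflects a changing mix of journals as well as changes in individual journals.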

In [129]:

matches[['1999 SNIP2', '2011 SNIP2']].describe()

Out[129]:

	1999 SNIP2	2011 SNIP2
count	459.000000	2134.000000
mean	0.445412	0.564373
std	0.484183	0.509639
min	0.000000	0.000000
25%	0.147000	0.201250
50%	0.334000	0.474500
75%	0.592500	0.807000
max	3.797000	4.814000

In [130]:

matches[['1999 SNIP2', '2011 SNIP2']].median()

Out[130]:

1999 SNIP2    0.3340
2011 SNIP2    0.4745

In [131]:

matches['1999 SNIP2'].hist()

Out[131]:

<matplotlib.axes.AxesSubplot at 0x10e5b1d90>

In [132]:

open_years_sjr = find_sjr(matches)
closed_years_sjr = find_sjr(closed)

In [133]:

open_years_sjr.median().plot(style='green');
closed_years_sjr.median().plot(style='--', title='Median SJR score\nGreen=open journals\nBlue (dashed)=closed journals', rot=30);

In [134]:

sjr_diff = closed_years_sjr.median() - open_years_sjr.median()
sjr_diff

Out[134]:

1999 SJR2    0.113
2000 SJR2    0.121
2001 SJR2    0.126
2002 SJR2    0.131
2003 SJR2    0.139
2004 SJR2    0.151
2005 SJR2    0.169
2006 SJR2    0.193
2007 SJR2    0.112
2008 SJR2    0.123
2009 SJR2    0.125
2010 SJR2    0.109
2011 SJR2    0.103

In [135]:

sjr_diff.plot(title='Closed journal SJR advantage over open journal', rot=30, ylim=(.0, .75), style='m');

In [136]:

matches[['1999 SJR2', '2011 SJR2']].describe()

Out[136]:

	1999 SJR2	2011 SJR2
count	2201.000000	2201.000000
mean	0.077542	0.378843
std	0.300900	0.542297
min	0.000000	0.000000
25%	0.000000	0.128000
50%	0.000000	0.216000
75%	0.101000	0.408000
max	6.916000	7.581000
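The 1999 SJR column has the same count as 2011 (2,201) and a median of exactly 0, whereas the 1999 SNIP column has only 459 non-missing values. One possible reading, which is an assumption and has not been verified against the source file, is that missing SJR scores are stored as 0 rather than left blank. If so, zeros pull the early medians down, as this toy example with invented scores shows.

```python
import statistics

# Hypothetical 1999 scores in which 0.0 stands for "no score recorded".
scores_1999 = [0.0, 0.0, 0.0, 0.25, 0.5]

median_with_zeros = statistics.median(scores_1999)

# Keep only journals that actually have a score before taking the median.
reported = [s for s in scores_1999 if s > 0]
median_reported_only = statistics.median(reported)

print(median_with_zeros)     # 0.0
print(median_reported_only)  # 0.375
```

Whether zeros should be treated as missing depends on how the source encodes unranked years, so this is worth checking before comparing early and late SJR medians.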

Part 2¶

In [137]:

country_list = ['United States', 'United Kingdom', 'Germany','Netherlands']
nonlst = pd.DataFrame(impact.Country.value_counts()[4:])
nonlst = list(nonlst.index)

oa_big4 = matches[matches.Country_x.isin(country_list)]
oa_nonbig4 = matches[matches.Country_x.isin(nonlst)]
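`nonlst` is built by skipping the first four rows of `value_counts()`, which relies on the four countries in `country_list` also being the four most common countries in `impact`. A sketch of an explicit complement that does not depend on that ordering, using a small made-up table:

```python
import pandas as pd

country_list = ['United States', 'United Kingdom', 'Germany', 'Netherlands']

# Hypothetical journals; one has no country recorded.
journals = pd.DataFrame({
    'Source Title': ['A', 'B', 'C', 'D', 'E'],
    'Country': ['United States', 'Brazil', 'Germany', None, 'Colombia'],
})

is_big4 = journals['Country'].isin(country_list)
has_country = journals['Country'].notnull()

big4 = journals[is_big4]
# Everything with a known country that is not one of the four.
non_big4 = journals[has_country & ~is_big4]

print(list(big4['Source Title']))      # ['A', 'C']
print(list(non_big4['Source Title']))  # ['B', 'E']
```

Journals without a country are left out of both groups here, which matches what `isin(nonlst)` does in the cell above.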

Separate OA and closed by era, then by SJR and SNIP

In [138]:

oa_snip, oa_era1, oa_era2, oa_era3 = select_snip_by_era(oa_big4)
non_snip, non_era1, non_era2, non_era3 = select_snip_by_era(oa_nonbig4)

oasjr, oasjr1, oasjr2, oasjr3 = select_sjr_by_era(oa_big4)
nonsjr, nonsjr1, nonsjr2, nonsjr3 = select_sjr_by_era(oa_nonbig4)

Find the 2011 median SJR of OA journals, by start-year era and by whether they come from one of the Big 4 countries.

In [139]:

median_sjr = pd.DataFrame([oasjr1['2011 SJR2'].median(), oasjr2['2011 SJR2'].median(), oasjr3['2011 SJR2'].median()])
median_sjr_col = median_sjr
tmp = median_sjr_col.rename(columns={0: 'Big4 Median SJR'}).T
median_sjr_big4 = tmp.rename(columns={0:'Before 1996', 1: '1996-2001', 2:'2002-2011'})

In [140]:

OA_sjr_big4 = median_sjr_big4.T
OA_sjr_big4['Non-Big4 median SJR'] = ([nonsjr1['2011 SJR2'].median(), nonsjr2['2011 SJR2'].median(), nonsjr3['2011 SJR2'].median()])
OA_sjr_big4.T

Out[140]:

	Before 1996	1996-2001	2002-2011
Big4 Median SJR	0.489	0.5005	0.418
Non-Big4 median SJR	0.223	0.1940	0.171

In both groups, journals started after 2001 have a lower 2011 median SJR than those started before 1996.

In [141]:

OA_sjr_big4.T.plot(kind='bar', rot=0,title='2011 median SJR of OA journals from Big 4, by start year\n Big 4 = USA, UK, Netherlands & Germany');

In [142]:

median_snip = pd.DataFrame([oa_era1['2011 SNIP2'].median(), oa_era2['2011 SNIP2'].median(), oa_era3['2011 SNIP2'].median()])
median_snip_col = median_snip
tmp = median_snip_col.rename(columns={0: 'Big4 median SNIP'}).T
median_snip_big4 = tmp.rename(columns={0:'Before 1996', 1: '1996-2001', 2:'2002-2011'})

In [143]:

OA_big4 = median_snip_big4.T
OA_big4['Non-Big4 median SNIP'] = ([non_era1['2011 SNIP2'].median(), non_era2['2011 SNIP2'].median(), non_era3['2011 SNIP2'].median()])
OA_big4.T

Out[143]:

	Before 1996	1996-2001	2002-2011
Big4 median SNIP	0.872	0.834	0.812
Non-Big4 median SNIP	0.554	0.403	0.340

In [144]:

OA_big4.T.plot(kind='bar', rot=0,title='Median SNIP of OA journals from Big 4, by start year\n Big 4 = USA, UK, Netherlands & Germany');

In [145]:

big4_closed = closed[closed.Country.isin(country_list)]
non_closed = closed[closed.Country.isin(nonlst)]

In [146]:

big4_closed_snip = find_snip(big4_closed)
non_closed_snip = find_snip(non_closed)

In [147]:

big4_closed_sjr = find_sjr(big4_closed)
non_closed_sjr = find_sjr(non_closed)

Never mind start year: the analyses below are all about access status, country group, and the year of the score.

Good news: Big4 OA journals have been shooting up in terms of impact.

In [148]:

plt.figure()
sjr_combo = pd.DataFrame([oasjr.median(), nonsjr.median(), big4_closed_sjr.median(), non_closed_sjr.median()]).T
sjr_combo = sjr_combo.rename(columns={0: 'Big4 OA', 1: 'Non-Big4 OA', 2: 'Big4 closed', 3: 'Non-Big4 closed'})
sjr_combo.plot(rot=30);
plt.show()

In [149]:

sjr_combo['Big4 diff'] = sjr_combo['Big4 closed'] - sjr_combo['Big4 OA']
sjr_combo['Non-Big4 diff'] = sjr_combo['Non-Big4 closed'] - sjr_combo['Non-Big4 OA']
sjr_combo[['Big4 diff', 'Non-Big4 diff']].plot(rot=30, title='SJR difference (closed - OA), by country group');

In [150]:

sjr_combo.describe()

Out[150]:

	Big4 OA	Non-Big4 OA	Big4 closed	Non-Big4 closed	Big4 diff	Non-Big4 diff
count	13.000000	13.000000	13.000000	13.000000	13.000000	13.000000
mean	0.143077	0.051462	0.293769	0.101000	0.150692	0.049538
std	0.166178	0.070902	0.099030	0.049156	0.071100	0.053258
min	0.000000	0.000000	0.173000	0.000000	0.015000	-0.024000
25%	0.000000	0.000000	0.210000	0.101000	0.114000	0.000000
50%	0.106000	0.000000	0.264000	0.106000	0.166000	0.020000
75%	0.260000	0.104000	0.374000	0.124000	0.205000	0.101000
max	0.445000	0.186000	0.460000	0.162000	0.245000	0.111000

In [151]:

kind_combo = pd.DataFrame([oa_snip.median(), non_snip.median(), big4_closed_snip.median(), non_closed_snip.median()]).T
kind_combo = kind_combo.rename(columns={0: 'Big4 OA', 1: 'non-Big4 OA', 2: 'Big4 closed', 3: 'Non-Big4 closed'})

In [152]:

kind_combo.describe()

Out[152]:

	Big4 OA	non-Big4 OA	Big4 closed	Non-Big4 closed
count	13.000000	13.000000	13.000000	13.000000
mean	0.634885	0.302077	0.783808	0.247654
std	0.144327	0.041310	0.035933	0.042857
min	0.420000	0.239000	0.723000	0.201000
25%	0.522500	0.268000	0.770000	0.216500
50%	0.631000	0.312000	0.781000	0.236500
75%	0.759500	0.326500	0.809000	0.262000
max	0.823000	0.386000	0.847000	0.324000

In [153]:

kind_combo.to_csv("snip_origin_year.csv")

In [154]:

kind_combo.plot(rot = 30, title='Median SNIP by year, access status and country of origin');

In [155]:

kind_combo['Big4 diff'] = kind_combo['Big4 closed'] - kind_combo['Big4 OA']
kind_combo['Non-Big4 diff'] = kind_combo['Non-Big4 closed'] - kind_combo['non-Big4 OA']
kind_combo[['Big4 diff', 'Non-Big4 diff']].plot(rot = 30, title='SNIP difference (closed - OA), by country group');

In [156]:

def country_impact(df):
    # Median 2011 SNIP and SJR per country for the journals in df, together with
    # each country's total journal count taken from the full impact table.
    impacts = df.groupby(by='Country')
    pop_countries = pd.DataFrame(impact.Country.value_counts(), columns=['Num_journals_total'])
    country_median = pd.DataFrame(impacts['2011 SNIP2'].median(), columns=['Median snip'])
    country_median['Num_journals_total'] = pop_countries.Num_journals_total
    country_median['Median SJR'] = impacts['2011 SJR2'].median()

    # The scatter plots use the log of the journal count (np.log; a bare log is not defined here)
    plt.figure()
    plt.scatter(x=np.log(country_median['Num_journals_total']), y=country_median['Median snip']);
    plt.scatter(x=np.log(country_median['Num_journals_total']), y=country_median['Median SJR'], c='y', marker='+');
    plt.title('Median impact by log of number of journals published in that country\nBlue is SNIP, yellow is SJR')

    # The correlations use the raw journal count, not its log
    clean = country_median.dropna(how='any')
    slope, intercept, r_value, p_value, std_err = stats.linregress(clean['Median snip'], clean['Num_journals_total'])
    print('SNIP. r = {0}, p = {1}'.format(r_value, p_value))

    slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(clean['Median SJR'], clean['Num_journals_total'])
    print('SJR. r = {0}, p = {1}'.format(r_value1, p_value1))

    return country_median

For both open and closed access journals, the more journals a country has, the higher the median journal's impact rating in that country. The correlations are moderate (r between 0.30 and 0.60) and positive in all four cases.

In [157]:

open_access = country_impact(matches)

SNIP. r = 0.30092808042, p = 0.00704211226706
SJR. r = 0.388444556851, p = 0.000403882110151

In [158]:

closed_access = country_impact(closed)

SNIP. r = 0.466675569107, p = 3.1207456288e-06
SJR. r = 0.60358379053, p = 2.40787599202e-10
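The scatter plots put the log of the journal count on the x-axis, while the r values above are computed on the raw count. When the relationship is closer to linear in the log, the two can differ noticeably. A self-contained sketch with invented numbers follows; `pearson_r` is a plain implementation written for this example and is not the SciPy routine used above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical countries: impact rises by a fixed step for each tenfold increase in journals.
journal_counts = [10, 100, 1000, 10000]
median_snip = [0.2, 0.4, 0.6, 0.8]

r_raw = pearson_r(journal_counts, median_snip)
r_log = pearson_r([math.log(c) for c in journal_counts], median_snip)

print(round(r_raw, 2))
print(round(r_log, 2))
```

In this made-up case the correlation with the log count is essentially 1, while the correlation with the raw count is lower, so it is worth deciding which of the two the analysis is meant to report.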

In [159]:

combo = pd.merge(open_access, closed_access, suffixes=('_open', '_closed'), left_on=open_access.index, right_on=closed_access.index)
combo = combo.set_index('key_0')
combo['snip_diff'] = combo['Median snip_open'] - combo['Median snip_closed']
combo['sjr_diff'] = combo['Median SJR_open'] - combo['Median SJR_closed']
for_plot = combo[combo['Num_journals_total_open'] >= 150]

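Within most countries, the median open access journal has a higher SNIP and SJR than the median closed access journal (the median per-country difference is 0.14 for SNIP and 0.03 for SJR), even though the pooled medians earlier in these notes favor closed access.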

In [160]:

for_plot.snip_diff.plot(kind='bar', title='Median SNIP open - median SNIP closed\n(Negative means closed has higher SNIP)');

In [161]:

for_plot.sjr_diff.plot(kind='bar', title='Median SJR open - median SJR closed\n(Negative means closed has higher SJR)');

In [162]:

combo.snip_diff.hist(bins=30);

In [163]:

combo.snip_diff.describe()

Out[163]:

count    73.000000
mean      0.119822
std       0.253111
min      -0.900000
25%       0.035500
50%       0.139500
75%       0.272000
max       0.776000

In [164]:

combo.sjr_diff.hist(bins=30);

In [165]:

combo.sjr_diff.describe()

Out[165]:

count    74.000000
mean      0.033054
std       0.110953
min      -0.479000
25%      -0.004000
50%       0.028750
75%       0.075625
max       0.455000
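The per-country differences above are mostly positive, yet the pooled medians earlier in these notes are higher for closed access. One way both can hold is that closed journals are concentrated in countries where all journals score higher. The toy data below are invented purely to show that mechanism; they are not a claim about which countries drive the real result.

```python
import statistics

# Hypothetical SNIP scores per country: (open access, closed access).
scores = {
    'Country A': ([0.9, 1.0], [0.7, 0.8, 0.8, 0.8, 0.8, 0.8]),
    'Country B': ([0.3, 0.3, 0.3, 0.3, 0.3, 0.3], [0.1, 0.2]),
}

# Within each country, open access has the higher median.
within = {
    country: statistics.median(open_) - statistics.median(closed_)
    for country, (open_, closed_) in scores.items()
}

# Pooling all journals reverses the comparison.
pooled_open = [s for open_, _ in scores.values() for s in open_]
pooled_closed = [s for _, closed_ in scores.values() for s in closed_]
pooled_diff = statistics.median(pooled_open) - statistics.median(pooled_closed)

print(all(diff > 0 for diff in within.values()))  # True
print(pooled_diff < 0)                            # True
```

Most closed journals in this example sit in the high-scoring country and most open journals in the low-scoring one, so the pooled comparison mainly reflects the country mix.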

In [166]:

combo.head()

Out[166]:

	Median snip_open	Num_journals_total_open	Median SJR_open	Median snip_closed	Num_journals_total_closed	Median SJR_closed	snip_diff	sjr_diff
key_0
Argentina	0.281	60	0.142	0.213	60	0.1235	0.068	0.0185
Australia	0.267	390	0.202	0.445	390	0.2260	-0.178	-0.0240
Austria	0.104	79	0.116	0.225	79	0.1240	-0.121	-0.0080
Bangladesh	0.734	22	0.261	0.227	22	0.1280	0.507	0.1330
Belgium	0.519	181	0.183	0.199	181	0.1245	0.320	0.0585

In [167]:

clean = combo.dropna(how='any')
slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(clean['snip_diff'], clean['Num_journals_total_open'])
print('{0} {1}'.format(r_value1, p_value1))

-0.158138988222 0.181469148274

In [168]:

plt.figure()
plt.scatter(np.log(clean['Num_journals_total_open']), clean['snip_diff']);