This code accompanies this blog post.
Source data is available at Journal Metrics and at the Directory of Open Access Journals.
If you reuse any code or figures, please credit Caitlin Rivers and include a link to my work. I'd also love it if you let me know at @cmyeaton.
Plot colors:
from __future__ import division
import pandas as pd
#pd.set_printoptions(max_rows=100, max_columns=10)
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 10)
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
impact = pd.read_csv('../../SNIP_SJR_complete_1999_2011new_SNIP_and_SJR_v1_Oct_2012.csv')
open_access = pd.read_csv('../data/open_access_journals.csv')
The ISSN data are somewhat messy. The open_access ISSN numbers have a - in the middle. Some of the impact ISSNs have trailing spaces.
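import re

def rm_issn_punc(x):
    # Replace punctuation (including the hyphen in the open access ISSNs) with
    # spaces, then drop the spaces so only the alphanumeric characters remain.
    punc = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", x)
    space = ''.join(punc.split(" "))
    return space

def strip_space(x):
    # Store all-digit ISSNs as ints; otherwise strip the surrounding spaces.
    try:
        new = int(x)
    except Exception:
        new = str(x).strip(' ')
    return new

def membership(x):
    # Returns True when the ISSN is NOT on the open access list,
    # i.e. when the journal is treated as closed access.
    blean = True
    open_lst = np.array(open_access.issn)
    if x in open_lst:
        blean = False
    return blean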
open_access['issn_'] = open_access['ISSN'].map(rm_issn_punc)
open_access['issn'] = open_access['issn_'].map(strip_space)
impact['issn'] = impact['Print ISSN'].map(strip_space)
impact.issn = impact.issn.replace('nan', np.nan)
open_access.issn = open_access.issn.replace('nan', np.nan)
impact = impact.drop_duplicates(cols='issn')
open_access = open_access.drop_duplicates(cols='issn')
impact['closed'] = impact['issn'].map(membership)
closed = impact[impact.closed != False]
impact.closed.value_counts()
True     28235
False     2539
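The cleanup and membership check above can also be sketched in plain Python with a set lookup. This is only an illustration: `normalize_issn` is a hypothetical helper, the ISSN values are made up, and unlike `strip_space` it keeps every ISSN as a string (so leading zeros survive) rather than converting to `int`.

```python
import re

def normalize_issn(raw):
    """Return an ISSN with hyphens, spaces and other punctuation removed."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^0-9A-Za-z]", "", str(raw))
    return cleaned.upper() or None

# Hypothetical sample values, not taken from the real files.
open_issns = {normalize_issn(v) for v in ["1234-5679", "2049-3630 "]}
impact_issns = ["12345679 ", "00280836", " 20493630"]

# True means the journal is NOT on the open access list ("closed").
closed_flags = [normalize_issn(v) not in open_issns for v in impact_issns]
print(closed_flags)
```

A set lookup avoids rebuilding an array for every row, which is what `membership` does each time it is called.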
The open access list and the impact database are merged on ISSN rather than on title, to avoid character encoding differences between the two sources (e.g. see row 18 below).
matches = pd.merge(impact, open_access, left_on=impact['issn'], right_on=['issn']).drop_duplicates()
matches = matches.dropna(how='all')
matches = matches.drop_duplicates(cols='Title')
matches['Country'] = matches['Country_x']
matches[['Source Title', 'Title']].head(20)
| | Source Title | Title |
---|---|---|
0 | AACL Bioflux | Aquaculture, Aquarium, Conservation & Legislation |
1 | Abstract and Applied Analysis | Abstract and Applied Analysis |
2 | Academia | Academia : Revista Latinoamericana de Administ... |
3 | ACIMED | ACIMED |
4 | ACME | ACME : An International e-Journal for Critical... |
5 | Acoustical Science and Technology | Acoustical Science and Technology |
6 | Acta Adriatica | Acta Adriatica |
7 | Acta Agriculturae Slovenica | Acta Agriculturae Slovenica |
8 | Acta Amazonica | Acta Amazonica |
9 | Acta Biochimica Polonica | Acta Biochimica Polonica |
10 | Acta Bioethica | Acta Bioethica |
11 | Acta Biologica Colombiana | Acta Biológica Colombiana |
12 | Acta Biologica Cracoviensia Series Botanica | Acta Biologica Cracoviensia Series Botanica |
13 | Acta Bioquimica Clinica Latinoamericana | Acta Bioquímica Clínica Latinoamericana |
14 | Acta Botanica Brasilica | Acta Botanica Brasilica |
15 | Acta Botanica Croatica | Acta Botanica Croatica |
16 | Acta Botanica Malacitana | Acta Botanica Malacitana |
17 | Acta Botanica Mexicana | Acta Botánica Mexicana |
18 | Acta Botanica Venezuelica | Acta Botánica Venezuélica |
19 | Acta Chimica Slovenica | Acta Chimica Slovenica |
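A minimal sketch of why the join key matters, using the two spellings from row 18 of the table above. The ISSN `"11111111"` is a made-up placeholder, and the dictionaries stand in for the two data frames.

```python
# Hypothetical records keyed by a normalized ISSN (the ISSN is invented).
impact_titles = {"11111111": "Acta Botanica Venezuelica"}
open_titles = {"11111111": "Acta Botánica Venezuélica"}

# A title-based comparison fails because of the accented characters...
title_match = impact_titles["11111111"] == open_titles["11111111"]

# ...while a join on ISSN still pairs the two records.
shared = set(impact_titles) & set(open_titles)
joined = {issn: (impact_titles[issn], open_titles[issn]) for issn in shared}

print(title_match)
print(joined)
```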
print 'Closed: {0}, Open: {1}, Full list: {2}'.format(len(closed.issn.dropna().unique()), len(open_access.issn.dropna().unique()), len(impact.issn.dropna().unique()))
Closed: 28234, Open: 8597, Full list: 30773
Definitions from http://www.journalmetrics.com/faq.php
BY SNIP
SNIP, or Source-Normalized Impact per Paper, measures a source’s contextual citation impact. It takes into account characteristics of the source's subject field, especially the frequency at which authors cite other papers in their reference lists, the speed at which citation impact matures, and the extent to which the database used in the assessment covers the field’s literature. SNIP is the ratio of a source's average citation count per paper to the ‘citation potential’ of its subject field. It aims to allow direct comparison of sources in different subject fields.
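As a rough illustration of that ratio, the sketch below divides citations per paper by a field's citation potential. The function name and all numbers are hypothetical, and the real SNIP calculation involves more detail than this one-line division.

```python
def snip(citations_per_paper, citation_potential):
    """Ratio of average citations per paper to the field's citation potential."""
    if citation_potential <= 0:
        raise ValueError("citation_potential must be positive")
    return citations_per_paper / citation_potential

# Hypothetical journals: one in a heavily citing field, one in a lightly citing field.
life_sci = snip(citations_per_paper=6.0, citation_potential=4.0)
maths = snip(citations_per_paper=1.5, citation_potential=1.0)
print(life_sci, maths)
```

Both journals end up with the same score even though their raw citation rates differ fourfold, which is the sense in which the metric is meant to be comparable across fields.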
top_impact_all = impact[['Source Title', '2011 SNIP2']].copy()
top_impact_all = pd.DataFrame(top_impact_all.sort('2011 SNIP2', ascending=False).dropna(), columns=['Source Title', '2011 SNIP2'])
top_impact_all['2011 SJR2'] = impact['2011 SJR2']
top_impact_all['Difference'] = top_impact_all['2011 SNIP2'] - top_impact_all['2011 SJR2']
top_impact_all.head(15)
| | Source Title | 2011 SNIP2 | 2011 SJR2 | Difference |
---|---|---|---|---|
4874 | CA - A Cancer Journal for Clinicians | 41.082 | 24.976 | 16.106 |
10365 | Foundations and Trends in Information Retrieval | 32.028 | 10.411 | 21.617 |
26384 | Reviews of Modern Physics | 22.129 | 36.194 | -14.065 |
110 | ACM Computing Surveys | 17.848 | 9.926 | 7.922 |
16434 | Journal of Engineering Education | 16.072 | 1.358 | 14.714 |
22052 | New England Journal of Medicine | 14.971 | 9.740 | 5.231 |
25177 | Progress in Materials Science | 13.535 | 10.127 | 3.408 |
2280 | Annual Review of Psychology | 12.013 | 8.137 | 3.876 |
1004 | Advances in Physics | 11.750 | 26.216 | -14.466 |
12309 | IEEE Communications Surveys and Tutorials | 11.584 | 6.315 | 5.269 |
23931 | Physics Reports | 11.363 | 10.761 | 0.602 |
5538 | Chemical Reviews | 11.350 | 15.866 | -4.516 |
16321 | Journal of Economic Literature | 10.738 | 13.121 | -2.383 |
2257 | Annual Review of Immunology | 10.680 | 31.166 | -20.486 |
25160 | Progress in Energy and Combustion Science | 10.579 | 6.430 | 4.149 |
BY SJR
SJR, or SCImago Journal Rank, is a measure of the scientific prestige of scholarly sources.
SJR assigns relative scores to all of the sources in a citation network. Its methodology is inspired by the Google PageRank algorithm, in that not all citations are equal. A source transfers its own 'prestige', or status, to another source through the act of citing it. A citation from a source with a relatively high SJR is worth more than a citation from a source with a lower SJR.
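The idea of prestige flowing through citations can be sketched with a generic PageRank-style power iteration. This is not the actual SJR algorithm, only a simplified stand-in; the function name, the damping value of 0.85, and the three-source network are all assumptions made for illustration.

```python
def prestige_scores(citations, damping=0.85, iterations=100):
    """PageRank-style power iteration over a citation graph.

    `citations` maps each source to the list of sources it cites.
    """
    sources = sorted(set(citations) | {t for ts in citations.values() for t in ts})
    n = len(sources)
    score = {s: 1.0 / n for s in sources}
    for _ in range(iterations):
        new = {s: (1.0 - damping) / n for s in sources}
        for src in sources:
            targets = citations.get(src, [])
            if targets:
                # A source splits its current prestige among the sources it cites.
                share = damping * score[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A source that cites nothing spreads its prestige evenly.
                for t in sources:
                    new[t] += damping * score[src] / n
        score = new
    return score

# Hypothetical network: A and B both cite C; C cites A.
toy = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = prestige_scores(toy)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy network A outranks B even though each is cited at most once, because A's single citation comes from the most prestigious source.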
top_impact_all.sort('2011 SJR2', ascending=False).dropna().head(15)
| | Source Title | 2011 SNIP2 | 2011 SJR2 | Difference |
---|---|---|---|---|
26384 | Reviews of Modern Physics | 22.129 | 36.194 | -14.065 |
2257 | Annual Review of Immunology | 10.680 | 31.166 | -20.486 |
1004 | Advances in Physics | 11.750 | 26.216 | -14.466 |
4874 | CA - A Cancer Journal for Clinicians | 41.082 | 24.976 | 16.106 |
2232 | Annual Review of Biochemistry | 8.276 | 23.856 | -15.580 |
21729 | Nature Genetics | 7.211 | 19.919 | -12.708 |
5311 | Cell | 6.579 | 19.779 | -13.200 |
2265 | Annual Review of Neuroscience | 7.609 | 17.014 | -9.405 |
2254 | Annual Review of Genetics | 5.113 | 16.628 | -11.515 |
25646 | Quarterly Journal of Economics | 6.621 | 16.230 | -9.609 |
5538 | Chemical Reviews | 11.350 | 15.866 | -4.516 |
21732 | Nature Materials | 7.960 | 15.413 | -7.453 |
2276 | Annual Review of Plant Biology | 10.257 | 14.740 | -4.483 |
21709 | Nature | 8.647 | 14.548 | -5.901 |
21731 | Nature Immunology | 3.912 | 14.286 | -10.374 |
open_lang = open_access.Language.value_counts().head(10)/len(open_access)*100
open_lang
English                55.810166
Spanish                 6.036990
Portuguese              3.315110
Spanish, English        3.222054
English, French         1.395836
Portuguese, English     1.291148
Portuguese, Spanish     1.116669
English, Spanish        1.058509
French                  1.046877
English                 0.849133
plt.figure()
open_lang.plot(kind='bar', title='Most common languages, open access journals (%)', color='green', alpha=.3);
plt.show()
open_keyword = open_access.Keyword.value_counts().head(15)/len(open_access)*100
open_keyword
health sciences          0.988717
medicine                 0.523438
education                0.442015
mathematics              0.418751
medical sciences         0.348959
psychology               0.348959
human sciences           0.325695
philosophy               0.279167
biological sciences      0.255903
social sciences          0.232639
chemistry                0.209375
agricultural sciences    0.209375
law                      0.162848
public health            0.162848
engineering              0.151216
open_keyword.plot(kind='bar', title='Most common keywords, open access journals (%)', color='green', alpha=.3);
open_access.to_csv('open_access.csv')
plt.figure()
open_access['Start Year'].hist(range=(1980, 2012), bins=30, color='green', alpha=.3)
plt.title('Histogram of start year, open access journals');
timeline = open_access.sort('Start Year')
timeline[['Title', 'Start Year', 'End Year']].head(10)
| | Title | Start Year | End Year |
---|---|---|---|
914 | Bijdragen Tot de Taal-, Land- en Volkenkunde v... | 1853 | 1948 |
6652 | Psyche : A Journal of Entomology | 1874 | NaN |
2616 | Fishery Bulletin | 1881 | NaN |
1222 | Bulletin of the American Museum of Natural His... | 1881 | NaN |
7868 | South African Medical Journal | 1884 | NaN |
1225 | Bulletin of the Geological Society of Denmark | 1894 | NaN |
5544 | Memórias do Instituto Oswaldo Cruz. | 1909 | NaN |
4561 | Journal of Genetics | 1910 | NaN |
1232 | Bulletin of the Medical Library Association | 1911 | 2001 |
5773 | Nieuwe West-Indische Gids | 1919 | 1991 |
fee = open_access['Publication fee'].value_counts()/len(open_access)*100
fee
No                     66.418518
Yes                    27.939979
Conditional             2.954519
Information missing     2.512504
fee.plot(kind='bar', title='Histogram of fee required (%)', color='green', alpha=.3, rot=0);
There are 6,059 open access journals that do not have a ranking, and will therefore not be included in the analysis.
len(open_access) - len(matches)
6059
closed_field = closed[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(closed)
open_field = matches[['Physical sciences', 'Life sciences', 'Social sciences']].count()/len(matches)
closed_field.plot(color='red', kind='bar', alpha=.7);
tmp = pd.DataFrame(open_field, columns=['Open access'])
tmp['Closed access'] = closed_field
plt.figure()
tmp.plot(kind='bar', color=['green', 'red'], alpha=.5, title='Comparison of discipline', rot=0);
countries = pd.DataFrame(closed.Country.value_counts(), columns=['closed'])
countries['open'] = open_access.Country.value_counts()
countries['proportion_oa'] = countries['open']/countries['closed']
countries_sorted = countries.sort('proportion_oa', ascending=False)
countries_sorted[countries_sorted.proportion_oa >= 0][:10]
| | closed | open | proportion_oa |
---|---|---|---|
Colombia | 6 | 208 | 34.666667 |
Costa Rica | 1 | 26 | 26.000000 |
Egypt | 20 | 351 | 17.550000 |
Chile | 17 | 142 | 8.352941 |
Indonesia | 6 | 45 | 7.500000 |
Peru | 4 | 29 | 7.250000 |
Cuba | 7 | 50 | 7.142857 |
Brazil | 121 | 806 | 6.661157 |
Venezuela | 16 | 85 | 5.312500 |
Tunisia | 2 | 10 | 5.000000 |
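Note that `proportion_oa` is the ratio of open to closed counts, so it can exceed 1. The sketch below, with a hypothetical helper name, contrasts that ratio with the open access share of the combined count, using the Colombia and Brazil rows from the table above. The share is only approximate, because the open counts come from the full open access list while the closed counts come from the ranked journals.

```python
def oa_ratio_and_share(n_open, n_closed):
    """Return (open/closed ratio, open share of the combined count)."""
    ratio = float(n_open) / n_closed
    share = float(n_open) / (n_open + n_closed)
    return ratio, share

# Counts copied from the table above: (country, closed, open).
rows = [("Colombia", 6, 208), ("Brazil", 121, 806)]
for country, n_closed, n_open in rows:
    ratio, share = oa_ratio_and_share(n_open, n_closed)
    print("{0}: ratio = {1:.2f}, share = {2:.1%}".format(country, ratio, share))
```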
plt.figure()
countries.closed.head(10).plot(color='red', kind='bar', alpha=.5)
countries.open.head(10).plot(color='green', kind='bar', alpha=.6, title='Comparison of top journal producers (number of journals)\nGreen=Open access, red=Closed access', rot=30);
countries_sorted
<class 'pandas.core.frame.DataFrame'>
Index: 114 entries, Moldova, Republic of to Netherlands
Data columns:
closed           114 non-null values
open              91 non-null values
proportion_oa     91 non-null values
dtypes: float64(2), int64(1)
countries_sorted.proportion_oa[countries_sorted.proportion_oa > 0].head(10).plot(kind='bar', color='g', rot=30, alpha=.5,
    title='Countries with highest ratio of OA to closed journals');
snip_dist = pd.DataFrame(closed['2011 SNIP2'], columns=['2011 closed SNIP'])
snip_dist['2011 open SNIP'] = matches['2011 SNIP2']
All SNIP scores (outliers clipped)
snip_dist[snip_dist['2011 closed SNIP'] <15].boxplot(sym='m+');
snip_dist.describe()
| | 2011 closed SNIP | 2011 open SNIP |
---|---|---|
count | 16617.000000 | 1906.000000 |
mean | 0.831351 | 0.570842 |
std | 0.980973 | 0.520289 |
min | 0.000000 | 0.000000 |
25% | 0.266000 | 0.203000 |
50% | 0.675000 | 0.480500 |
75% | 1.118000 | 0.806750 |
max | 41.082000 | 4.814000 |
sjr_dist = pd.DataFrame(closed['2011 SJR2'], columns=['2011 closed SJR'])
sjr_dist['2011 open SJR'] = matches['2011 SJR2']
Outliers clipped
sjr_dist[sjr_dist['2011 closed SJR']<15].boxplot(sym='m+');
sjr_dist.describe()
| | 2011 closed SJR | 2011 open SJR |
---|---|---|
count | 16881.000000 | 1970.000000 |
mean | 0.639512 | 0.383745 |
std | 1.142131 | 0.558075 |
min | 0.000000 | 0.000000 |
25% | 0.140000 | 0.128000 |
50% | 0.319000 | 0.217000 |
75% | 0.743000 | 0.408750 |
max | 36.194000 | 7.581000 |
def find_snip(db):
    snip_out = db[['1999 SNIP2', '2000 SNIP2', '2001 SNIP2', '2002 SNIP2', '2003 SNIP2', '2004 SNIP2', '2005 SNIP2', '2006 SNIP2',
                   '2007 SNIP2', '2008 SNIP2', '2009 SNIP2', '2010 SNIP2', '2011 SNIP2']]
    return snip_out

def select_snip_by_era(country_db):
    snip = find_snip(country_db)
    era1 = snip[country_db['Start Year'] < 1996]
    era2 = snip[(country_db['Start Year'] >= 1996) & (country_db['Start Year'] <= 2001)]
    era3 = snip[country_db['Start Year'] > 2001]
    return snip, era1, era2, era3

def find_sjr(db):
    sjr_out = db[['1999 SJR2', '2000 SJR2', '2001 SJR2', '2002 SJR2', '2003 SJR2', '2004 SJR2', '2005 SJR2', '2006 SJR2',
                  '2007 SJR2', '2008 SJR2', '2009 SJR2', '2010 SJR2', '2011 SJR2']]
    return sjr_out

def select_sjr_by_era(country_db):
    sjr = find_sjr(country_db)
    era1 = sjr[country_db['Start Year'] < 1996]
    era2 = sjr[(country_db['Start Year'] >= 1996) & (country_db['Start Year'] <= 2001)]
    era3 = sjr[country_db['Start Year'] > 2001]
    return sjr, era1, era2, era3
open_years = find_snip(matches)
closed_years = find_snip(closed)
plt.figure()
open_years.median().plot(style='g');
closed_years.median().plot(style='--', title='Median SNIP score\nGreen = Open access\nRed = Closed access', rot=30, color='r');
clean = pd.DataFrame(open_years.median(), columns=['open'])
clean['closed'] = closed_years.median()
clean['median_difference'] = clean.closed - clean.open
plt.figure()
clean.median_difference.plot(title='Median difference (closed median - open median)', rot=30, style='m');
clean
| | open | closed | median_difference |
---|---|---|---|
1999 SNIP2 | 0.3340 | 0.6810 | 0.3470 |
2000 SNIP2 | 0.3640 | 0.6750 | 0.3110 |
2001 SNIP2 | 0.2895 | 0.6090 | 0.3195 |
2002 SNIP2 | 0.3040 | 0.5890 | 0.2850 |
2003 SNIP2 | 0.3310 | 0.6010 | 0.2700 |
2004 SNIP2 | 0.3890 | 0.6170 | 0.2280 |
2005 SNIP2 | 0.4330 | 0.6500 | 0.2170 |
2006 SNIP2 | 0.4600 | 0.6310 | 0.1710 |
2007 SNIP2 | 0.3970 | 0.6130 | 0.2160 |
2008 SNIP2 | 0.3890 | 0.6170 | 0.2280 |
2009 SNIP2 | 0.3990 | 0.6285 | 0.2295 |
2010 SNIP2 | 0.4040 | 0.6400 | 0.2360 |
2011 SNIP2 | 0.4745 | 0.6750 | 0.2005 |
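One way to read the table is to express the gap relative to the closed median, which shows how much it has narrowed. The sketch below does that arithmetic for the first and last years, with values copied from the table above. The helper name `relative_gap` is hypothetical.

```python
# Median SNIP values copied from the table above.
medians = {
    1999: {"open": 0.3340, "closed": 0.6810},
    2011: {"open": 0.4745, "closed": 0.6750},
}

def relative_gap(open_median, closed_median):
    """Gap between closed and open medians as a fraction of the closed median."""
    return (closed_median - open_median) / closed_median

gaps = {year: relative_gap(v["open"], v["closed"]) for year, v in medians.items()}
for year in sorted(gaps):
    print("{0}: {1:.1%}".format(year, gaps[year]))
```

By this measure the open access median sat about 51% below the closed median in 1999 and about 30% below it in 2011.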
matches[['1999 SNIP2', '2011 SNIP2']].describe()
| | 1999 SNIP2 | 2011 SNIP2 |
---|---|---|
count | 459.000000 | 2134.000000 |
mean | 0.445412 | 0.564373 |
std | 0.484183 | 0.509639 |
min | 0.000000 | 0.000000 |
25% | 0.147000 | 0.201250 |
50% | 0.334000 | 0.474500 |
75% | 0.592500 | 0.807000 |
max | 3.797000 | 4.814000 |
matches[['1999 SNIP2', '2011 SNIP2']].median()
1999 SNIP2    0.3340
2011 SNIP2    0.4745
matches['1999 SNIP2'].hist()
open_years_sjr = find_sjr(matches)
closed_years_sjr = find_sjr(closed)
open_years_sjr.median().plot(style='g');
closed_years_sjr.median().plot(style='--', title='Median SJR score\nGreen = open journals\nBlue (dashed) = closed journals', rot=30);
sjr_diff = closed_years_sjr.median() - open_years_sjr.median()
sjr_diff
1999 SJR2    0.113
2000 SJR2    0.121
2001 SJR2    0.126
2002 SJR2    0.131
2003 SJR2    0.139
2004 SJR2    0.151
2005 SJR2    0.169
2006 SJR2    0.193
2007 SJR2    0.112
2008 SJR2    0.123
2009 SJR2    0.125
2010 SJR2    0.109
2011 SJR2    0.103
sjr_diff.plot(title='Closed journal SJR advantage over open journal', rot=30, ylim=(.0, .75), style='m');
matches[['1999 SJR2', '2011 SJR2']].describe()
| | 1999 SJR2 | 2011 SJR2 |
---|---|---|
count | 2201.000000 | 2201.000000 |
mean | 0.077542 | 0.378843 |
std | 0.300900 | 0.542297 |
min | 0.000000 | 0.000000 |
25% | 0.000000 | 0.128000 |
50% | 0.000000 | 0.216000 |
75% | 0.101000 | 0.408000 |
max | 6.916000 | 7.581000 |
country_list = ['United States', 'United Kingdom', 'Germany','Netherlands']
nonlst = pd.DataFrame(impact.Country.value_counts()[4:])
nonlst = list(nonlst.index)
oa_big4 = matches[matches.Country_x.isin(country_list)]
oa_nonbig4 = matches[matches.Country_x.isin(nonlst)]
Separate OA and closed by era, then by SJR and SNIP
oa_snip, oa_era1, oa_era2, oa_era3 = select_snip_by_era(oa_big4)
non_snip, non_era1, non_era2, non_era3 = select_snip_by_era(oa_nonbig4)
oasjr, oasjr1, oasjr2, oasjr3 = select_sjr_by_era(oa_big4)
nonsjr, nonsjr1, nonsjr2, nonsjr3 = select_sjr_by_era(oa_nonbig4)
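The era cutoffs used by the selectors can be summarized as a small labelling function. This is only a sketch: `era_label` and its label strings are hypothetical and not part of the notebook, and journals with a missing start year get no label, mirroring the way they fall outside all three boolean filters.

```python
def era_label(start_year):
    """Bucket a journal by start year using the same cutoffs as the selectors."""
    if start_year is None or start_year != start_year:  # None or NaN
        return None
    if start_year < 1996:
        return "before 1996"
    if start_year <= 2001:
        return "1996-2001"
    return "after 2001"

for year in (1995, 1996, 2001, 2002, float("nan")):
    print(year, era_label(year))
```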
Find the 2011 median SJR of open access journals, by start-year era and by whether they come from one of the Big 4 countries.
median_sjr = pd.DataFrame([oasjr1['2011 SJR2'].median(), oasjr2['2011 SJR2'].median(), oasjr3['2011 SJR2'].median()])
median_sjr_col = median_sjr
tmp = median_sjr_col.rename(columns={0: 'Big4 Median SJR'}).T
median_sjr_big4 = tmp.rename(columns={0:'Before 1996', 1: '1996-2001', 2:'2002-2011'})
OA_sjr_big4 = median_sjr_big4.T
OA_sjr_big4['Non-Big4 median SJR'] = ([nonsjr1['2011 SJR2'].median(), nonsjr2['2011 SJR2'].median(), nonsjr3['2011 SJR2'].median()])
OA_sjr_big4.T
| | Before 1996 | 1996-2001 | 2002-2011 |
---|---|---|---|
Big4 Median SJR | 0.489 | 0.5005 | 0.418 |
Non-Big4 median SJR | 0.223 | 0.1940 | 0.171 |
In both groups, journals started after 2001 have a lower 2011 median SJR than those started before 1996.
OA_sjr_big4.T.plot(kind='bar', rot=0,title='2011 median SJR of OA journals from Big 4, by start year\n Big 4 = USA, UK, Netherlands & Germany');
median_snip = pd.DataFrame([oa_era1['2011 SNIP2'].median(), oa_era2['2011 SNIP2'].median(), oa_era3['2011 SNIP2'].median()])
median_snip_col = median_snip
tmp = median_snip_col.rename(columns={0: 'Big4 median SNIP'}).T
median_snip_big4 = tmp.rename(columns={0:'Before 1996', 1: '1996-2001', 2:'2002-2011'})
OA_big4 = median_snip_big4.T
OA_big4['Non-Big4 median SNIP'] = ([non_era1['2011 SNIP2'].median(), non_era2['2011 SNIP2'].median(), non_era3['2011 SNIP2'].median()])
OA_big4.T
| | Before 1996 | 1996-2001 | 2002-2011 |
---|---|---|---|
Big4 median SNIP | 0.872 | 0.834 | 0.812 |
Non-Big4 median SNIP | 0.554 | 0.403 | 0.340 |
OA_big4.T.plot(kind='bar', rot=0,title='Median SNIP of OA journals from Big 4, by start year\n Big 4 = USA, UK, Netherlands & Germany');
big4_closed = closed[closed.Country.isin(country_list)]
non_closed = closed[closed.Country.isin(nonlst)]
big4_closed_snip = find_snip(big4_closed)
non_closed_snip = find_snip(non_closed)
big4_closed_sjr = find_sjr(big4_closed)
non_closed_sjr = find_sjr(non_closed)
Setting start year aside, the following analyses compare journals by access status, country group (Big 4 vs. non-Big 4), and metric year.
Good news: Big 4 OA journals have been shooting up in terms of impact.
plt.figure()
sjr_combo = pd.DataFrame([oasjr.median(), nonsjr.median(), big4_closed_sjr.median(), non_closed_sjr.median()]).T
sjr_combo = sjr_combo.rename(columns={0: 'Big4 OA', 1: 'Non-Big4 OA', 2: 'Big4 closed', 3: 'Non-Big4 closed'})
sjr_combo.plot(rot=30);
plt.show()
sjr_combo['Big4 diff'] = sjr_combo['Big4 closed'] - sjr_combo['Big4 OA']
sjr_combo['Non-Big4 diff'] = sjr_combo['Non-Big4 closed'] - sjr_combo['Non-Big4 OA']
sjr_combo[['Big4 diff', 'Non-Big4 diff']].plot(rot=30, title='SJR difference (closed - OA), Big 4 and non-Big 4');
sjr_combo.describe()
| | Big4 OA | Non-Big4 OA | Big4 closed | Non-Big4 closed | Big4 diff | Non-Big4 diff |
---|---|---|---|---|---|---|
count | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 | 13.000000 |
mean | 0.143077 | 0.051462 | 0.293769 | 0.101000 | 0.150692 | 0.049538 |
std | 0.166178 | 0.070902 | 0.099030 | 0.049156 | 0.071100 | 0.053258 |
min | 0.000000 | 0.000000 | 0.173000 | 0.000000 | 0.015000 | -0.024000 |
25% | 0.000000 | 0.000000 | 0.210000 | 0.101000 | 0.114000 | 0.000000 |
50% | 0.106000 | 0.000000 | 0.264000 | 0.106000 | 0.166000 | 0.020000 |
75% | 0.260000 | 0.104000 | 0.374000 | 0.124000 | 0.205000 | 0.101000 |
max | 0.445000 | 0.186000 | 0.460000 | 0.162000 | 0.245000 | 0.111000 |
kind_combo = pd.DataFrame([oa_snip.median(), non_snip.median(), big4_closed_snip.median(), non_closed_snip.median()]).T
kind_combo = kind_combo.rename(columns={0: 'Big4 OA', 1: 'non-Big4 OA', 2: 'Big4 closed', 3: 'Non-Big4 closed'})
kind_combo.describe()
| | Big4 OA | non-Big4 OA | Big4 closed | Non-Big4 closed |
---|---|---|---|---|
count | 13.000000 | 13.000000 | 13.000000 | 13.000000 |
mean | 0.634885 | 0.302077 | 0.783808 | 0.247654 |
std | 0.144327 | 0.041310 | 0.035933 | 0.042857 |
min | 0.420000 | 0.239000 | 0.723000 | 0.201000 |
25% | 0.522500 | 0.268000 | 0.770000 | 0.216500 |
50% | 0.631000 | 0.312000 | 0.781000 | 0.236500 |
75% | 0.759500 | 0.326500 | 0.809000 | 0.262000 |
max | 0.823000 | 0.386000 | 0.847000 | 0.324000 |
kind_combo.to_csv("snip_origin_year.csv")
kind_combo.plot(rot = 30, title='Median SNIP by year, access status and country of origin');
kind_combo['Big4 diff'] = kind_combo['Big4 closed'] - kind_combo['Big4 OA']
kind_combo['Non-Big4 diff'] = kind_combo['Non-Big4 closed'] - kind_combo['non-Big4 OA']
kind_combo[['Big4 diff', 'Non-Big4 diff']].plot(rot = 30, title='SNIP difference (closed - OA), Big 4 and non-Big 4');
def country_impact(df):
    # Median 2011 SNIP and SJR per country for the journals in df.
    impacts = df.groupby(by='Country')
    # Journal counts come from the full impact table, not from df.
    pop_countries = pd.DataFrame(impact.Country.value_counts(), columns=['Num_journals_total'])
    country_median = pd.DataFrame(impacts['2011 SNIP2'].median(), columns=['Median snip'])
    country_median['Num_journals_total'] = pop_countries.Num_journals_total
    country_median['Median SJR'] = impacts['2011 SJR2'].median()
    plt.figure()
    plt.scatter(x=np.log(country_median['Num_journals_total']), y=country_median['Median snip']);
    plt.scatter(x=np.log(country_median['Num_journals_total']), y=country_median['Median SJR'], c='y', marker='+');
    plt.title('Median impact by log of number of journals published in that country\nBlue is SNIP, yellow is SJR')
    # The regressions use the raw journal counts; only the plot is on a log scale.
    clean = country_median.dropna(how='any')
    slope, intercept, r_value, p_value, std_err = stats.linregress(clean['Median snip'], clean['Num_journals_total'])
    print 'SNIP. r = {0}, p = {1}'.format(r_value, p_value)
    slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(clean['Median SJR'], clean['Num_journals_total'])
    print 'SJR. r = {0}, p = {1}'.format(r_value1, p_value1)
    return country_median
For both open and closed access journals, countries that publish more journals tend to have a higher median journal impact. The correlation is moderate (r between roughly 0.30 and 0.60 in the regressions below), so in countries with many journals the typical journal is somewhat more impactful.
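The strength of that relationship is summarized by Pearson's r. Because the scatter plots use the log of the journal count while `stats.linregress` is given the raw count, the sketch below computes r both ways. The data are hypothetical and `pearson_r` is an illustrative helper, not the notebook's method.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-country values, not taken from the data.
n_journals = [10, 100, 1000, 10000]
median_snip = [0.2, 0.4, 0.5, 0.8]

r_raw = pearson_r(n_journals, median_snip)
r_log = pearson_r([math.log(n) for n in n_journals], median_snip)
print("raw counts: r = {0:.3f}, log counts: r = {1:.3f}".format(r_raw, r_log))
```

With journal counts spanning several orders of magnitude, the log-scale r is arguably the better match for what the scatter plot shows.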
open_country = country_impact(matches)
SNIP. r = 0.30092808042, p = 0.00704211226706
SJR. r = 0.388444556851, p = 0.000403882110151
closed_country = country_impact(closed)
SNIP. r = 0.466675569107, p = 3.1207456288e-06
SJR. r = 0.60358379053, p = 2.40787599202e-10
combo = pd.merge(open_country, closed_country, suffixes=('_open', '_closed'), left_on=open_country.index, right_on=closed_country.index)
combo = combo.set_index('key_0')
combo['snip_diff'] = combo['Median snip_open'] - combo['Median snip_closed']
combo['sjr_diff'] = combo['Median SJR_open'] - combo['Median SJR_closed']
for_plot = combo[combo['Num_journals_total_open'] >= 150]
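Compared country by country, open access journals have the higher median impact in most countries: the median within-country difference (open minus closed) is about 0.14 for SNIP and about 0.03 for SJR in the summary statistics below. This contrasts with the pooled medians earlier, where closed journals score higher.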
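A per-country comparison and a pooled comparison can point in opposite directions when countries differ in both size and typical impact. The sketch below shows this with two entirely hypothetical countries. It illustrates the arithmetic only and makes no claim about which effect drives the real data.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0

# Hypothetical SNIP scores: X is a large, mostly closed, high-impact country;
# Y is a small, mostly open, low-impact country.
snip_scores = {
    "X": {"open": [1.4], "closed": [0.9, 1.0, 1.1, 1.2, 1.3]},
    "Y": {"open": [0.2, 0.3, 0.4], "closed": [0.1]},
}

# Within each country: open median minus closed median.
within = {c: median(v["open"]) - median(v["closed"]) for c, v in snip_scores.items()}

# Pooled across countries: the same difference on the combined lists.
pooled_open = median([x for v in snip_scores.values() for x in v["open"]])
pooled_closed = median([x for v in snip_scores.values() for x in v["closed"]])
pooled = pooled_open - pooled_closed

print(within)
print(pooled)
```

Here open access leads inside both countries while the pooled difference is negative, because most closed journals come from the higher-impact country.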
for_plot.snip_diff.plot(kind='bar', title='Median snip open - median snip closed\n(Negative means closed has higher snip)');
for_plot.sjr_diff.plot(kind='bar', title='SJR');
combo.snip_diff.hist(bins=30);
combo.snip_diff.describe()
count    73.000000
mean      0.119822
std       0.253111
min      -0.900000
25%       0.035500
50%       0.139500
75%       0.272000
max       0.776000
combo.sjr_diff.hist(bins=30);
combo.sjr_diff.describe()
count    74.000000
mean      0.033054
std       0.110953
min      -0.479000
25%      -0.004000
50%       0.028750
75%       0.075625
max       0.455000
combo.head()
key_0 | Median snip_open | Num_journals_total_open | Median SJR_open | Median snip_closed | Num_journals_total_closed | Median SJR_closed | snip_diff | sjr_diff |
---|---|---|---|---|---|---|---|---|
Argentina | 0.281 | 60 | 0.142 | 0.213 | 60 | 0.1235 | 0.068 | 0.0185 |
Australia | 0.267 | 390 | 0.202 | 0.445 | 390 | 0.2260 | -0.178 | -0.0240 |
Austria | 0.104 | 79 | 0.116 | 0.225 | 79 | 0.1240 | -0.121 | -0.0080 |
Bangladesh | 0.734 | 22 | 0.261 | 0.227 | 22 | 0.1280 | 0.507 | 0.1330 |
Belgium | 0.519 | 181 | 0.183 | 0.199 | 181 | 0.1245 | 0.320 | 0.0585 |
clean = combo.dropna(how='any')
slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(clean['snip_diff'], clean['Num_journals_total_open'])
print r_value1, p_value1
-0.158138988222 0.181469148274
plt.figure()
plt.scatter(np.log(clean['Num_journals_total_open']), clean['snip_diff']);