Get some data...
funder='http://roarmap.eprints.org/cgi/exportview/policymaker_type/funder/JSON/funder.js'
funderAndResearch='http://roarmap.eprints.org/cgi/exportview/policymaker_type/funder=5Fand=5Fresearch=5Forg/JSON/funder=5Fand=5Fresearch=5Forg.js'
multi_res_orgs='http://roarmap.eprints.org/cgi/exportview/policymaker_type/multiple=5Fresearch=5Forgs/JSON/multiple=5Fresearch=5Forgs.js'
res_org='http://roarmap.eprints.org/cgi/exportview/policymaker_type/research=5Forg/JSON/research=5Forg.js'
sub_unit='http://roarmap.eprints.org/cgi/exportview/policymaker_type/research=5Forg=5Fsubunit/JSON/research=5Forg=5Fsubunit.js'
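The five URLs above share one pattern: the policymaker type appears twice, with each underscore written as `=5F`. A minimal sketch of generating them from the type names, assuming that pattern holds for every type; `build_export_url`, `policymaker_types` and `export_urls` are hypothetical names, not part of ROARMAP.

```python
BASE = 'http://roarmap.eprints.org/cgi/exportview/policymaker_type'

def build_export_url(policymaker_type):
    """Hypothetical helper: build the JSON export URL for one policymaker type."""
    # Underscores appear as =5F in the export URLs listed above
    encoded = policymaker_type.replace('_', '=5F')
    return '{0}/{1}/JSON/{1}.js'.format(BASE, encoded)

policymaker_types = ['funder', 'funder_and_research_org', 'multiple_research_orgs',
                     'research_org', 'research_org_subunit']
export_urls = {t: build_export_url(t) for t in policymaker_types}
```

Keeping the URLs in a dictionary keyed by type makes it easy to loop over them later rather than naming one variable per type.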
We need to install a library to help scrape the country codes.
!pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.3.2.tar.gz (143kB)
    100% |████████████████████████████████| 143kB 2.4MB/s
Installing collected packages: beautifulsoup4
  Running setup.py install for beautifulsoup4
Successfully installed beautifulsoup4-4.3.2
import pandas as pd
import requests
%matplotlib inline
#Need this for charting
#use seaborn for prettier charts
import seaborn as sns
One approach is to fetch the data by organisation type (I'm not sure whether a single combined list exists) and then combine the results into a single data table.
funder_df=pd.read_json(funder)
funderAndResearch_df=pd.read_json(funderAndResearch)
multi_res_orgs_df=pd.read_json(multi_res_orgs)
res_org_df=pd.read_json(res_org)
sub_unit_df=pd.read_json(sub_unit)
df=pd.concat([funder_df,funderAndResearch_df,multi_res_orgs_df,res_org_df,sub_unit_df])
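One thing worth knowing about `pd.concat`: by default each input keeps its own row labels, so the combined table can contain repeated index values. A small sketch with made-up toy frames (not ROARMAP data) showing the difference `ignore_index=True` makes:

```python
import pandas as pd

# Toy stand-ins for two of the per-type tables (made-up values)
funders = pd.DataFrame({'title': ['Funder A', 'Funder B'], 'country': [250, 826]})
research_orgs = pd.DataFrame({'title': ['Org C'], 'country': [36]})

# Default behaviour: each frame keeps its own 0-based labels, so labels repeat
plain = pd.concat([funders, research_orgs])
print(plain.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 in the combined table
combined = pd.concat([funders, research_orgs], ignore_index=True)
print(combined.index.tolist())
```

Whether to renumber is a judgment call; a unique index mainly matters if rows are later selected by label.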
#How many records do we have?
len(df)
693
#Preview top few rows
df.head()
 | added_by | apc_fun_url | apc_funding | can_deposit_be_waived | country | country_inclusive | creators | date | date_made_open | date_of_deposit | ... | rights_holding | rights_retention_waivable | source_of_policy | status_changed | title | type | uri | userid | volume | waive_open_access |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | EOS | NaN | not_mentioned | not_applicable | 250 | [un_geoscheme, 150, 155, 250] | NaN | NaT | not_mentioned | not_specified | ... | not_mentioned | not_applicable | administrative | 2014-12-15 22:09:31 | Agence National de la recherche (ANR) | article | http://roarmap.eprints.org/id/eprint/136 | 1 | NaN | not_specified |
1 | EOS | NaN | not_mentioned | not_specified | 250 | [un_geoscheme, 150, 155, 250] | NaN | NaT | publication | not_specified | ... | not_mentioned | not_applicable | administrative | 2014-12-15 22:09:32 | Agence National de la recherche (ANR) Humaniti... | article | http://roarmap.eprints.org/id/eprint/137 | 1 | NaN | not_specified |
2 | EOS | NaN | not_mentioned | no | 702 | [un_geoscheme, 142, 35, 702] | NaN | NaT | embargo | acceptance | ... | none | not_applicable | administrative | 2014-12-15 22:09:19 | Agency for Science, Technology & Research (A*S... | article | http://roarmap.eprints.org/id/eprint/44 | 236 | NaN | not_specified |
3 | EOS | NaN | additional_funding | no | 826 | [un_geoscheme, 150, 154, 826] | NaN | NaT | embargo | embarg | ... | none | not_applicable | administrative | 2014-12-15 22:10:19 | Arts & Humanities Research Council (AHRC) | article | http://roarmap.eprints.org/id/eprint/342 | 1 | NaN | not_specified |
4 | EOS | NaN | not_mentioned | not_specified | 36 | [un_geoscheme, 9, 53, 36] | NaN | NaT | other | other | ... | none | not_applicable | administrative | 2014-12-15 22:10:47 | Australian Research Council | article | http://roarmap.eprints.org/id/eprint/566 | 1 | NaN | not_specified |
5 rows × 57 columns
#Countries have a code rather than country name associated with them...
from bs4 import BeautifulSoup
#Grab web page containing country codes
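#Name the parser explicitly so the result does not depend on which parsers are installed
soup=BeautifulSoup(requests.get('http://roarmap.eprints.org/cgi/search/advanced').content, 'html.parser')
countries=soup.find('select',id='country').find_all('option')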
#generate lookup from country codes to country names
countryList={}
countryListZKey={}
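for country in countries:
    #Option values are the country codes; strip('.') removes dots around the label text
    countryList[country['value']]=country.text.strip('.')
    #Second lookup keyed without leading zeros, to match the codes in the data table
    countryListZKey[country['value'].lstrip('0')]=country.text.strip('.')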
#Add in country names from country codes
df['country2']=df['country'].apply(lambda x: countryListZKey[str(x)])
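The lookup above raises a `KeyError` if the table contains a code that is missing from the scraped list. A minimal sketch of a more forgiving lookup, using a made-up three-entry dictionary in the same shape as `countryListZKey`; `country_name` and the `'Unknown'` default are assumptions for illustration.

```python
# Toy lookup in the same shape as countryListZKey (code string -> country name)
lookup = {
    '36': 'Australia',
    '250': 'France',
    '826': 'United Kingdom of Great Britain and Northern Ireland',
}

def country_name(code, table=lookup, default='Unknown'):
    """Hypothetical helper: resolve an integer or string code, tolerating misses."""
    # Normalise to a string without leading zeros, as the lookup keys are
    key = str(code).lstrip('0')
    return table.get(key, default)

print(country_name(36))
print(country_name('036'))
print(country_name(999))
```

With the real lookup, something like `df['country'].apply(lambda x: country_name(x, countryListZKey))` would apply the same idea to the whole column.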
#Save data as a csv file
df.to_csv('roardata.csv')
#What I'd be tempted to do is load that data into RStudio and build a shiny app around it...
#Tutorial: http://shiny.rstudio.com/tutorial/
#Series.order() has been removed from pandas; sort_values() is the replacement
df.groupby('country2').size().sort_values(ascending=False)
country2
United States of America    126
United Kingdom of Great Britain and Northern Ireland    95
Italy    44
Australia    33
Turkey    30
Finland    28
Spain    27
Canada    26
Germany    22
Portugal    21
France    18
Belgium    17
Brazil    16
India    14
Sweden    11
Ukraine    11
Norway    9
Indonesia    8
Switzerland    8
Ireland    8
Netherlands    8
Denmark    8
South Africa    7
Lithuania    6
Peru    6
New Zealand    6
Japan    5
Kenya    5
Austria    5
Argentina    4
Russian Federation    4
China    4
Mexico    3
Colombia    3
Venezuela (Bolivarian Republic of)    3
Estonia    3
Slovenia    3
Singapore    3
Hungary    3
China, Hong Kong Special Administrative Region    3
Iceland    3
Czech Republic    3
Azerbaijan    2
Belarus    2
Croatia    2
Zimbabwe    2
Greece    2
Poland    2
Ghana    1
Viet Nam    1
Latvia    1
Luxembourg    1
Bolivia (Plurinational State of)    1
Nigeria    1
Pakistan    1
Saudi Arabia    1
Slovakia    1
Taiwan    1
Algeria    1
dtype: int64
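For a tally over a single column, `value_counts()` gives the same kind of descending count in one call. A small sketch on made-up values (not the ROARMAP data):

```python
import pandas as pd

# Made-up country labels standing in for the country2 column
countries = pd.Series(['Italy', 'Spain', 'Italy', 'Peru', 'Italy', 'Spain'],
                      name='country2')

# value_counts() counts each distinct value and sorts the counts in descending order
counts = countries.value_counts()
print(counts)
```

On the real table this would be `df['country2'].value_counts()`; the order of countries with equal counts may differ from the grouped version.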
df.groupby('rights_holding').size().sort_values(ascending=False)
rights_holding
not_mentioned     333
author_retains    171
none              136
author_grants      35
inst_retains       15
dtype: int64
df.columns
Index(['added_by', 'apc_fun_url', 'apc_funding', 'can_deposit_be_waived', 'country', 'country_inclusive', 'creators', 'date', 'date_made_open', 'date_of_deposit', 'datestamp', 'deposit_of_item', 'dir', 'documents', 'embargo_hum_soc', 'embargo_sci_tech_med', 'eprint_status', 'eprintid', 'gold_oa_options', 'id_number', 'iliege_hefce_model', 'issn', 'journal_article_version', 'last_revision', 'lastmod', 'locus_of_deposit', 'making_deposit_open', 'mandate_content_types', 'maximal_embargo_waivable', 'metadata_visibility', 'number', 'official_url', 'open_access_waivable', 'open_licensing_conditions', 'pagerange', 'policy_adoption', 'policy_colour', 'policy_comments', 'policy_effecive', 'policy_url', 'policymaker_name', 'policymaker_type', 'policymaker_url', 'publication', 'repository_url', 'research_process_comments', 'rev_number', 'rights_holding', 'rights_retention_waivable', 'source_of_policy', 'status_changed', 'title', 'type', 'uri', 'userid', 'volume', 'waive_open_access', 'country2'], dtype='object')
df.groupby('policy_colour').size().sort_values(ascending=False)
policy_colour
green    319
black    161
amber     49
blue      48
pink      29
red       19
dtype: int64
df.groupby(['policy_colour','country2']).size().sort_values(ascending=False)
policy_colour  country2
green  United States of America    60
       United Kingdom of Great Britain and Northern Ireland    51
black  United States of America    30
green  Finland    26
black  Italy    20
pink   United States of America    20
green  Spain    16
       Australia    16
black  Turkey    16
green  Portugal    15
       Brazil    15
       Germany    14
       Canada    14
black  Australia    13
green  Sweden    11
black  Canada    11
       United Kingdom of Great Britain and Northern Ireland    10
green  Italy    10
amber  Turkey    9
green  Ireland    7
       Belgium    7
       Switzerland    7
red    Italy    7
blue   United Kingdom of Great Britain and Northern Ireland    6
       France    6
green  India    5
black  India    5
amber  France    5
green  Denmark    5
       Norway    5
..
black  Venezuela (Bolivarian Republic of)    1
green  Czech Republic    1
blue   South Africa    1
green  Ghana    1
blue   Norway    1
       Nigeria    1
green  Hungary    1
red    Norway    1
blue   Kenya    1
       India    1
       Hungary    1
       China, Hong Kong Special Administrative Region    1
green  Japan    1
black  Viet Nam    1
green  Luxembourg    1
black  Denmark    1
green  Mexico    1
black  Switzerland    1
       Poland    1
green  Peru    1
       Russian Federation    1
black  Lithuania    1
       Latvia    1
       Japan    1
green  Argentina    1
       Ukraine    1
       Venezuela (Bolivarian Republic of)    1
pink   Australia    1
black  Finland    1
amber  Belarus    1
dtype: int64
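A long two-level listing like this can be hard to scan. One alternative is a cross-tabulation with one row per country and one column per policy colour. A minimal sketch on a made-up toy table, using the same column names as the real data:

```python
import pandas as pd

# Made-up rows with the same two columns as the real table
toy = pd.DataFrame({
    'policy_colour': ['green', 'green', 'black', 'green', 'pink'],
    'country2': ['Finland', 'Italy', 'Italy', 'Finland', 'Italy'],
})

# Rows are countries, columns are colours, cells are record counts (0 where absent)
table = pd.crosstab(toy['country2'], toy['policy_colour'])
print(table)
```

On the real table the call would be `pd.crosstab(df['country2'], df['policy_colour'])`.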
#plot() returns a matplotlib Axes object rather than a Figure, so name it ax
ax=df.groupby('rights_holding').size().sort_values(ascending=False).plot(kind='barh')
ax
<matplotlib.axes._subplots.AxesSubplot at 0x7fa0effadf28>