a whirlwind tour
Martin Fenner @mfenner
Ian Mulvany @ianmulvany
A DOI has the form 10.x/y, where x is a 4-5 digit number (10.x is the DOI prefix) and y (the suffix) can be almost any string.
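That prefix/suffix split can be illustrated in a few lines of Python (a sketch; the helper name is ours):

```python
def split_doi(doi):
    # a DOI is "prefix/suffix"; split on the first "/" only,
    # because the suffix itself may contain further slashes
    prefix, suffix = doi.split("/", 1)
    return prefix, suffix

print(split_doi("10.5555/12345679"))  # ('10.5555', '12345679')
```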
There are 9 different DOI registration agencies, and not all of them register scholarly content - the Publications Office of the European Union (OP) and the Entertainment Identifier Registry (EIDR), for example, do not.
Different services can be built on top of DOIs; the registration agency CrossRef, for example, was started to facilitate citation linking between many publishers.
Some of this functionality is provided as a service by the registration agency - metadata search, for example, is offered by CrossRef (http://search.crossref.org) and DataCite (http://search.datacite.org).
To find out what registration agency registered a DOI, use http://api.crossref.org/works/10.6084/m9.figshare.821213/agency
DOI names may be expressed as URLs (URIs) through an HTTP proxy server - e.g. http://dx.doi.org/10.5555/12345679 - and this is how DOIs are typically resolved. Because DOIs can be expressed as URLs, they work anywhere a URL does, such as in web browsers and HTTP client libraries.
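Both ideas - the proxy URL and the agency lookup - fit into a couple of helper functions (a sketch; the function names are ours, and actually fetching the URLs, e.g. with requests.get, needs network access and is left out):

```python
def doi_as_url(doi):
    # the dx.doi.org proxy turns a DOI name into a resolvable URL;
    # requesting it redirects to the publisher's landing page
    return "http://dx.doi.org/" + doi

def agency_lookup_url(doi):
    # CrossRef reports the registration agency for any DOI,
    # not only DOIs that CrossRef itself registered
    return "http://api.crossref.org/works/%s/agency" % doi

print(doi_as_url("10.5555/12345679"))
print(agency_lookup_url("10.6084/m9.figshare.821213"))
```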
-- a little bit magical
-- fast
-- decentralised
-- open source
-- opinionated
-- cryptic
-- powerful
from IPython.display import Image
i = Image(filename='images/8320552323_13cfe4b081_b.jpg')
f = Image(url='http://b.z19r.com/upload/did-you-just-tell-me-to-go-fuck-myself.jpg')
happy = Image(filename='images/5365366130_4ecf6c025f_b.jpg')
tools = Image(filename='images/ScienceToolbox_-_Open_science_software.png')
i
f
happy
Image CC-BY-NC-SA Ola Lindberg Flickr
# let's test embedding an HTML page into the presentation
from IPython.display import HTML
HTML('<iframe src="http://gitready.com" width=600 height=400></iframe>')
# http://gitready.com
HTML('<iframe src="https://mac.github.com" width=600 height=400></iframe>')
# https://mac.github.com
HTML('<iframe src="https://guides.github.com/activities/citable-code/" width=600 height=400></iframe>')
# https://guides.github.com/activities/citable-code/
tools
Work on visualization of scientific data should start with a good understanding of the best practices and pitfalls of data visualization in general, as well as the specific aspects of visualizing scientific data.
Excel, R, d3.js, Datawrapper, Prism, ...
You need to know at least the basics of data analysis to do proper data visualizations - e.g. how to handle wrongly formatted data (such as text in a number column), missing values and outliers.
Data analysis becomes much easier with a dedicated data analysis language such as R, Python or Julia.
The most time-consuming step in my experience is data transformation, i.e. bringing data into the format that you want for the analysis and visualization.
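As a small illustration of that transformation step, here is a sketch (pure Python, with hypothetical data) reshaping "wide" records - one row per article with one column per metric - into the "long" form many plotting tools prefer:

```python
# hypothetical example records, one dict per article ("wide" form)
wide = [
    {"doi": "10.1371/journal.pone.0000001", "views": 164, "citations": 0},
    {"doi": "10.1371/journal.pone.0000002", "views": 415, "citations": 2},
]

def wide_to_long(rows, id_key):
    # emit one {id, metric, value} row per metric column
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                long_rows.append({id_key: row[id_key],
                                  "metric": key,
                                  "value": value})
    return long_rows

long_form = wide_to_long(wide, "doi")
print(len(long_form))  # 2 articles x 2 metrics = 4 rows
```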
Bitmap graphic formats such as tiff, jpg and png are not appropriate for charts. Use vector formats such as svg or pdf, and make the data underlying the figure available.
We should become more creative with visualizing data in scholarly documents. These graphs often focus too much on detail rather than the overall message, don't take advantage of the different chart types available, and are sometimes even misleading.
One important step towards that goal is publishers accepting more reasonable file formats for figures submitted with manuscripts - instead of just tiff and eps, with a 10 MB file size limit.
HTML('<iframe src="http://ipython.org/notebook.html" width=600 height=400></iframe>')
# http://ipython.org/notebook.html
# https://github.com/ipython/ipython
[Plotting with Bokeh](http://bokeh.pydata.org/docs/quickstart.html#downloading)
http://nbviewer.ipython.org/github/ContinuumIO/bokeh-notebooks/blob/master/quickstart/quickstart.ipynb
http://nbviewer.ipython.org/gist/empet/eeb8bbe354e709bf590b
http://www.randalolson.com/2013/01/14/filling-in-pythons-gaps-in-statistics-packages-with-rmagic/
http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/rmagic_extension.ipynb
http://nbviewer.ipython.org/gist/yoavram/5280132
https://github.com/takluyver/IRkernel
https://github.com/JuliaLang/IJulia.jl
# get recent DOIs deposited by PLOS using the CrossRef API
import requests
url = "http://api.crossref.org/members/340/works?filter=from-update-date:2014-07-21,until-update-date:2014-07-24&rows=1000"
r = requests.get(url)
def get_dois_from_response(r):
    dois = []
    crossref_json = r.json()
    pub_items = crossref_json["message"]["items"]
    for item in pub_items:
        dois.append(item["DOI"])
    return dois
# extract the dois from the API response
dois = get_dois_from_response(r)
print (dois[0:3])
[u'10.1371/journal.pone.0102119.g004', u'10.1371/journal.pone.0102130.s001', u'10.1371/journal.pone.0102119.t001']
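The long filter URL above can also be assembled from its parts, which is less error-prone than editing the query string by hand - a sketch using the standard library, with the parameter values copied from the query above:

```python
try:
    from urllib import urlencode           # Python 2
except ImportError:
    from urllib.parse import urlencode     # Python 3

params = {
    "filter": "from-update-date:2014-07-21,until-update-date:2014-07-24",
    "rows": 1000,
}
# urlencode percent-escapes the ":" and "," inside the filter value
url = "http://api.crossref.org/members/340/works?" + urlencode(params)
print(url)
```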
def return_parent_doi(doi):
    # if the doi is a child doi, strip the tail and return the parent
    doi_parts = doi.split(".")
    main_doi = ".".join(doi_parts[0:4])
    return main_doi
# chop off the child parts of our dois
def get_parent_dois(dois):
    parent_dois = []
    for doi in dois:
        parent_dois.append(return_parent_doi(doi))
    return parent_dois
# just get the top-level DOIs
parent_dois = get_parent_dois(dois)
print (parent_dois[0:3])
[u'10.1371/journal.pone.0102119', u'10.1371/journal.pone.0102130', u'10.1371/journal.pone.0102119']
# get a unique set of dois, preserving order
def get_unique_dois(dois):
    unique_dois = []
    for doi in dois:
        if doi in unique_dois:
            continue
        else:
            unique_dois.append(doi)
    return unique_dois
unique_dois = get_unique_dois(parent_dois)
print (unique_dois[0:3])
[u'10.1371/journal.pone.0102119', u'10.1371/journal.pone.0102130', u'10.1371/journal.pone.0102076']
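The membership test in get_unique_dois scans the whole result list for every DOI; for larger lists the usual idiom is an order-preserving dedupe backed by a set (a sketch; the function name is ours):

```python
def unique_in_order(items):
    # keep the first occurrence of each item, preserving order;
    # the set makes each membership test O(1) instead of O(n)
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

print(unique_in_order(["a", "b", "a", "c", "b"]))  # ['a', 'b', 'c']
```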
# now we import an interface to the PLOS ALM app
import pyalm.pyalm as alm
In order to get the PLOS ALM API to work you need to register for an API key. You do this by creating an account at PLOS and then logging in to alm.plos.org. You then set the API key in the config file for pyalm, which is in pyalm/api_key.py.
test_article = alm.get_alm(str(unique_dois[12]), info="event", source="mendeley")
print test_article[0].sources["mendeley"].metrics.total
2
article_data = {}
for doi in unique_dois:
    try:
        article = alm.get_alm(str(doi), info="summary")
        article_mendeley = alm.get_alm(str(doi), info="event", source="mendeley")
        mendeley_metrics = article_mendeley[0].sources["mendeley"].metrics.total
        article_data[doi] = {"views": article[0].views,
                             "citations": article[0].citations,
                             "mendeley": mendeley_metrics}
    except:
        print "no data for ", doi
%matplotlib inline
values = article_data.values()
def get_views(x): return x["views"]
def get_cites(x): return x["citations"]
def get_mendeley(x): return x["mendeley"]
all_views = map(get_views, values)
all_cites = map(get_cites, values)
all_mendeley = map(get_mendeley, values)
print values[0:3]
print all_views[0:3]
print all_cites[0:3]
print all_mendeley[0:3]
[{'mendeley': 0, 'citations': 0, 'views': 164}, {'mendeley': 1, 'citations': 0, 'views': 415}, {'mendeley': 0, 'citations': 0, 'views': 127}] [164, 415, 127] [0, 0, 0] [0, 1, 0]
import matplotlib.pyplot as plt
import numpy as np
pos = np.array(range(len(all_views)))
fig, ax = plt.subplots(1, 1)
all_views.sort()  # sort in place; note list.sort() returns None
ax.bar(pos, all_views)
<Container object of 118 artists>
fig, ax = plt.subplots(1, 1)
ax.scatter(all_cites,all_views)
<matplotlib.collections.PathCollection at 0x112932250>
fig, ax = plt.subplots(1, 1)
ax.scatter(all_mendeley,all_views)
<matplotlib.collections.PathCollection at 0x10d390790>
# The code behind this presentation can be viewed on nbviewer
HTML('<iframe src="http://bit.ly/Wikimandia2014OpenScholarshopTools" width=600 height=400></iframe>')
HTML('<iframe src="http://londonopendrinks.org" width=600 height=400></iframe>')
# http://londonopendrinks.org