| Field | Value |
| --- | --- |
| Status | Active |
| Author | Nicholas Bollweg <nick.bollweg@gmail.com> |
| Created | November 11, 2014 |
| Updated | January 30, 2014 |
| Discussion | [link to the issue where the IPEP is being discussed](#tbd) |
| Implementation | [link to the PR](#tbd) |

JSON, and JSON-compatible data (e.g. ∅MQ), are already the de facto standard across the Ecosystem (IPython/Jupyter). With Notebook Format 4, some components of the broader system will only accept a de jure data representation, irrespective of whether it would have worked. The v4 schema represents a step forward in Ecosystem data: parts of the schema could be reused, explicitly stating that two parts of the system share some structure. This leaves the data consumable, but not implicitly understandable.
Additionally, a common output of Ecosystem tools is HTML, which is generally seen as the "end of the line" for the life of some data. This needn't be the case, as HTML is capable of portably and recoverably storing information about its provenance, assumptions and annotations.
This IPEP suggests, where possible, the reuse of data meanings throughout the Ecosystem, or:
Use Linked Data as a representation of first resort.
Each of these uses of JSON-compatible data in the Ecosystem must be understood individually:

- `bower.json`

The list will continue to grow as more things are moved from code to data: whether it is static widget data-at-rest, additional services related to multi-user content, etc.
As publicly-available notebooks contain an increasing amount of important data, code and findings, it follows that finding, organizing, and referencing them will become increasingly meaningful. This task can be significantly improved by leveraging existing linked data concepts, specifically those provided by widely-adopted vocabularies like foaf and schema.org.
Identifying those core Ecosystem concepts that already fit within such categories creates immediate value, whether in the data-at-rest (.ipynb), on the wire (message format) or in transformed formats (such as nbviewer output).
As more and better features for advanced user interfaces are implemented in different kernels, the amount of reproduced code will grow. Rich meaning at the data level will help give structure to cross-language constructs.
For example, adopting a rich meaning for fields within `WidgetModel` subclasses would allow for better reuse of domain concepts. Consider a date field:

- `xsd:date`, i.e. ISO-8601
- `schema:startDate`

As noted above, while Notebook Format 4 represents a significant move forward, some issues remain in the content contained therein: `.ipynb` files do not contain a reference to their schema, such as:

- a `$schema` key
- an HTTP header (`Content-Type` or `Link`)

As such, if found in the wild, a manual step of dereferencing `nbformat` to the source repository would be required to find documentation of its content. Community users of `master` have already discovered this.
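As a minimal sketch of what an in-document schema reference could look like, the helpers below read and attach a `$schema` key. The key is not part of Notebook Format 4, the URI is only an assumption borrowed from the identity example in this proposal, and a strict v4 validator may reject unknown top-level keys.

```python
# Assumed identifier for the v4 format; illustrative only, not known to be served.
ASSUMED_SCHEMA_URI = "http://ipython.org/ns/nbformat/v4"


def schema_reference(notebook):
    """Return the schema URI a notebook declares for itself, or None."""
    ref = notebook.get("$schema")
    return ref if isinstance(ref, str) else None


def with_schema_reference(notebook, schema_uri=ASSUMED_SCHEMA_URI):
    """Return a shallow copy of the notebook that names its own schema."""
    tagged = dict(notebook)
    tagged["$schema"] = schema_uri
    return tagged


# A notebook as found "in the wild": nothing points at its documentation.
bare = {"nbformat": 4, "nbformat_minor": 0, "metadata": {}, "cells": []}
tagged = with_schema_reference(bare)
```

With the key present, a consumer can go from the file straight to its documentation without knowing about the source repository.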
What is a notebook? Depending on whom you ask, it is at least:
Of these formats, some lend themselves to being enhanced with linked data more than others. Of particular note are:
Adoption of these formats represents the best way towards making the content contained in the body of knowledge created by the Community more broadly discoverable. Adoption can be gradual, with each successive step providing additional features and content.
Meaning, in this proposal, means that with the data alone or a suitable reference to the data, it can be interpreted unambiguously: `nbformat: 4` is not `execution_count: 4`. The core elements of meaning addressed in this proposal are:
The proposed means of reusing meaning is JSON-LD contexts. Contexts provide a modular, opt-in, potentially-out-of-band means for capturing the intent of a piece of data.
To claim that some JSON can be interpreted with meaning:

- `#/` can carry an `@context`, i.e. `#/@context`
- a `Link` header: `Link: <http://ipython.org/contexts/notebook.jsonld>; rel="http://www.w3.org/ns/json-ld#context"; type="application/ld+json"`
- `#/metadata` can carry an `@context`, i.e. `#/metadata/@context`
- `@context`, however it was defined

Is some piece of data addressable in an unambiguous way? Concretely, this means that a JSON object, either through its position in a document or through use of a keyword, has a Uniform Resource Identifier, or URI. While URIs share many characteristics with their more functional twins, URLs, URIs are not necessarily dereferenceable: this is actually beneficial, as they don't need to be migrated, hosted or otherwise maintained, just agreed upon.
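The options for claiming a context listed earlier (in the document root, in `metadata`, or out-of-band through a `Link` header) could be checked by a consumer roughly as follows. This is a sketch only: the function names and the lookup order are assumptions, and the `Link` parsing is deliberately simplified rather than a complete header parser.

```python
import re

JSONLD_CONTEXT_REL = "http://www.w3.org/ns/json-ld#context"

# Simplified patterns: a <target> followed by ;name="value" or ;name=value pairs.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]*))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]*))')

EXAMPLE_LINK = (
    '<http://ipython.org/contexts/notebook.jsonld>; '
    'rel="http://www.w3.org/ns/json-ld#context"; '
    'type="application/ld+json"'
)


def context_from_link_header(header_value):
    """Return the target of the first JSON-LD context link, or None."""
    for target, raw_params in LINK_RE.findall(header_value):
        params = {}
        for name, quoted, unquoted in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or unquoted
        # rel may hold several space-separated relation types.
        if JSONLD_CONTEXT_REL in params.get("rel", "").split():
            return target
    return None


def find_context(notebook, link_header=None):
    """Look for a context in-band first, then out-of-band (assumed order)."""
    if "@context" in notebook:                  # #/@context
        return notebook["@context"]
    metadata = notebook.get("metadata", {})
    if "@context" in metadata:                  # #/metadata/@context
        return metadata["@context"]
    if link_header:                             # Link: <...>; rel="...#context"
        return context_from_link_header(link_header)
    return None
```

For instance, `find_context({"cells": []}, link_header=EXAMPLE_LINK)` yields the context URL from the header, while a notebook with neither an embedded context nor a header yields `None`.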
Examples of things in the Ecosystem that could benefit from having explicit identity:
http://ipython.org/ns/nbformat/v4
nbf:4
Does some piece of data share properties with other pieces of data? Type is the convenient bundling of possible properties that help a user make sense of data they find.
Examples of things in the Ecosystem that could benefit from having explicit type:
While the Community is diverse and multi-lingual, the software, development process and documentation that represent the Ecosystem remain anglophilic and pythonic.
Users of the notebook, however, are already using it to publish content in many languages, both machine and natural. While kernel agnosticism is one of the current challenges, natural language agnosticism will eventually become a feature, as publishing can occur in all manner of languages. Instead of inventing new syntax for capturing this, we can adopt a single, standards-based representation.
JSON-LD provides a means for using consistent codes for:
application/ld+json
```json
{
  "@context": {
    "@language": "en"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "Some English\n"
      ]
    }
  ]
}
```
application/ld+json
```json
{
  ...
  {
    "cell_type": "markdown",
    "source": [
      {
        "@language": "en",
        "@value": [
          "Some English\n"
        ]
      },
      {
        "@language": "de",
        "@value": [
          "Etwas Deutsch\n"
        ]
      }
    ]
  },
  ...
}
```
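To illustrate how a consumer might use these language tags, the sketch below picks the source text for one language out of a cell. The helper name is hypothetical, and the list-valued `@value` simply mirrors the example above (a standard JSON-LD processor would expect a scalar `@value`).

```python
def source_in_language(cell, language, default_language=None):
    """Return the cell source tagged with `language`, or None if absent.

    `default_language` stands in for a document-wide "@language" declared
    in the context; untagged source lines are assumed to be in it.
    """
    source = cell.get("source", [])
    if isinstance(source, str):
        source = [source]

    # Explicitly tagged entries: {"@language": ..., "@value": ...}
    for entry in source:
        if isinstance(entry, dict) and entry.get("@language") == language:
            value = entry.get("@value", [])
            return value if isinstance(value, str) else "".join(value)

    # Untagged lines fall back to the document default language.
    untagged = [line for line in source if isinstance(line, str)]
    if untagged and default_language == language:
        return "".join(untagged)
    return None


plain_cell = {"cell_type": "markdown", "source": ["Some English\n"]}
tagged_cell = {
    "cell_type": "markdown",
    "source": [
        {"@language": "en", "@value": ["Some English\n"]},
        {"@language": "de", "@value": ["Etwas Deutsch\n"]},
    ],
}
```

A front-end could then call `source_in_language(tagged_cell, "de")` to render the German text, and fall back to the default language when no tagged entry matches.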
`@context` for Notebook Format 4

As part of this discussion, the following lightweight context was proposed:
application/ld+json
```json
{
  "@context": {
    "@vocab": "http://ipython.org/nbformat/v4/",
    "nb4": "http://ipython.org/nbformat/v4/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "language": {"@type": "@id"},
    "codemirror_mode": {"@type": "@id"},
    "cell_type": {"@id": "@type"},
    "output_type": {"@id": "@type"},
    "cells": {"@container": "@list"},
    "source": {"@container": "@list"},
    "outputs": {"@container": "@list"},
    "text": {"@container": "@list"},
    "traceback": {"@container": "@list"},
    "tags": {"@container": "@set"},
    "collapsed": {"@type": "xsd:boolean"},
    "execution_count": {"@type": "xsd:int"},
    "nbformat_minor": {"@type": "xsd:int"},
    "nbformat": {"@type": "xsd:int"},
    "signature": {"@type": "foaf:sha1"},
    "image/svg+xml": {"@container": "@list"},
    "image/png": {"@container": "@list"},
    "text/html": {"@container": "@list"},
    "text/plain": {"@container": "@list"},
    "application/javascript": {"@container": "@list"}
  }
}
```
Serving this as an out-of-band context for the raw notebook REST service and `nbviewer` download would immediately create some value.
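To make the effect of such a context concrete, the following sketch resolves a few notebook keys against a subset of the context above. It is a heavy simplification of JSON-LD term expansion (no nested contexts, no relative IRIs), and the helper names are hypothetical. A conforming processor such as PyLD would be the better choice in practice.

```python
# A subset of the proposed context, copied from the listing above.
NB4_CONTEXT = {
    "@vocab": "http://ipython.org/nbformat/v4/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "cell_type": {"@id": "@type"},
    "cells": {"@container": "@list"},
    "execution_count": {"@type": "xsd:int"},
}


def expand_iri(value, context):
    """Expand a keyword, compact IRI (prefix:suffix) or bare term."""
    if value.startswith("@"):
        return value  # JSON-LD keyword, e.g. "@type"
    prefix, sep, suffix = value.partition(":")
    if sep and isinstance(context.get(prefix), str) and not suffix.startswith("//"):
        return context[prefix] + suffix  # compact IRI such as xsd:int
    if sep:
        return value  # assumed to be an absolute IRI already
    return context.get("@vocab", "") + value  # bare term: use the vocabulary


def describe_term(term, context):
    """Summarize what the context says about one JSON key."""
    definition = context.get(term, {})
    if isinstance(definition, str):
        definition = {"@id": definition}
    datatype = definition.get("@type")
    return {
        "iri": expand_iri(definition.get("@id", term), context),
        "datatype": expand_iri(datatype, context) if datatype else None,
        "container": definition.get("@container"),
    }
```

Under these assumptions, `describe_term("execution_count", NB4_CONTEXT)` reports the IRI `http://ipython.org/nbformat/v4/execution_count` with datatype `http://www.w3.org/2001/XMLSchema#int`, which is what distinguishes it from `nbformat` even when both hold the value `4`.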
`JSONLDExporter` for `nbconvert`

Being able to export a self-contained and self-describing, machine-readable document from a notebook would be a good step in enabling downstream use of the data stored in notebooks.
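The core of such an export can be sketched without any `nbconvert` machinery: embed a context (and optionally an identifier) in the notebook JSON. The function name is hypothetical, and the default context URL is taken from the `Link` header example in this proposal, so it is an assumption that anything is served there. Wiring this into an actual `nbconvert` exporter class is left out.

```python
import json

# Context location from the Link header example; assumed, not known to exist.
ASSUMED_CONTEXT_URL = "http://ipython.org/contexts/notebook.jsonld"


def notebook_to_jsonld(notebook, context=ASSUMED_CONTEXT_URL, doc_id=None):
    """Serialize a notebook dict as JSON with an embedded @context."""
    document = {"@context": context}
    if doc_id is not None:
        document["@id"] = doc_id  # gives the exported notebook an identity
    for key, value in notebook.items():
        # setdefault keeps the @context/@id chosen here over any in the input.
        document.setdefault(key, value)
    return json.dumps(document, indent=1)


exported = notebook_to_jsonld(
    {"nbformat": 4, "nbformat_minor": 0, "metadata": {}, "cells": []},
    doc_id="urn:example:notebook-1",  # illustrative identifier
)
```

Passing a full context object instead of a URL as `context` would make the exported document self-contained, at the cost of repeating the context in every file.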
Beyond the baseline of what could be captured with a lightweight notebook `@context`, a dedicated exporter could provide additional advantages:

- `@context`, making the meaning in the documents unambiguous and portable

`--include-rdfa` to `HTMLExporter`

As a separate output of the notebook export process, HTML can contain in-line Linked Data attributes, using the RDFa notation. Thus, when a cell or image is exported, metadata could be provided to make the document content more understandable to external agents.
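As a rough illustration of in-line RDFa, the sketch below renders a single cell with the RDFa Lite attributes `vocab`, `typeof` and `property`. The function name and the choice of markup (`div`/`pre`) are assumptions, not what `HTMLExporter` produces. The vocabulary URL is the `@vocab` from the context proposed in this document.

```python
from html import escape


def cell_to_rdfa(cell, vocab="http://ipython.org/nbformat/v4/"):
    """Render one notebook cell as HTML annotated with RDFa Lite attributes."""
    cell_type = cell.get("cell_type", "cell")
    source = cell.get("source", "")
    if not isinstance(source, str):
        source = "".join(source)  # nbformat stores source as a list of lines
    return (
        '<div vocab="{vocab}" typeof="{cell_type}">'
        '<pre property="source">{source}</pre>'
        "</div>"
    ).format(
        vocab=escape(vocab),
        cell_type=escape(cell_type),
        source=escape(source),  # keep cell text from being parsed as markup
    )


html_fragment = cell_to_rdfa(
    {"cell_type": "markdown", "source": ["Some <b>English</b>\n"]}
)
```

An RDFa-aware agent reading this fragment could recover that the element is a `markdown` cell from the notebook vocabulary and that the `pre` content is its `source`, without knowing anything about the page layout.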
`nbviewer`

As `nbconvert` drives what `nbviewer` can display, a natural step would be to advertise and provide notebooks from across the web in a linked data format:

- http://nbviewer.ipython.org/as/jsonld/github/ipython/ipython/blob/2.x/examples/Index.ipynb
- http://nbviewer.ipython.org/as/rdfa/github/ipython/ipython/blob/2.x/examples/Index.ipynb

`metadata` as a compacted JSON-LD document

In the canonical Notebook front-end (the JavaScript UI), cell and notebook `metadata` is still the "wild west". "Tamed" with an explicit context, and treating the output of `metadata` as a compacted JSON-LD document, `metadata` could become the
Slideshow metadata:
```json
{
  "@context": {
    "slides": "http://ipython.org/formats#slides"
  },
  "cells": [
    {
      "metadata": {
        "slides:type": "slides:slide"
      }
    }
  ]
}
```
_This specification defines RO Bundle, a ZIP-based file format that bundles resources which when aggregated form an identifiable conceptual work; say a collection of datasets resulting from a scientific experiment, or a gathering of logs and outputs from a particular command line execution._