Status Active
Author Nicholas Bollweg <[email protected]>
Created November 11, 2014
Updated January 30, 2014
Discussion [link to the issue where the IPEP is being discussed](#tbd)
Implementation [link to the PR](#tbd)

Abstract

JSON, and JSON-compatible data (e.g. ∅MQ), are already the de facto standard across the Ecosystem (IPython/Jupyter). With Notebook Format 4, some components of the broader system will only accept a de jure data representation, irrespective of whether it would have worked. The v4 schema represents a step forward in Ecosystem data: parts of the schema could be reused, explicitly stating that two parts of the system share some structure. This leaves the data consumable, but not implicitly understandable.

Additionally, HTML is a common output of Ecosystem tools, and is generally seen as the "end of the line" for the life of some data. This needn't be the case, as HTML is capable of portably and recoverably storing information about its provenance, assumptions and annotations.

This IPEP suggests, where possible, the reuse of data meanings throughout the Ecosystem, or:

Use Linked Data as a representation of first resort.

Background

JSON Islands

Each of these uses of JSON-compatible data in the ecosystem must basically be individually understood.

The list will continue to grow as more things are moved from code to data: whether it is static widget data-at-rest, additional services related to multi-user content, etc.

Motivation

Discoverability

As publicly-available notebooks contain an increasing amount of important data, code and findings, it follows that finding, organizing, and referencing them will become increasingly meaningful. This task can be significantly improved by leveraging existing linked data concepts, specifically those provided by widely-adopted vocabularies like foaf and schema.org.

Identifying those core Ecosystem concepts that already fit within such categories creates immediate value, whether in the data-at-rest (.ipynb), on the wire (message format) or in transformed formats (such as nbviewer output).

Cross-kernel implementation

As more and better features for advanced user interfaces are implemented in different kernels, the amount of reproduced code will grow. Rich meaning at the data level will help give structure to cross-language constructs.

For example, adopting a rich meaning for fields within WidgetModel subclasses would allow for better reuse of domain concepts. Consider a date field:

  • at its simplest, it could be simply said to have a specific format, xsd:date, i.e. ISO-8601
  • a stronger concept could even specify that it is a schema:startDate.
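
As a rough sketch of the weaker of the two claims above, a consumer in any kernel could check that a field value is a valid xsd:date lexical form before trusting it. The function name `is_xsd_date` is hypothetical, and the check only covers the plain YYYY-MM-DD form, not the optional timezone suffix that xsd:date also permits.

```python
import re
from datetime import date

# Plain calendar date, e.g. "2014-11-11" (no timezone suffix).
_XSD_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_xsd_date(value):
    """Return True if value is a string naming a real YYYY-MM-DD date."""
    if not isinstance(value, str):
        return False
    match = _XSD_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible dates such as month 13 or Feb 30.
        date(year, month, day)
    except ValueError:
        return False
    return True
```

A stronger claim such as schema:startDate would sit on top of this check: the format says how to parse the value, while the property says what the value is for.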

IPEP 17: Not far enough?

As noted above, while Notebook Format 4 represents a significant move forward, some issues remain in the content it describes: .ipynb files do not contain a reference to their schema, whether:

  • explicitly, e.g. a $schema key
  • implicitly, e.g. served with an HTTP header (e.g. Content-Type or Link)

As such, if found in the wild, a manual step of dereferencing nbformat to the source repository would be required to find documentation of its content. Community users of master have already discovered this.

Linked Data Formats

What is a notebook? Depending on whom you ask, it is at least:

  • to a desktop user, a .ipynb file on their hard drive
  • to the contents manager, a JSON-compatible object
  • to nbconvert, a static HTML document
  • to nbviewer, a URL a user can click on and share

Of these formats, some lend themselves to being enhanced with linked data more than others. Of particular note are:

  • HTML: RDFa is a W3C standard for inline linked data annotation of an HTML tree
  • JSON: JSON-LD is a W3C standard for inline or out-of-band annotation of a JSON document

Adoption of these formats represents the best way towards making the content contained in the body of knowledge created by the Community more broadly discoverable. Adoption can be gradual, with each successive step providing additional features and content.

Linked Data Concepts

Meaning, in this proposal, means that with the data alone or a suitable reference to the data, it can be unambiguously interpreted: nbformat: 4 is not execution_count: 4. The core elements of meaning addressed in this proposal are:

Context

The proposed means of reusing meaning is JSON-LD contexts. Contexts provide a modular, opt-in, potentially-out-of-band means for capturing the intent of a piece of data.

To claim that some JSON can be interpreted with meaning:

  • a JSON document, i.e. #/, can
    • include a @context, i.e. #/@context
    • be served or embedded with a Link header:
      Link: <http://ipython.org/contexts/notebook.jsonld>; rel="http://www.w3.org/ns/json-ld#context"; type="application/ld+json"
  • any JSON object, i.e. #/metadata, can
    • include a @context, i.e. #/metadata/@context
      • this will override any parent @context, however it was defined
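
The lookup rules above can be sketched as a small function that walks from the document root down to an object and reports which @context applies there. The name `effective_context` and the `link_context` parameter (standing in for a context already extracted from a Link header) are assumptions made for illustration. The sketch follows the override rule as worded above; a conforming JSON-LD processor would, more precisely, layer the nested context's term definitions on top of the inherited ones rather than discard the parent wholesale.

```python
def effective_context(document, path, link_context=None):
    """Return the @context that applies at `path`, or None.

    `path` is a sequence of keys and list indexes from the root.
    `link_context` is a context advertised out-of-band (assumption:
    the caller has already read it from the Link header).
    """
    # Collect every node from the root down to the target.
    nodes = [document]
    for step in path:
        nodes.append(nodes[-1][step])

    context = link_context
    # The deepest object that declares its own @context wins.
    for node in nodes:
        if isinstance(node, dict) and "@context" in node:
            context = node["@context"]
    return context
```

For example, `effective_context(notebook, ["metadata"])` returns the context declared at #/metadata/@context if there is one, and otherwise falls back to #/@context or the out-of-band context.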

Identity

Is some piece of data addressable in an unambiguous way? Concretely, this means that a JSON object, either through its position in a document, or through use of a keyword, has a Uniform Resource Identifier, or URI. While URIs share many characteristics with their more functional twins, URLs, URIs are not necessarily dereferenceable: this is actually beneficial, as they don't need to be migrated, hosted or otherwise maintained, just agreed upon.

Examples of things in the Ecosystem that could benefit from having explicit identity:

  • Notebook Format 4
    • canonical URI: http://ipython.org/ns/nbformat/v4
    • compact URI: nbf:4
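
To illustrate how the compact and canonical forms relate, here is a minimal sketch of prefix expansion. The prefix table is hypothetical: the mapping for "nbf" is inferred from the single example above and is not an agreed-upon namespace, and `expand_curie` is an illustrative name rather than an existing API.

```python
# Hypothetical prefix table, inferred from the nbf:4 example.
PREFIXES = {"nbf": "http://ipython.org/ns/nbformat/v"}


def expand_curie(value, prefixes=PREFIXES):
    """Expand a compact URI such as "nbf:4" using a prefix table."""
    prefix, sep, suffix = value.partition(":")
    # Leave absolute URIs ("http://...") and unknown prefixes alone.
    if not sep or suffix.startswith("//") or prefix not in prefixes:
        return value
    return prefixes[prefix] + suffix
```

Because the URI only needs to be agreed upon, not hosted, the expansion is a pure string operation and requires no network access.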

Type

Does some piece of data share properties with other pieces of data? Type is the convenient bundling of possible properties that help a user make sense of data they find.

Examples of things in the Ecosystem that could benefit from having explicit type:

Natural Language

While the Community is diverse and multi-lingual, the software, development process and documentation that represent the Ecosystem remain anglophilic and pythonic.

Users of the notebook, however, are already using it to publish content in many languages, both machine and natural. While kernel agnosticism is one of the current challenges, natural language agnosticism will eventually become a feature, as publishing can occur in all manner of languages. Instead of inventing new syntax for capturing this, we can adopt a single, standards-based representation.

JSON-LD provides a means for using consistent codes for:

  • specifying the default language of a document
    application/ld+json
      {
        "@context": {
          "@language": "en"
        },
        "cells": [
          {
            "cell_type": "markdown",
            "source": [
              "Some English\n"
            ]
          }
        ]
      }
  • much more invasively, storing multiple natural language representations in the same key
    application/ld+json
      {
        ...
        {
          "cell_type": "markdown",
          "source": [
            {
              "@language": "en",
              "@value": [
                "Some English\n"
              ]
            },
            {
              "@language": "de",
              "@value": [
                "Etwas Deutsch\n"
              ]
            }
          ]
        },
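
A consumer of either representation above might select text by language along these lines. This is a sketch under stated assumptions: `text_for_language` is a hypothetical helper, and it accepts the list-valued @value exactly as written in the second example, without judging whether a JSON-LD processor would accept that shape.

```python
def text_for_language(source, language, default_language=None):
    """Pick the lines of `source` written in `language`, or None.

    `source` is either a plain list of strings, whose language is the
    document-wide `default_language`, or a list of objects carrying
    "@language" and "@value" keys.
    """
    if all(isinstance(item, str) for item in source):
        # Plain strings inherit the document's default language.
        return list(source) if language == default_language else None
    for item in source:
        if isinstance(item, dict) and item.get("@language") == language:
            return list(item["@value"])
    return None
```

With the second example, asking for "de" yields the German lines, while asking for a language that is not present yields None so the caller can choose a fallback.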

Implementation Challenges & Opportunities

Upstream Dependency Support

Highlighting

New Support Libraries

While it is unnecessary to de-reference a JSON-LD document into a more graph-like form, at times it will be useful. Having one of the known JSON-LD implementations available in the Ecosystem languages will be key to developing useful features that make use of Linked Data.

Interpretation

Roadmap

A lightweight @context for Notebook Format 4

As part of this discussion, the following lightweight context was proposed:

application/ld+json
{
  "@context": {
    "@vocab": "http://ipython.org/nbformat/v4/",
    "nb4": "http://ipython.org/nbformat/v4/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "language": {"@type": "@id"},
    "codemirror_mode": {"@type": "@id"},
    "cell_type": {"@id": "@type"},
    "output_type": {"@id": "@type"},
    "cells": {"@container": "@list"},
    "source": {"@container": "@list"},
    "outputs": {"@container": "@list"},
    "text": {"@container": "@list"},
    "traceback": {"@container": "@list"},
    "tags": {"@container": "@set"},
    "collapsed": {"@type": "xsd:boolean"},
    "execution_count": {"@type": "xsd:int"},
    "nbformat_minor": {"@type": "xsd:int"},
    "nbformat": {"@type": "xsd:int"},
    "signature": {"@type": "foaf:sha1"},
    "image/svg+xml": {"@container": "@list"},
    "image/png": {"@container": "@list"},
    "text/html": {"@container": "@list"},
    "text/plain": {"@container": "@list"},
    "application/javascript": {"@container": "@list"}
  }
}

Serving this as an out-of-band context for the raw notebook REST service and nbviewer download would immediately create some value.
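
One way a service might advertise the context out-of-band is by attaching a Link header of the form shown in the Context section. The sketch below only builds the header value; `context_link_header` is a hypothetical helper, and how the header is attached depends on the web framework in use.

```python
# Link relation used by JSON-LD to point at an out-of-band context.
JSONLD_CONTEXT_REL = "http://www.w3.org/ns/json-ld#context"


def context_link_header(context_url):
    """Build the Link header value that points at a JSON-LD context."""
    return '<{}>; rel="{}"; type="application/ld+json"'.format(
        context_url, JSONLD_CONTEXT_REL
    )
```

Calling it with http://ipython.org/contexts/notebook.jsonld reproduces the header value from the Context section, leaving the notebook body itself untouched.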

Create a JSONLDExporter for nbconvert

Being able to export a self-contained and -describing, machine-readable document from a notebook would be a good step in enabling downstream use of the data stored in notebooks.

Beyond the baseline of what could be captured with a lightweight notebook @context, a dedicated exporter could provide additional advantages:

  • embed the full @context, making the meaning in the documents unambiguous and portable
  • extract links in Markdown cells
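
Link extraction could start as simply as the following sketch, which is not part of nbconvert and uses a hypothetical function name. It only recognizes inline links of the form [text](url); reference-style links and bare URLs are ignored, and image references would also match because they share the same bracket syntax.

```python
import re

# Inline Markdown links: [text](url). Deliberately simplistic.
_INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_links(notebook):
    """Return (text, url) pairs found in the notebook's Markdown cells."""
    links = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # The source may be one string or a list of line strings.
        text = source if isinstance(source, str) else "".join(source)
        links.extend(_INLINE_LINK.findall(text))
    return links
```

A real exporter would more likely reuse the Markdown parser it already depends on, but the resulting pairs are the raw material for emitting link relations in the exported document.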

Add --include-rdfa to HTMLExporter

Either as a separate output of the notebook export process or as an option on the existing HTML export, HTML can contain in-line Linked Data attributes, using the RDFa notation. Thus, when a cell or image is exported, metadata could be provided to make the document content more understandable to external agents.
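
As an illustration of what such annotated output might look like, the sketch below renders one cell as an HTML fragment using the RDFa attributes vocab, typeof and property. The markup, the function name `cell_to_rdfa`, and the choice of the lightweight context's vocabulary URL are all assumptions; this is not the template nbconvert uses.

```python
from html import escape


def cell_to_rdfa(cell, vocab="http://ipython.org/nbformat/v4/"):
    """Render a cell as an HTML fragment carrying RDFa attributes.

    Assumption: cell_type doubles as the RDFa type and "source" as
    the property name, mirroring the lightweight notebook context.
    """
    source = cell.get("source", "")
    text = source if isinstance(source, str) else "".join(source)
    return (
        '<div vocab="{vocab}" typeof="{cell_type}">'
        '<pre property="source">{text}</pre>'
        "</div>"
    ).format(
        vocab=escape(vocab, quote=True),
        cell_type=escape(cell["cell_type"], quote=True),
        # Escape the cell text so code containing < or & stays valid HTML.
        text=escape(text),
    )
```

An RDFa-aware agent reading this fragment could recover that the element describes a code cell and what its source was, without any knowledge of the notebook's CSS classes.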

Expose Linked Data in nbviewer

As nbconvert drives what nbviewer can display, a natural step would be to advertise and provide notebooks from across the web in a linked data format:

http://nbviewer.ipython.org/as/jsonld/github/ipython/ipython/blob/2.x/examples/Index.ipynb
http://nbviewer.ipython.org/as/rdfa/github/ipython/ipython/blob/2.x/examples/Index.ipynb
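
The URL scheme above amounts to prefixing an existing nbviewer path with /as/ and a format name. A minimal sketch of that rewrite follows; the route is the hypothetical one from the examples, not something nbviewer is known to implement, and `linked_data_url` is an illustrative name.

```python
from urllib.parse import urlsplit, urlunsplit


def linked_data_url(nbviewer_url, fmt):
    """Insert the proposed /as/<fmt> prefix into an nbviewer URL."""
    parts = urlsplit(nbviewer_url)
    # Keep scheme, host, query and fragment; only the path changes.
    path = "/as/{}{}".format(fmt, parts.path)
    return urlunsplit(parts._replace(path=path))
```

Keeping the original path intact after the prefix means every notebook nbviewer can already render has a predictable linked data address.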

FUTURE: Treat notebook metadata as a compacted JSON-LD document

In the canonical Notebook front-end (the JavaScript UI), cell and notebook metadata is still the "wild west". "Tamed" with an explicit context, and treating the output of metadata as a compacted JSON-LD document, metadata could become the

Example

Slideshow metadata:

{
  "@context": {
    "slides": "http://ipython.org/formats#slides"
  },
  "cells": [
    {
      "metadata": {
        "slides:type": "slides:slide"
      }
    }
  ]
}
  • Research Object Bundle

    This specification defines RO Bundle, a ZIP-based file format that bundles resources which when aggregated form an identifiable conceptual work; say a collection of datasets resulting from a scientific experiment, or a gathering of logs and outputs from a particular command line execution.