Notebook Profiles¶

This coding notebook explores several techniques to support the static profiling of Jupyter notebooks as texts, reporting on various metrics, including:

notebook size (markdown and code line counts);
readability scores;
reading time estimates;
code complexity and maintainability.

The motivating context was a tool for generating summary reports on the estimated workload associated with 100 or so notebooks over 25 or so directories (1 directory / 4 notebooks per week) for a third year undergraduate equivalent Open University course on data management and analysis.

Previous notebook recipes include generating simple reports that pull out headings from notebooks to act as notebook summaries (eg Get Contents). Such recipes may provide a useful component in a notebook quality report if the report is also intended to provide a summary / overview of notebooks. (It might be most useful to offer heading summaries as an option in a notebook profiling report?)

Tools supporting the profiling of one or more notebooks across one or more directories and the generation of simple statistics over them are also provided.

The profiler is also capable of running simple health checks over a notebook, for example reporting on:

whether code cells have been executed, and if so, whether code cell execution is complete and in linear order;
packages / modules loaded into the notebook;
unused code items in a notebook (for example, modules loaded but not used).
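
As a minimal sketch of the first of these checks, the execution state of a notebook can be read from the `execution_count` field that the notebook format stores on each code cell. The function name `execution_order_report` and the sample cells are hypothetical, and "linear order" is assumed here to mean counts that run 1, 2, 3, ... with no gaps or repeats:

```python
def execution_order_report(cells):
    """Report whether all code cells were executed, and in linear order.

    `cells` is a list of notebook cell dicts (nbformat v4 layout).
    """
    counts = [c.get('execution_count') for c in cells
              if c.get('cell_type') == 'code']
    executed = [n for n in counts if n is not None]
    all_executed = bool(counts) and len(executed) == len(counts)
    # Assumption: linear order means the counts run 1, 2, 3, ... without gaps
    linear = all_executed and executed == list(range(1, len(executed) + 1))
    return {'n_code_cells': len(counts),
            'n_executed': len(executed),
            'all_executed': all_executed,
            'linear_order': linear}

# Hypothetical cells: the second code cell was run before the first
cells = [{'cell_type': 'markdown', 'source': '# Title'},
         {'cell_type': 'code', 'execution_count': 2, 'source': 'a = 1'},
         {'cell_type': 'code', 'execution_count': 1, 'source': 'b = 2'}]
report = execution_order_report(cells)
```

A stricter or looser definition of "linear" (for example, allowing counts that start above 1 but still increase) may suit some courses better.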

Currently, code profiling is only applied to code that appears in code cells, not code that is quoted or described in markdown cells.

There is a potential for making IPython magics for some of the reporting functions (for example, radon or wily reports) to provide live feedback / reporting during the creation of content in a notebook.

Notebooks¶

In the first instance, reports are generated for code cell inputs and markdown cells; code outputs and raw cells are not considered. Code appearing in markdown cells is identified as code-like but not analysed in terms of code complexity etc.

For each markdown cell, we can generate a wide range of simple text document statistics. Several packages exist to support such analyses (for example, textstat, readability) but the focus in this notebook will be on metrics derived using the spacy underpinned textacy package for things like readability metrics. Several simple custom metrics are also suggested.

For code in code cells, the radon package is used to generate code metrics, with additional packages providing further simple metrics.

A test notebook is provided (Notebook_profile_test.ipynb) against which we can test various elements of this notebook.

Potential Future Work¶

In terms of analysing cell outputs (not covered as yet), reports could be generated on the sorts of asset that appear to be displayed in each cell output, whether code warnings or errors are raised, etc. There is also potential for running in association with something like nbval to test that notebooks test correctly against previously run cell outputs.
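
One possible starting point for an output report, sketched under the assumption that we only want to tally the `output_type` values recorded against each code cell (`stream`, `display_data`, `execute_result` and `error` in the v4 notebook format). The function name `output_type_counts` and the sample cells are illustrative only:

```python
from collections import Counter

def output_type_counts(cells):
    """Tally the output types attached to the code cells in a notebook."""
    counts = Counter()
    for cell in cells:
        if cell.get('cell_type') != 'code':
            continue
        for output in cell.get('outputs', []):
            counts[output.get('output_type', 'unknown')] += 1
    return counts

# Hypothetical cells, using the nbformat v4 dict layout
cells = [
    {'cell_type': 'code', 'outputs': [
        {'output_type': 'stream', 'name': 'stderr', 'text': 'a warning'},
        {'output_type': 'execute_result', 'data': {'text/plain': '1'}}]},
    {'cell_type': 'code', 'outputs': [
        {'output_type': 'error', 'ename': 'NameError',
         'evalue': '', 'traceback': []}]},
    {'cell_type': 'markdown', 'source': 'No outputs here'},
]
counts = output_type_counts(cells)
```

A non-zero `error` count, or `stream` outputs named `stderr`, would be natural candidates for flagging in a health report.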

We might also explore the extent to which interactive notebook profiling tools, such as magics or notebook extensions, could be used to support the authoring of new instructional notebooks.

We might also ask to what extent interactive notebook profiling tools might be used to support learners working through instructional material and reflecting on their work. Code health metrics, such as the cell execution success measure used by nbgallery, may provide clues regarding which code activity cells students struggled to get working, for example. By looking at statistics across students (for example, in assessment notebooks with cell execution success log monitoring enabled) we may be able to identify "healthy" or "unhealthy" activities; for example, a healthy activity is one in which students can get their code to run within one or two tries, whereas an unhealthy activity is one where they make repeated attempts to get the code to work as they desire.

The notebook profiler should also be runnable against notebooks created using Jupytext from markdown rendered from OU-XML. It would probably make more sense to build a custom OU-XML profiler, eg one that could perhaps draw on a summary XML doc generated from OU-XML source docs using XSLT. I'll try to bear in mind creating reporting functions that might be usable in this wider sense. (OU-XML will also have things like a/v components, and may have explicit time guidance on expected time spent on particular activities.)

Settings¶

The following parameters are used notebook wide in the generation of reports.

In [1]:

READING_RATE = 100 # words per minute
# What is a sensible reading rate for undergraduate level academic teaching material?
# 250 wpm gives a rate of 15,000 wph
# 10,000 wph corresponds to about 170 words per minute
# OU guidance: 35 wpm for challenging texts, 70 wpm for medium texts, 120 wpm for easy texts

CODE_READING_RATE = 35 # tokens per minute -- UNUSED

CODE_LINE_READING_TIME = 1 # time in seconds to read a code line

LINE_WIDTH = 160 #character width of a line of markdown text; used to calculate "screen lines"

CODE_CELL_REVIEW_TIME = 5 # nominal time in seconds to run each code cell / review each code cell output

CELL_SKIP_TIME = 1 # nominal time in seconds to move from one cell to the next

Open Notebook¶

Open and read a notebook, such as the associated test notebook:

In [2]:

TEST_NOTEBOOK = 'Notebook_profile_test.ipynb'

import nbformat
with open(TEST_NOTEBOOK,'r') as f:
    nb = nbformat.reads(f.read(), as_version=4)

Analyse Markdown Cells¶

Iterate through markdown cells and generate cell by cell reports.

We can start off by generating some simple counts for a single notebook.

Let's preview the contents of a single cell:

In [3]:

nb.cells[0]

Out[3]:

{'cell_type': 'markdown',
 'metadata': {},
 'source': '# Test Notebook for Notebook Profiler\n\nThis notebook provides a test case for the notebook profiler.\n\nIt includes a range of markdown and code cells intended to test various features of the profiler.\n\nNote that this notebook does not necessarily run...'}

We can look at just the markdown component associated with a markdown cell - this will be the basis for our markdown text analysis.

In [4]:

txt = nb.cells[0]['source']

Estimates of reading time are often based on word count estimates. The Medium website uses a reading time estimator that also factors in the presence of images in a text as well as word count / sentence length. The readtime package uses the Medium reading time estimation algorithm to give a reading time estimate.

?? TO DO - more on the reading time equation; also need something like maybe: +10s for every code cell to run it and look at output? Different reading time per line of code?

It might be worth looking at forking this reading time estimator and trying to factor in reading time elements that reflect the presence of code. Or maybe use a slower reading rate for code? Or factor in code complexity? The presence of links might also affect reading time.

In [5]:

#https://github.com/alanhamlett/readtime
#%pip install readtime

import readtime
import math

rt = readtime.of_markdown(txt, wpm=READING_RATE).delta.total_seconds()

#Round up on the conversion of estimated reading time in seconds, to minutes...
f'Reading time in seconds: {rt}; in minutes: {math.ceil(rt/60)}.'

Out[5]:

'Reading time in seconds: 25.0; in minutes: 1.'

The spacy natural language processing package provides a wide range of basic tools for parsing texts.

In [6]:

#%pip install spacy
import spacy

#Check we have the small English model at least
SPACY_LANG_MODEL = 'en_core_web_sm'

try:
    import en_core_web_sm
except ImportError:
    import spacy.cli
    spacy.cli.download(SPACY_LANG_MODEL)

#Load a model that a text is parsed against
nlp = spacy.load(SPACY_LANG_MODEL)

To call on spacy, we need to create a tokenised document representation of the text (conveniently, the original text version is also stored as part of the object).

In [7]:

doc = nlp(txt)

The textacy package builds on spacy to provide a range of higher level tools and statistics, from simple statistics such as word and sentence counts to more complex readability scores using a variety of readability measures.

One way of using readability measures would be to set reading rates dynamically for each markdown cell based on calculated readability scores.

In [8]:

#%pip install textacy
from textacy import TextStats

ts = TextStats(doc)
ts.basic_counts, ts.readability_stats

Out[8]:

({'n_sents': 4,
  'n_words': 40,
  'n_chars': 203,
  'n_syllables': 63,
  'n_unique_words': 27,
  'n_long_words': 15,
  'n_monosyllable_words': 25,
  'n_polysyllable_words': 6},
 {'flesch_kincaid_grade_level': 6.895,
  'flesch_reading_ease': 63.440000000000026,
  'smog_index': 10.125756701596842,
  'gunning_fog_index': 10.0,
  'coleman_liau_index': 11.080711825000005,
  'automated_readability_index': 7.47325,
  'lix': 47.5,
  'gulpease_index': 68.25,
  'wiener_sachtextformel': 6.5195})
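
As a rough sketch of how a readability score might set the reading rate dynamically, the following maps a Flesch reading ease score onto the three OU guidance rates quoted in the settings (35, 70 and 120 words per minute). The threshold values (60 and 30) and the function names are assumptions made for illustration, not calibrated figures:

```python
def reading_rate_for(flesch_reading_ease, easy_above=60, hard_below=30):
    """Pick a words-per-minute rate from a Flesch reading ease score.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    if flesch_reading_ease >= easy_above:
        return 120  # easy text
    if flesch_reading_ease < hard_below:
        return 35   # challenging text
    return 70       # medium text

def reading_time_s(n_words, flesch_reading_ease):
    """Estimate the reading time in seconds at the selected rate."""
    return 60 * n_words / reading_rate_for(flesch_reading_ease)

# Using the counts reported for the first test cell (40 words, ease ~63.44)
rate = reading_rate_for(63.44)
seconds = reading_time_s(40, 63.44)
```

With these thresholds the first test cell would be treated as "easy" text; whether that is appropriate for teaching material is an open question.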

The textacy package can also pull out notable features in a text, such as key terms or acronyms, both of which may be useful as part of a notebook summary.

In [9]:

#Extract keyterms
import textacy.ke
textacy.ke.textrank(doc, normalize="lemma", topn=10)

Out[9]:

[('notebook profiler', 0.08196495093971548),
 ('test case', 0.06744856661263204),
 ('Test Notebook', 0.06479107591582292),
 ('code cell', 0.05486312750180375),
 ('markdown', 0.024974748258550644),
 ('feature', 0.023809657889882128),
 ('range', 0.022746625242650347)]

In [10]:

from textacy.extract import acronyms_and_definitions
acronyms_and_definitions(doc)

Out[10]:

{}

As well as using measures provided by off-the-shelf packages, it's also useful to define some simple metrics of our own that don't appear in other packages.

To start with, let's try to estimate the notebook length as it appears on screen by calculating how many "screen lines" a markdown cell is likely to take up. This can be calculated by splitting long lines of text over multiple lines based on a screen line width parameter.

In [11]:

import textwrap

def _count_screen_lines(txt, width=LINE_WIDTH):
    """Count the number of screen lines that a markdown cell takes up."""
    ll = txt.split('\n\n')
    _ll = []
    for l in ll:
        #Model screen flow: split a line if it is more than `width` characters long
        _ll=_ll+textwrap.wrap(l, width)
    n_screen_lines = len(_ll)
    return n_screen_lines
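
To illustrate how the wrapping behaves, here is a compact, self-contained restatement of the same approach applied to a made-up two paragraph string. The name `count_screen_lines` and the sample text are illustrative only:

```python
import textwrap

def count_screen_lines(txt, width=160):
    """Count wrapped lines, treating blank-line separated chunks as paragraphs."""
    return sum(len(textwrap.wrap(para, width)) for para in txt.split('\n\n'))

# A 50 word paragraph (about 250 characters) followed by a short paragraph
sample = 'word ' * 50 + '\n\n' + 'short paragraph'
n_lines = count_screen_lines(sample, width=160)
```

Note that `textwrap.wrap()` replaces single newlines with spaces by default, so items in a markdown list that are separated by single line breaks are likely to be undercounted by this approach.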

In [81]:

screen_txt='As well as "text", markdown cells may contain cell blocks. The following is a basic report generator for summarising key statistical properties of code blocks. (We will see later an alternative way of calculating such metrics for well form Python code at least.)'
_count_screen_lines(screen_txt)

Out[81]:

The textacy package does not appear to provide average sentence length statistics (although sentence length metrics may play a role in calculating readability scores, so there may be usable functions somewhere in there), but we can straightforwardly define our own.

In [13]:

import statistics

def sentence_lengths(doc):
    """Generate elementary sentence length statistics."""
    s_mean = None
    s_median = None
    s_stdev = None
    s_lengths = []
    for sentence in doc.sents:
        #Punctuation elements are tokens in their own right; remove these from sentence length counts
        s_lengths.append(len( [tok.text for tok in sentence if tok.pos_ != "PUNCT"]))
        
    if s_lengths:
        #If we have at least one measure, we can generate some simple statistics
        s_mean = statistics.mean(s_lengths)
        s_median = statistics.median(s_lengths)
        s_stdev = statistics.stdev(s_lengths) if len(s_lengths) > 1 else 0
        
    return s_lengths, s_mean, s_median, s_stdev

The sentence statistics are generated from a spacy doc object and returned as separate statistics.

In [14]:

s_lengths, s_mean, s_median, s_stdev = sentence_lengths(doc)
print(s_lengths, s_mean, s_median, s_stdev)

[7, 11, 18, 8] 11 9.5 4.96655480858378

As well as "text", markdown cells may contain code blocks. The following is a basic report generator for summarising key statistical properties of code blocks. (We will see later an alternative way of calculating such metrics, for well formed Python code at least.)

In [15]:

def _code_block_summarise(lines,
                          n_blank_code_lines = 0,
                          n_single_line_comment_code_lines = 0,
                          n_code_lines = 0):
    
    lines = lines.splitlines() if isinstance(lines, str) else lines
    
    #if lines[0].startswith('%%'): 
        ##block magic - we could detect which?
        #This would let us report on standard block magic such as %%bash
        #as well as custom magic such as %%sql
    for l in lines:
        if not l.strip():
            n_blank_code_lines = n_blank_code_lines + 1
        elif l.strip().startswith(('#')): #Also pattern match """.+""" and '''.+'''
            n_single_line_comment_code_lines = n_single_line_comment_code_lines + 1
            #How should we detect block comments?
        #elif l.strip().startswith(('!')):
            ## IPython shell command
        #elif l.startswith('%load_ext'):
            ##Import some magic - we could detect which?
        else:
            n_code_lines = n_code_lines + 1
    return n_blank_code_lines, n_single_line_comment_code_lines, n_code_lines
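
The commented-out branches above hint at detecting IPython magics and shell commands. A minimal sketch of one way to do that, classifying each line by its leading characters, is shown below. The function name `classify_code_line` and the category labels are hypothetical, and the approach is naive: for example, a continuation line of a multi-line Python expression that happens to start with `%` would be misclassified as a magic.

```python
from collections import Counter

def classify_code_line(line):
    """Classify a single code cell line by its leading characters."""
    stripped = line.strip()
    if not stripped:
        return 'blank'
    if stripped.startswith('%%'):
        return 'cell_magic'   # eg %%bash or %%sql
    if stripped.startswith('%'):
        return 'line_magic'   # eg %load_ext sql
    if stripped.startswith('!'):
        return 'shell'        # eg !ls
    if stripped.startswith('#'):
        return 'comment'
    return 'code'

# Hypothetical code cell lines
lines = ['%load_ext sql', '!ls', '#A comment', '', 'x = 1']
kinds = Counter(classify_code_line(l) for l in lines)
```

When a cell starts with a cell magic such as `%%sql`, the remaining lines are not Python, so a fuller version would probably need to treat the whole cell differently.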

We can use the code block summary in a more general report on "features" within a markdown cell (sentence statistics are handled elsewhere):

In [16]:

def _report_md_features(txt):
    """Report on features in markdown documents.
        For example, number of headings or paragraphs, or code block analysis."""
    n_headers = 0
    n_paras = 0
    n_total_code_lines = 0
    n_code_lines = 0
    n_blank_code_lines = 0
    n_single_line_comment_code_lines = 0

    in_code_block = False
    
    n_screen_lines = _count_screen_lines(txt)
                        
    #Markdown processor ignores whitespace at start and end of a markdown cell
    txt = txt.strip()
    
    n_code_blocks = 0
    
    #We will see how to improve the handling of code blocks in markdown cells later
    for l in txt.split('\n'):
        if l.strip().startswith('```'):
            in_code_block = not in_code_block
            if in_code_block:
                n_code_blocks = n_code_blocks + 1
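        elif in_code_block:
            n_total_code_lines = n_total_code_lines + 1
            #Note: ''.splitlines() returns [], so a completely empty line passed as a string
            #is not counted as a blank code line by _code_block_summarise()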
            n_blank_code_lines, n_single_line_comment_code_lines, \
                n_code_lines = _code_block_summarise(l,
                                                     n_blank_code_lines,
                                                     n_single_line_comment_code_lines,
                                                     n_code_lines)
        elif l.startswith('#'):
            #Markdown heading
            n_headers = n_headers + 1
        elif not l.strip():
            #A paragraph is identified by a double end of line (\n\n) outside a code block
            #So if we have an empty line that signifies a paragraph break?
            n_paras = n_paras + 1
            
    n_code = (n_total_code_lines, n_code_lines, \
              n_blank_code_lines, n_single_line_comment_code_lines)
    
    return n_headers, n_paras, n_screen_lines, n_code_blocks, n_code

So for example, the features we can report on might include the number of headings, paragraphs, screen lines, or code block features.

In [17]:

n_headers, n_paras, n_screen_lines, n_code_blocks, n_code = _report_md_features(txt)
n_headers, n_paras, n_screen_lines, n_code_blocks, n_code

Out[17]:

(1, 3, 4, 0, (0, 0, 0, 0))

In [18]:

(n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines) = n_code
n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines

Out[18]:

(0, 0, 0, 0)

Code Blocks in Markdown Cells¶

A question arises when we have code blocks appearing in markdown cells. How should these be treated? Should we report the code toward markdown counts, or should we separately treat the code, discounting it from markdown word counts but reporting it as "code in markdown"?

Another approach might be to include any blocks of code appearing in markdown cells as part of the markdown word count, but provide an additional report identifying how many lines of code appeared as part of the markdown.

The excode package provides an easy way of grabbing code blocks from markdown text, so we might be able to use that to improve the handling of code blocks inside markdown cells.

Let's grab a simple test case of some markdown containing some code blocks:

In [19]:

mc = nb.cells[2]['source']
print(mc)

This cell contains two code blocks.

Here's one:

```python
import pandas

#Create a dataframe
df = pd.DataFrame()
```

and here's another:

```python
import pandas

#Create a dataframe
df = pd.DataFrame()
```

So that's two...

Let's see if we can extract those code blocks...

In [20]:

#%pip install excode
import excode
import io

#excode seems to expect a file buffer...
excode.extract(io.StringIO(mc))

Out[20]:

['import pandas\n\n#Create a dataframe\ndf = pd.DataFrame()\n',
 'import pandas\n\n#Create a dataframe\ndf = pd.DataFrame()\n']
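
If we also want to discount the code from the markdown word count, one option is to separate the prose from the fenced blocks ourselves. The following is a sketch using a regular expression rather than excode; the function name `split_prose_and_code` is hypothetical, and the pattern assumes simple, unindented, backtick-fenced blocks:

```python
import re

FENCE = '`' * 3  # three backticks, built up to keep this example easy to embed

# Match from an opening fence line to the next closing fence line
FENCE_RE = re.compile(
    r'^' + FENCE + r'[^\n]*\n(.*?)^' + FENCE + r'[ \t]*$',
    re.MULTILINE | re.DOTALL)

def split_prose_and_code(md):
    """Return the markdown with fenced blocks removed, and a list of the blocks."""
    code_blocks = FENCE_RE.findall(md)
    prose = FENCE_RE.sub('', md)
    return prose, code_blocks

# A made-up markdown cell containing a single code block
md = ("Here's one:\n\n" + FENCE + "python\nimport pandas\n\n"
      "df = pandas.DataFrame()\n" + FENCE + "\n\nSo that's one...")
prose, blocks = split_prose_and_code(md)
n_prose_words = len(prose.split())
```

The prose string could then be passed to the text statistics functions, and the code blocks to the code reports, so that each is counted only once.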

We can now report on the structure of code blocks in markdown cells more directly:

In [21]:

def code_block_report(c):
    r"""Generate simple code report when passed a list of code lines
        or a string containing multiple `\n` separated code lines."""
    
    n_total_code_lines = 0
    n_code_lines = 0
    n_blank_code_lines = 0
    n_single_line_comment_code_lines = 0
    
    #We won't count leading or lagging empty lines as code lines...
    lines = c.strip().splitlines() if isinstance(c, str) else c
    
    #If first or last line is empty, strip it
    if len(lines) > 1:
        lines = lines[1:] if not lines[0].strip() else lines
        lines = lines[:-1] if not lines[-1].strip() else lines
    
    #print(lines)
    
    n_total_code_lines = len(lines)
    
    n_blank_code_lines, n_single_line_comment_code_lines, \
        n_code_lines = _code_block_summarise(lines,
                                             n_blank_code_lines,
                                             n_single_line_comment_code_lines,
                                             n_code_lines)
    
    return (n_total_code_lines, n_blank_code_lines,\
            n_single_line_comment_code_lines, n_code_lines)

Running the above function should generate some simple code statistics:

In [22]:

print(mc)
for c in excode.extract(io.StringIO(mc)):
    (n_total_code_lines, n_blank_code_lines, \
         n_single_line_comment_code_lines, n_code_lines) = code_block_report(c)
    
    print(n_total_code_lines, n_blank_code_lines, \
          n_single_line_comment_code_lines, n_code_lines )

This cell contains two code blocks.

Here's one:

```python
import pandas

#Create a dataframe
df = pd.DataFrame()
```

and here's another:

```python
import pandas

#Create a dataframe
df = pd.DataFrame()
```

So that's two...
4 1 1 2
4 1 1 2

We could also use the radon code analyser (which does count empty lines as code lines unless we explicitly strip them).

However, it should be noted that the radon code analysis relies on well formed Python code that can be loaded into the Python AST parser. This means that radon cannot analyse code that doesn't parse as valid Python, either because it contains an error or because the code is not actually Python code (for example, in course materials we make use of SQL block magic to allow us to write SQL code in a code cell).

The radon parser will also report an error if it comes across IPython line or cell magic code, or ! prefixed shell commands.

We will see later how we can start to cleanse a code string of IPython ! and % prefixed directives when we consider parsing code cells.
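
In the meantime, a simple guard is to check whether a code string parses before handing it to an analyser. The sketch below uses the standard library `ast` module directly; the helper name `python_parses` is an assumption for illustration and is not part of radon:

```python
import ast

def python_parses(code):
    """Return True if `code` parses as Python, False otherwise."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True

# Plain Python parses; IPython magics and SQL cell magic content do not
plain_ok = python_parses('import pandas as pd\ndf = pd.DataFrame()')
magic_ok = python_parses('%matplotlib inline\nimport time')
sql_ok = python_parses('%%sql\nSELECT * FROM TABLE;')
```

Code strings that fail the check could be reported using the simpler line-based counts instead.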

In [23]:

#%pip install radon
from radon.raw import analyze
for c in excode.extract(io.StringIO(mc)):
    r = analyze(c.strip())
    print(r)
r.loc, r.lloc, r.sloc, r.comments, r.multi, r.blank, r.single_comments

Module(loc=4, lloc=2, sloc=2, comments=1, multi=0, blank=1, single_comments=1)
Module(loc=4, lloc=2, sloc=2, comments=1, multi=0, blank=1, single_comments=1)

Out[23]:

(4, 2, 2, 1, 0, 1, 1)

We can bundle up the radon analyzer to make it a little easier to call for our purposes:

In [24]:

def r_analyze(c):
    """Analyse a code string using radon.analyze."""
    r = analyze(c.strip())
    n_total_code_lines = r.loc
    n_blank_code_lines = r.blank
    n_single_line_comment_code_lines = r.comments
    n_code_lines = r.sloc
    return (n_total_code_lines, n_blank_code_lines, \
            n_single_line_comment_code_lines, n_code_lines)

We can then simply call the r_analyze() function with a code string:

In [25]:

for c in excode.extract(io.StringIO(mc)):
    (n_total_code_lines, n_blank_code_lines, \
         n_single_line_comment_code_lines, n_code_lines) = r_analyze(c)
    
    print(n_total_code_lines, n_blank_code_lines, \
          n_single_line_comment_code_lines, n_code_lines)

4 1 1 2
4 1 1 2

Code Reading (and Execution) Time¶

It would be useful if we had a heuristic for code reading time.

One approach would be to tokenise the code and estimate reading time from a simple "tokens per minute" reading rate, or use a reading rate appropriate for "difficult" text. Another approach might be to try to make use of code complexity scores and code length.

A pragmatic way may just be to estimate based on lines of code, with a nominal reading time allocated to each line of code.

In [26]:

def code_reading_time(n_code_lines, n_single_line_comment_code_lines, line_time=CODE_LINE_READING_TIME):
    """Crude reading time estimate for a code block."""
    code_reading_time = line_time * (n_code_lines + n_single_line_comment_code_lines)
    return code_reading_time
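
For comparison, the "tokens per minute" alternative could be sketched using the standard library `tokenize` module. Which token types to count is a judgement call; here layout tokens and comments are skipped, which is an assumption rather than an established convention, and the function names are hypothetical:

```python
import io
import tokenize

CODE_READING_RATE = 35  # tokens per minute, as in the settings cell

# Token types that are layout rather than content (comments skipped by choice)
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT, tokenize.DEDENT,
        tokenize.ENDMARKER, tokenize.COMMENT}

def count_code_tokens(code):
    """Count the content tokens in a string of Python code."""
    tokens = tokenize.generate_tokens(io.StringIO(code).readline)
    return sum(1 for tok in tokens if tok.type not in SKIP)

def token_reading_time_s(code, rate=CODE_READING_RATE):
    """Estimate code reading time in seconds from a token count."""
    return 60 * count_code_tokens(code) / rate

n_tokens = count_code_tokens('df = pd.DataFrame()\n')
seconds = token_reading_time_s('df = pd.DataFrame()\n')
```

Like radon, the tokenizer expects Python source, so code containing magics or incomplete statements may raise an error and would need cleaning first.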

The way we currently process code in markdown cells, it will be timed at the standard reading rate. It may be appropriate to add a simple modifier that also adds a "code reading overhead" to the reading time based on the amount of code in a markdown cell.

For code in code cells, rather than code blocks in markdown cells, it might also be worth exploring code execution time, that is, an overhead associated with running each code cell. A crude way of calculating this would be to levy a fixed amount of time to account for running the code cell and inspecting the result. A more considered approach would look to cell profiling / execution time logs and code cell outputs in a run notebook.
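
A sketch of the crude fixed-overhead approach, reusing the values from the settings cell, might look like the following. The function name `code_cell_time_s` and the way the terms are combined are assumptions; markdown reading time is deliberately left out:

```python
CODE_LINE_READING_TIME = 1   # seconds to read a line of code
CODE_CELL_REVIEW_TIME = 5    # seconds to run a code cell and review its output
CELL_SKIP_TIME = 1           # seconds to move from one cell to the next

def code_cell_time_s(cells):
    """Estimate the time in seconds spent on code cells and cell navigation."""
    total = 0
    for cell in cells:
        total += CELL_SKIP_TIME
        if cell.get('cell_type') == 'code':
            # Only count non-blank lines as lines to be read
            lines = [l for l in cell.get('source', '').splitlines() if l.strip()]
            total += len(lines) * CODE_LINE_READING_TIME + CODE_CELL_REVIEW_TIME
    return total

# Hypothetical notebook cells, using the nbformat v4 dict layout
cells = [{'cell_type': 'markdown', 'source': 'Some text'},
         {'cell_type': 'code',
          'source': 'import pandas as pd\n\ndf = pd.DataFrame()'}]
total = code_cell_time_s(cells)
```

Here the estimate is 1 second for moving past the markdown cell, plus 1 + 2 + 5 seconds for the code cell.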

Custom Report Aggregator¶

For convenience, we can bundle up the custom metrics we have created into a function that returns a single report object.

In [27]:

import math

def process_extras(doc):
    """Generate a dict containing additional metrics."""
    
    n_headers, n_paras, n_screen_lines, n_code_blocks, n_code = _report_md_features(doc.text)
    s_lengths, s_mean, s_median, s_stdev = sentence_lengths(doc)
    (n_total_code_lines, n_code_lines, n_blank_code_lines, n_single_line_comment_code_lines) = n_code
    
    _reading_time = readtime.of_markdown(doc.text, wpm=READING_RATE).delta.total_seconds()
    #Add reading time overhead for code
    line_of_code_overhead = 1 #time in seconds to add to reading of each code line
    _reading_time = _reading_time + code_reading_time(n_code_lines, n_single_line_comment_code_lines,
                                                     line_of_code_overhead)
    
    extras = {'n_headers':n_headers,
              'n_paras':n_paras,
              'n_screen_lines':n_screen_lines,
              's_lengths':s_lengths,
              's_mean':s_mean,
              's_median':s_median,
              's_stdev':s_stdev,
              'n_code_blocks':n_code_blocks,
              'n_total_code_lines':n_total_code_lines,
              'n_code_lines':n_code_lines,
              'n_blank_code_lines':n_blank_code_lines,
              'n_single_line_comment_code_lines':n_single_line_comment_code_lines,
              'reading_time_s':_reading_time,
              'reading_time_mins': math.ceil(_reading_time/60),
              'mean_sentence_length': s_mean,
              'median_sentence_length': s_median,
              'stdev_sentence_length': s_stdev,
              #The following are both listy, so we need to handle them when we move to a dataframe
              # TO DO  - parameterise the number of key terms
              'keyterms':textacy.ke.textrank(doc, normalize="lemma", topn=10),
              'acronyms':acronyms_and_definitions(doc)
             }
    return extras

Running the process_extras() function on a doc object returns the extra metrics as keyed items in a single dict:

In [28]:

ts.basic_counts, ts.readability_stats, process_extras(doc)

Out[28]:

({'n_sents': 4,
  'n_words': 40,
  'n_chars': 203,
  'n_syllables': 63,
  'n_unique_words': 27,
  'n_long_words': 15,
  'n_monosyllable_words': 25,
  'n_polysyllable_words': 6},
 {'flesch_kincaid_grade_level': 6.895,
  'flesch_reading_ease': 63.440000000000026,
  'smog_index': 10.125756701596842,
  'gunning_fog_index': 10.0,
  'coleman_liau_index': 11.080711825000005,
  'automated_readability_index': 7.47325,
  'lix': 47.5,
  'gulpease_index': 68.25,
  'wiener_sachtextformel': 6.5195},
 {'n_headers': 1,
  'n_paras': 3,
  'n_screen_lines': 4,
  's_lengths': [7, 11, 18, 8],
  's_mean': 11,
  's_median': 9.5,
  's_stdev': 4.96655480858378,
  'n_code_blocks': 0,
  'n_total_code_lines': 0,
  'n_code_lines': 0,
  'n_blank_code_lines': 0,
  'n_single_line_comment_code_lines': 0,
  'reading_time_s': 25.0,
  'reading_time_mins': 1,
  'mean_sentence_length': 11,
  'median_sentence_length': 9.5,
  'stdev_sentence_length': 4.96655480858378,
  'keyterms': [('notebook profiler', 0.08196495093971548),
   ('test case', 0.06744856661263204),
   ('Test Notebook', 0.06479107591582292),
   ('code cell', 0.05486312750180375),
   ('markdown', 0.024974748258550644),
   ('feature', 0.023809657889882128),
   ('range', 0.022746625242650347)],
  'acronyms': {}})

Generate a Whole Notebook Markdown Report¶

The whole notebook report can come in various flavours:

top level summary statistics that merge all the markdown content into a single cell and then analyse that;
aggregated cell level statistics that summarise the statistics calculated for each markdown cell separately;
individual cell level statistics that report the statistics for each cell separately.

Whilst the individual cell level statistics presented in a textual fashion may be overkill, it may be useful to generate visual displays of a notebook that graphically summarise its structure.

Top-Level Summary¶

Let's start with a report that munges all the markdown text together and reports on that...

In [29]:

def process_notebook_full_md(nb):
    """Given a notebook, return all the markdown cell content as a single spacy doc,
        and all the code cell content as a single string."""
    txt = []
    code = []
    for cell in nb.cells:
        if cell['cell_type']=='markdown':
            txt.append(cell['source'])
        elif cell['cell_type']=='code':
            code.append( cell['source'])

    doc = nlp('\n\n'.join(txt))
    code = '\n\n'.join(code)
    
    return doc, code

The process_notebook_full_md() function takes a notebook object and returns two items: a spacy doc built from all the notebook's markdown cell content, and a string containing all its code cell content.

In [30]:

full_doc, full_code = process_notebook_full_md(nb)
full_doc.text[:280], full_code[:250]

Out[30]:

('# Test Notebook for Notebook Profiler\n\nThis notebook provides a test case for the notebook profiler.\n\nIt includes a range of markdown and code cells intended to test various features of the profiler.\n\nNote that this notebook does not necessarily run...\n\n## Markdown Cells With Cod',
 '# This is a code cell\nimport pandas\n\n#Create a dataframe\ndf = pd.DataFrame()\n\n# This is a code cell with a magic...\n\n%matplotlib inline\nimport time\n\ndef fn():\n    """How is the docstring handled?"""\n    pass\n\n%load_ext sql\n\n%%sql\nSELECT * FROM TABLE;')

Let's make things a bit more tabular in our reporting:

In [31]:

import pandas as pd

def process_notebook_md_doc(doc):
    ts = TextStats(doc)
    return pd.DataFrame([{'text':doc.text,
                          **ts.basic_counts, **ts.readability_stats, **process_extras(doc)}])

Running the process_notebook_md_doc() function on a doc object returns a single row dataframe containing summary statistics calculated over the full markdown content of the notebook.

In [32]:

process_notebook_md_doc(full_doc)

Out[32]:

	text	n_sents	n_words	n_chars	n_syllables	n_unique_words	n_long_words	n_monosyllable_words	n_polysyllable_words	flesch_kincaid_grade_level	...	n_code_lines	n_blank_code_lines	n_single_line_comment_code_lines	reading_time_s	reading_time_mins	mean_sentence_length	median_sentence_length	stdev_sentence_length	keyterms	acronyms
0	# Test Notebook for Notebook Profiler\n\nThis ...	15	119	499	159	49	26	89	8	3.270387	...	6	0	3	69.0	2	8.733333	8	5.417784	[(single code block, 0.05399890062211835), (co...	{}

1 rows × 37 columns

Summarised Cell Level Reporting¶

For the summarised cell level reporting, generate measures on a per cell basis and then calculate summary statistics over those.

In [72]:

def process_notebook_md(nb, fn=''):
    """Process all the markdown cells in a notebook."""
    cell_reports = []
    
    for i, cell in enumerate(nb.cells):
        if cell['cell_type']=='markdown':
            _metrics = process_notebook_md_doc( nlp( cell['source'] ))
            _metrics['cell_count'] = i
            _metrics['cell_type'] = 'md'
            cell_reports.append(_metrics)
    
    #DataFrame.append() was removed in pandas 2.0; collect the rows and concatenate once
    cell_reports = pd.concat(cell_reports, sort=False) if cell_reports else pd.DataFrame()
    cell_reports['filename'] = fn
    cell_reports.reset_index(drop=True, inplace=True)
    return cell_reports

Processing a single notebook returns a dataframe with one row per markdown cell with each metric reported in its own column.

In [73]:

total_report = process_notebook_md(nb)
total_report.head(3)

Out[73]:

	text	n_sents	n_words	n_chars	n_syllables	n_unique_words	n_long_words	n_monosyllable_words	n_polysyllable_words	flesch_kincaid_grade_level	...	reading_time_s	reading_time_mins	mean_sentence_length	median_sentence_length	stdev_sentence_length	keyterms	acronyms	cell_count	cell_type
0	# Test Notebook for Notebook Profiler\n\nThis ...	4	40	203	63	27	15	25	6	6.895000	...	25.0	1	11.00	9.5	4.966555	[(notebook profiler, 0.08196495093971548), (te...	{}	0	md
1	## Markdown Cells With Code Blocks\n\nThis cel...	4	30	123	38	22	5	23	1	2.281667	...	18.0	1	8.25	5.5	8.261356	[(single code block, 0.09825762538579677), (Ma...	{}	1	md
2	This cell contains two code blocks.\n\nHere's ...	8	49	173	58	23	6	41	1	0.766097	...	28.0	1	6.50	6.0	4.105745	[(code block, 0.052250174985765105), (import p...	{}	2	md

3 rows × 40 columns

It is trivial to create summary statistics directly from the per cell report table by aggregating over rows associated with the same notebook; in this case, we can find the total readtime as a simple sum.

However, more generally we may wish to apply the aggregation over a set of grouped results (for example, in a dataframe containing metrics from multiple notebooks, we would want to group by each notebook and then perform the aggregation on the measures associated with each notebook).
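
A grouped aggregation of this kind could be sketched with pandas as follows. The dataframe contents and filenames are made up for illustration; the column names follow the per cell report, and named aggregation requires pandas 0.25 or later:

```python
import pandas as pd

# Hypothetical per cell report rows from two notebooks
cell_reports = pd.DataFrame({
    'filename': ['a.ipynb', 'a.ipynb', 'b.ipynb'],
    'reading_time_s': [25.0, 18.0, 28.0],
    'n_words': [40, 30, 49],
})

# One row per notebook, with totals and a count of markdown cells
notebook_summary = cell_reports.groupby('filename').agg(
    reading_time_s=('reading_time_s', 'sum'),
    n_words=('n_words', 'sum'),
    n_md_cells=('n_words', 'size'),
)
```

Summing the reading time in seconds and converting to minutes afterwards may be preferable to summing `reading_time_mins`, because the per cell minutes are each rounded up to a whole minute before they are added.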

In [35]:

total_report['reading_time_mins'].sum()

Out[35]:

Let's also create a function to profile a notebook from a file:

In [36]:

def process_notebook_file(fn):
    """Grab cell level statistics across a whole notebook."""
    
    with open(fn,'r') as f:
        try:
            nb = nbformat.reads(f.read(), as_version=4)
            cell_reports = process_notebook_md(nb, fn=fn)
        except Exception:
            print(f'FAILED to process {fn}')
            cell_reports = pd.DataFrame()
        
        cell_reports.reset_index(drop=True, inplace=True)
        return cell_reports

The process_notebook_file() function returns a dataframe containing row level reports for each markdown cell in a specified notebook:

In [37]:

process_notebook_file(TEST_NOTEBOOK)

Out[37]:

	text	n_sents	n_words	n_chars	n_syllables	n_unique_words	n_long_words	n_monosyllable_words	n_polysyllable_words	flesch_kincaid_grade_level	...	reading_time_s	reading_time_mins	mean_sentence_length	median_sentence_length	stdev_sentence_length	keyterms	acronyms	cell_count	cell_type	filename
0	# Test Notebook for Notebook Profiler\n\nThis ...	4	40	203	63	27	15	25	6	6.895000	...	25.0	1	11.00	9.5	4.966555	[(notebook profiler, 0.08196495093971548), (te...	{}	0	md	Notebook_profile_test.ipynb
1	## Markdown Cells With Code Blocks\n\nThis cel...	4	30	123	38	22	5	23	1	2.281667	...	18.0	1	8.25	5.5	8.261356	[(single code block, 0.09825762538579677), (Ma...	{}	1	md	Notebook_profile_test.ipynb
2	This cell contains two code blocks.\n\nHere's ...	8	49	173	58	23	6	41	1	0.766097	...	28.0	1	6.50	6.0	4.105745	[(code block, 0.052250174985765105), (import p...	{}	2	md	Notebook_profile_test.ipynb

3 rows × 40 columns

Analysing Multiple Notebooks in the Same Directory¶

As well as analysing notebooks at the notebook level, we may also want to generate individual and aggregated reports for all the notebooks contained in a single directory.

Aggregated reports might include the total estimated time to work through all the notebooks in the directory, for example.

It might be useful to have one entry point and a switch that selects between the notebook summary reports and the full cell level report? Or maybe we should report two dataframes always - aggregated notebook level and individual cell level?

In [38]:

import os

def _nb_dir_file_profiler(path, _f, report=False):
    """Get the profile for a single file on a specified path."""
    f = os.path.join(path, _f)
    if f.endswith('.ipynb'):
        if report:
            print(f'Profiling {f}')
        return process_notebook_file(f)
    return pd.DataFrame()
    
def nb_dir_profiler(path):
    """Profile all the notebooks in a specific directory."""
    nb_dir_report = pd.DataFrame()
    for _f in os.listdir(path):
        nb_dir_report = nb_dir_report.append( _nb_dir_file_profiler(path, _f), sort=False )
    #nb_dir_report['path'] = path
    return nb_dir_report   

In [39]:

#nb_dir_profiler('.')

Analysing Notebooks Across Multiple Directories¶

As well as analysing all the notebooks contained within a single directory, we may want to automate the production of reports at the directory level across multiple directories.

In [185]:

def nb_multidir_profiler(path, exclude = 'default'):
    """Profile all the notebooks in a specific directory and in any child directories."""
    
    if exclude == 'default':
        exclude_paths = ['.ipynb_checkpoints', '.git', '.ipynb', '__MACOSX']
    else:
        #If we set exclude, we need to pass it as a list
        exclude_paths = exclude
    nb_multidir_report = pd.DataFrame()
    for _path, dirs, files in os.walk(path):
        #Start walking...
        #If we're in a directory that is not excluded...
        if not set(exclude_paths).intersection(set(_path.split('/'))):
            #Profile that directory...
            nb_dir_report = pd.DataFrame()
            for _f in files:
                nb_dir_report = nb_dir_report.append( _nb_dir_file_profiler(_path, _f), sort=False )
            if not nb_dir_report.empty:
                nb_dir_report['path'] = _path
                nb_multidir_report = nb_multidir_report.append(nb_dir_report, sort=False)
    
    nb_multidir_report = nb_multidir_report.sort_values(by=['path', 'filename'])
    
    nb_multidir_report.reset_index(drop=True, inplace=True)
    
    return nb_multidir_report   
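
The exclusion test in `nb_multidir_profiler()` splits the walked path on `/`, which assumes POSIX-style separators. As a hedged sketch of a more portable test, the path components could be taken from `pathlib.PurePath.parts` instead. The `is_excluded()` helper name is hypothetical and not part of the profiler:

```python
from pathlib import PurePath

def is_excluded(path, exclude_paths):
    """Return True if any component of path appears in exclude_paths."""
    return bool(set(exclude_paths).intersection(PurePath(path).parts))

exclude_paths = ['.ipynb_checkpoints', '.git', '.ipynb', '__MACOSX']
print(is_excluded('notebooks/.ipynb_checkpoints', exclude_paths))
print(is_excluded('notebooks/Part 02 Notebooks', exclude_paths))
```

Matching whole path components, rather than substrings, means a directory is only skipped when one of its ancestors (or the directory itself) has exactly an excluded name.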

In [186]:

TEST_DIR = '../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks'

ddf = nb_multidir_profiler(TEST_DIR)
ddf.head()

Out[186]:

	text	n_sents	n_words	n_chars	n_syllables	n_unique_words	n_long_words	n_monosyllable_words	n_polysyllable_words	flesch_kincaid_grade_level	...	reading_time_mins	mean_sentence_length	median_sentence_length	stdev_sentence_length	keyterms	acronyms	cell_count	cell_type	filename	path
0	# The pandas library: Series and DataFrames	1	6	35	9	6	2	3	0	4.450000	...	1	7.000000	7.0	0.000000	[(Series, 0.12192605097566381), (library, 0.11...	{}	0	md	../Documents/GitHub/tm351-undercertainty/noteb...	../Documents/GitHub/tm351-undercertainty/noteb...
1	Python is a general-purpose scripting language...	10	123	570	188	79	30	75	13	7.242772	...	2	12.500000	12.0	10.058164	[(level datum structure, 0.050604038106577987)...	{}	1	md	../Documents/GitHub/tm351-undercertainty/noteb...	../Documents/GitHub/tm351-undercertainty/noteb...
2	Note there are several libraries that we shall...	3	64	338	92	49	17	45	5	9.692500	...	1	22.333333	26.0	10.016653	[(standard Python code base, 0.058213526433021...	{}	3	md	../Documents/GitHub/tm351-undercertainty/noteb...	../Documents/GitHub/tm351-undercertainty/noteb...
3	## Python recap: lists and dicts	1	5	24	6	5	0	4	0	0.520000	...	1	7.000000	7.0	0.000000	[(Python recap, 0.2923854294015616), (list, 0....	{}	4	md	../Documents/GitHub/tm351-undercertainty/noteb...	../Documents/GitHub/tm351-undercertainty/noteb...
4	Python lists are flexible, mutable, data struc...	1	18	89	28	18	6	11	3	9.785556	...	1	18.000000	18.0	0.000000	[(python list, 0.12775495473120263), (data str...	{}	5	md	../Documents/GitHub/tm351-undercertainty/noteb...	../Documents/GitHub/tm351-undercertainty/noteb...

5 rows × 41 columns

In the grouped report, note that the summed reading time in minutes is likely to overestimate the reading time requirement significantly, because it sums per-cell times that have each been rounded up from seconds to whole minutes. The lower bound given by the summed reading time in seconds relates more closely to the markdown word count.

However, the larger estimate perhaps does also factor in context switching time going from one cell to another. Whilst this may be invisible to the reader if a markdown cell follows a markdown cell, it may be more evident when going from a markdown cell to a code cell. On the other hand, if a markdown cell follows another because there is a change from one subsection to another, there may be a pause for reflection as part of that context switch that is captured by the rounding.
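
The size of the rounding effect can be seen in a small worked example, using the per-cell reading times in seconds from the first three rows of the test report. This is only an illustrative sketch of the arithmetic, not part of the profiler:

```python
import math

# Per-cell reading times in seconds (first three rows of the test report)
reading_time_s = [25.0, 18.0, 28.0]

# Summing per-cell minutes rounds every cell up to a whole minute first
sum_of_rounded_mins = sum(math.ceil(s / 60) for s in reading_time_s)

# Summing the seconds first and rounding once gives the lower estimate
rounded_sum_mins = math.ceil(sum(reading_time_s) / 60)

print(sum_of_rounded_mins, rounded_sum_mins)  # prints: 3 2
```

The same effect explains the directory-level figures reported below, where the summed minutes (169) are roughly three times the summed seconds (3370 seconds, or about 56 minutes).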

In [96]:

ddf.groupby(['path','filename'])[['n_total_code_lines','n_words',
                                  'reading_time_mins', 'reading_time_s' ]].sum()

Out[96]:

		n_total_code_lines	n_words	reading_time_mins	reading_time_s
path	filename
../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.1 Pandas Dataframes.ipynb	0	1763	61	1077.0
	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2 Data file formats.ipynb	0	171	5	107.0
	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.0 Data file formats - file encodings.ipynb	0	706	24	430.0
	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.1 Data file formats - CSV.ipynb	0	1665	41	987.0
	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.2 Data file formats - JSON.ipynb	0	443	17	270.0
	../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks/02.2.3 Data file formats - other.ipynb	0	825	21	499.0

In [101]:

ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins',  'reading_time_s' ]].sum().to_dict(orient='index')
ddf_dict

Out[101]:

{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,
  'reading_time_mins': 169,
  'reading_time_s': 3370.0}}

Reporting Templates¶

It's all very well having the data in a dataframe, but it could be more useful to be able to generate some written reports. So what might an example report look like?

How about something like:

In directory X there were N notebooks. The total markdown wordcount for notebooks in the directory was NN. The total number of lines of code across the notebooks was NN. The total estimated reading time across the notebooks was NN.

At the notebook level:

notebook A: markdown wordcount NN, lines of code NN, estimated reading time NN;

It might also be useful to provide simple rules (cf. linter rules) that raise warnings about notebooks that go against best practice. For example, notebooks with word counts / code line counts or reading or completion times that exceed recommended limits.
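
A rule of that kind might be sketched as follows. The `check_limits()` helper and the limit values are hypothetical placeholders, not recommended thresholds:

```python
def check_limits(summary, limits):
    """Return a warning message for each metric that exceeds its limit."""
    warnings = []
    for metric, limit in limits.items():
        value = summary.get(metric)
        if value is not None and value > limit:
            warnings.append(f'{metric} is {value}, exceeding the limit of {limit}')
    return warnings

# Hypothetical limits; the numbers are placeholders, not recommendations
limits = {'n_words': 5000, 'reading_time_mins': 180}
summary = {'n_words': 5573, 'reading_time_mins': 169}
print(check_limits(summary, limits))
```

Keeping the limits in a dict means the rule set can be changed without touching the checking code.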

Let's start with a simple template:

In [156]:

report_template_simple_md = '''
In directory `{path}` there were {nb_count} notebooks.
The total markdown wordcount for the notebooks in the directory was {n_words} words,
with an estimated total reading time of {reading_time_mins} minutes.
'''

We can feed this from a dict containing fields required by the report template:

In [159]:

#%pip install deepmerge
from deepmerge import always_merger

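#Count the notebooks in each directory
notebook_counts_by_dir = ddf.groupby(['path'])['filename'].nunique().to_dict()
notebook_counts_by_dir = {k:{'nb_count':notebook_counts_by_dir[k]} for k in notebook_counts_by_dir}

report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )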
for k in report_dict:
    report_dict[k]['path'] = k
report_dict

Out[159]:

{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,
  'reading_time_mins': 169,
  'reading_time_s': 3370.0,
  'nb_count': 6,
  'path': '../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks'}}

Feeding the dict to the template generates the report:

In [155]:

report_template_simple_md.format(**report_dict[TEST_DIR])

Out[155]:

'\nIn directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.\nThe total markdown wordcount for the notebooks in the directory was 5573 words,\nwith an estimated total reading time 169 minutes.\n'

Create a function to make it easier to generate the feedstock dict:

In [190]:

def notebook_report_feedstock_md_test(ddf):
    """Create a feedstock dict for report generation. Keyed by directory path."""
    ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins',  'reading_time_s' ]].sum().to_dict(orient='index')
    
    notebook_counts_by_dir = ddf.groupby(['path'])['filename'].nunique().to_dict()
    notebook_counts_by_dir = {k:{'nb_count':notebook_counts_by_dir[k]} for k in notebook_counts_by_dir}
    
    report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )
    
    for k in report_dict:
        report_dict[k]['path'] = k
    
    return report_dict

We can now use the notebook_report_feedstock_md_test() function to generate the feedstock dict directly from the report dataframe:

In [162]:

notebook_report_feedstock_md_test(ddf)

Out[162]:

{'../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks': {'n_words': 5573,
  'reading_time_mins': 169,
  'reading_time_s': 3370.0,
  'nb_count': 6}}

Additional Reporting Levels¶

For additional reports, we could start to look for particular grammatical constructions in the markdown text. The textacy package allows us to define patterns of interest in various ways. Are there any particular constructions that we may want to look out for in an instructional text?

In [44]:

import textacy

#But how do you define the pattern to extract the largest phrase over a sequence of tokens?
verb_phrase = r'(<VERB>?<ADV>*<VERB>+)' #extract.pos_regex_matches DEPRECATED

verb_phrase2 = [{"POS": "VERB", "OP":"?"}, {"POS": "ADV", "OP": "*"},
                {"POS": "VERB", "OP":"+"}] #extract.matches

verb_phrase3 = r'POS:VERB:? POS:ADV:* POS:VERB:+' #extract.matches

[vp for vp in textacy.extract.matches(doc, verb_phrase3)][:5]

Out[44]:

[provides, includes, intended, test, Note]

In [45]:

from collections import Counter
dict(Counter(([token.pos_ for token in doc])))

Out[45]:

{'SYM': 1,
 'PROPN': 4,
 'ADP': 5,
 'SPACE': 3,
 'DET': 6,
 'NOUN': 12,
 'VERB': 7,
 'PUNCT': 3,
 'PRON': 1,
 'CCONJ': 1,
 'PART': 1,
 'ADJ': 1,
 'ADV': 2}

Code Cell Analysis¶

As well as reporting on markdown cells, we can also generate reports on code cells. (We could also use similar techniques to report on code blocks found in markdown cells.)

Possible code cell reports include reporting on:

packages imported into a notebook;
number of lines of code / code comments;
code complexity.

We could also run static analysis tests over all the code loaded into a notebook, for example using things like importchecker to check that imports are actually used.
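
As a rough, standard-library-only illustration of why such a check is better run over all the code in a notebook than cell by cell, the sketch below uses `ast` to list imported names that are never referenced. It is not how importchecker or pyflakes work internally, and the `unused_imports()` helper is hypothetical. It ignores many cases (for example `from x import *`, names used only inside strings, and IPython magics):

```python
import ast

def unused_imports(code):
    """Return imported names that are never referenced in code (a crude check)."""
    tree = ast.parse(code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"
                imported.add(alias.asname or alias.name.split('.')[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)

# Two hypothetical code cells: the import is used, but in a later cell
cells = ['import pandas as pd', "df = pd.DataFrame({'a': [1]})"]

per_cell = [unused_imports(c) for c in cells]
whole_notebook = unused_imports('\n'.join(cells))
print(per_cell, whole_notebook)
```

Checked cell by cell, the import looks unused; checked over the concatenated code, it does not.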

It is also possible to check whether the code cells in a notebook: a) have been run; b) have been run in order. If we extend the analysis to code cell outputs, we could also report on whether cells had been run without warning or error and what sort of output they produced.

Tools such as pyflakes can also be used to run a wider range of static tests over a codebase, as can other code linters. See also Thinking About Things That Might Be Autogradeable or Useful for Automated Marking Support for examples of tests that may be used in autograding, some of which might also be useful for notebook code profiling.

It might also be worth trying to collate possible useful guidelines / heuristics / rules of thumb for creating notebooks that could also provide the basis of quality minded linting checks.

For example:

a markdown cell should always appear before a code cell to set the context for what the code cell is expected to achieve;
a markdown cell commenting on the output of a code cell immediately preceding may be appropriate in some cases;
one function should be defined per code cell. A markdown cell immediately following a code cell that defines a function might include a line of text that might also serve as the function doc text, describing what the function does and prefacing a code cell that demonstrates the behaviour of the function.
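
The first rule above could be checked with something like the following sketch. It works on a plain list of cell types, which could be built from a notebook as `[cell['cell_type'] for cell in nb.cells]`. The helper name is hypothetical, and the rule is interpreted strictly as "the immediately preceding cell must be a markdown cell":

```python
def code_cells_without_preceding_markdown(cell_types):
    """Return the indexes of code cells not immediately preceded by a markdown cell."""
    return [i for i, cell_type in enumerate(cell_types)
            if cell_type == 'code' and (i == 0 or cell_types[i - 1] != 'markdown')]

# Hypothetical notebook structure
cell_types = ['markdown', 'code', 'code', 'markdown', 'code']
print(code_cells_without_preceding_markdown(cell_types))  # prints: [2]
```

A looser reading of the rule might allow a run of code cells after a single markdown cell; that would need a different test.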

Generating code reports over a single notebook¶

Let's start to put together some metrics we can run against code cells, either at an individual level or from code aggregated from across all the code cells in a notebook.

In [46]:

c='''#print\nimport pandas\n\nprint('a')\nimport abjad\nimport numpy as np\nfrom IPython.dsiplay import HTML, JSON'''

#https://github.com/andrewp-as-is/list-imports.py #list imports
#%pip install list-imports
import list_imports
list_imports.parse(c)
#Would also need to capture magics?

# TO DO  - NOT CURRENTLY REPORTED

Out[46]:

['abjad', 'numpy', 'pandas', 'IPython.dsiplay']

Some utilities may not make sense in the reporting when applied at a cell level. For example, it's quite likely that a package imported into a cell may not be used in that cell, which pyflakes would report unfavourably on:

In [47]:

#%pip install pyflakes
#pyflakes seems to print the report, so we'd need to find a way to capture it
from pyflakes.api import check
from pyflakes.reporter import Reporter

import io

output_w = io.StringIO()
output_e = io.StringIO()

check('''import pandas as pd''', 'dummy', Reporter(output_w, output_e))
output_w.getvalue()

Out[47]:

"dummy:1: 'pandas as pd' imported but unused\n"

Another form of analysis that only makes sense at the notebook level is the code cell execution analysis:

In [48]:

# Check execution across notebook - TO DO - NOT CURRENTLY REPORTED
cell_execution_order = []
num_code_cells = 0
for cell in nb.cells:
    if cell['cell_type']=='code':
        cell_execution_order.append(cell['execution_count'])
        num_code_cells = num_code_cells + 1


_executed_cells = [i for i in cell_execution_order if i is not None and isinstance(i,int) ]
in_order_execution = _executed_cells == sorted(_executed_cells)

all_cells_executed = len(_executed_cells)==num_code_cells
print(cell_execution_order, all_cells_executed, in_order_execution,)

[None, None, 1, None, None] False True
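
If this check were to be added to the report, it could be packaged as a function along the following lines. This is a sketch only: the `execution_report()` name and the returned keys are assumptions, and the input is a list of execution counts such as the `cell_execution_order` list built above:

```python
def execution_report(execution_counts):
    """Summarise code cell execution from a list of execution counts (None if not run)."""
    executed = [i for i in execution_counts if isinstance(i, int)]
    return {
        'n_code_cells': len(execution_counts),
        'n_executed_cells': len(executed),
        'all_cells_executed': len(executed) == len(execution_counts),
        # In order means the counts never decrease from one executed cell to the next
        'in_order_execution': executed == sorted(executed),
    }

print(execution_report([None, None, 1, None, None]))
print(execution_report([3, 1, 2]))
```

Note that this test only checks that the counts are increasing; a stricter "run once from top to bottom" check would also require the counts to run 1, 2, 3 and so on without gaps.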

Parsing IPython Code¶

One thing to bear in mind is that code cells may contain cell block magic that switches the cell from the assumed default of Python code to, potentially, a different language. Such cells will fail when we try to load the code into a Python AST parser, so we might want to fall back from the radon metrics when we meet them, or explore whether an IPython parser could be used instead.

Let's try to cleanse IPython directives such as shell commands (! prefix) or magics (% prefix) from a code string so that we can present it to radon.

In [49]:

def sanitise_IPython_code(c):
    """Cleanse an IPython code string so we can parse it with radon."""
    #Comment out magic and shell commands
    c = '\n'.join([f'#{_r}' if _r.lstrip().startswith(('%','!')) else _r for _r in c.splitlines()])
    
    return c

The sanitise_IPython_code() function partially sanitises an IPython code string so that it can be passed to, and parsed by, radon. Note that where magic or shell statements are used on the right hand side of an assignment statement, this will still cause an error.
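
The limitation can be illustrated with Python's own `ast` parser standing in for radon (an assumption made so that the sketch only needs the standard library). The sanitiser is repeated from above so that the block is self-contained, and the `parses_as_python()` helper is hypothetical:

```python
import ast

def sanitise_IPython_code(c):
    """Comment out lines that start with a magic (%) or shell (!) prefix."""
    return '\n'.join([f'#{_r}' if _r.lstrip().startswith(('%', '!')) else _r
                      for _r in c.splitlines()])

def parses_as_python(c):
    """Return True if the sanitised code string parses as Python."""
    try:
        ast.parse(sanitise_IPython_code(c))
        return True
    except SyntaxError:
        return False

# A line magic at the start of a line is commented out, so the code parses
print(parses_as_python('%matplotlib inline\nimport pandas'))

# A shell command on the right hand side of an assignment is left in place
print(parses_as_python('files = !ls'))
```

The first call reports that the code parses; the second does not, because the line does not start with `!` and so is passed through unchanged.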

In [50]:

#Use the `radon` analyzer
#%pip install radon
from radon.raw import analyze

c = '''%load_ext magic\nimport pandas\n\n!ls\nprint(a)'''
c = sanitise_IPython_code(c)

print(c)
n_total_code_lines, n_blank_code_lines, \
    n_single_line_comment_code_lines, n_code_lines = r_analyze(sanitise_IPython_code(c))

n_total_code_lines, n_blank_code_lines, n_single_line_comment_code_lines, n_code_lines

#%load_ext magic
import pandas

#!ls
print(a)

Out[50]:

(5, 1, 2, 2)

To parse a code cell, we can try to use the radon analyser with a sanitised code string, or fall back to using the simpler custom code analyser. It will also be convenient to return the results as a Python dict object.

In [51]:

def robust_code_cell_analyse(c, parser='radon'):
    """Use the `radon` code analyser if we can else fall back to the simple custom code analyser."""
    
    def cleansed_radon(c):
        return r_analyze(sanitise_IPython_code(c))
    
    if c.startswith('%%'):
        #use local code analyser
        parser = 'local'

    if parser == 'radon':
        try:
            _response = cleansed_radon(c)
        except Exception:
            #fallback to simple analyser
            _response = code_block_report(c)
    else:
        _response = code_block_report(c)
        
    (n_total_code_lines, n_blank_code_lines, \
         n_single_line_comment_code_lines, n_code_lines) = _response
    
    _reading_time = code_reading_time(n_code_lines, n_single_line_comment_code_lines)
    
    response = {
        'n_total_code_lines': n_total_code_lines,
        'n_blank_code_lines': n_blank_code_lines,
        'n_single_line_comment_code_lines': n_single_line_comment_code_lines,
        'n_code_lines': n_code_lines,
        'n_screen_lines':n_total_code_lines,
        'reading_time_s':_reading_time,
        'reading_time_mins': math.ceil(_reading_time/60)
    }
    
    return response

The robust analyser should cope with a variety of strings.

In [52]:

print(robust_code_cell_analyse('import pandas\n\n# comment\n!ls'))
print(robust_code_cell_analyse('%%sql\nSELECT * FROM TABLE'))

{'n_total_code_lines': 4, 'n_blank_code_lines': 1, 'n_single_line_comment_code_lines': 2, 'n_code_lines': 1, 'n_screen_lines': 4, 'reading_time_s': 3, 'reading_time_mins': 1}
{'n_total_code_lines': 2, 'n_blank_code_lines': 0, 'n_single_line_comment_code_lines': 0, 'n_code_lines': 2, 'n_screen_lines': 2, 'reading_time_s': 2, 'reading_time_mins': 1}

We now need to start pulling together a function that we can call to run the basic report and other code cell reports.

In [53]:

def process_notebook_code_text(txt):
    """Generate code cell report."""
    basic_code_report = robust_code_cell_analyse(txt)
    return pd.DataFrame([{'text':txt,
                          **basic_code_report }])

The function generates a single row report dataframe from a code string:

In [54]:

process_notebook_code_text('import pandas\n\n# comment\n!ls')

Out[54]:

	text	n_total_code_lines	n_blank_code_lines	n_single_line_comment_code_lines	n_code_lines	n_screen_lines	reading_time_s	reading_time_mins
0	import pandas\n\n# comment\n!ls	4	1	2	1	4	3	1

In order to process code cells as well as markdown cells in our notebook processor, we will need to build on the process_notebook_md() function to create a more general one. Note that the current approach will give us an inefficient dataframe, column wise, in that whilst each row represents the report from a code cell or a markdown cell, the columns cover reports from both code and markdown cells.
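
The column-wise inefficiency can be seen in a toy example (the two single-row reports are invented, and `pd.concat()` is used here simply to combine them): any column that only applies to one cell type is filled with NaN for rows of the other type.

```python
import pandas as pd

# Hypothetical single-row reports for one markdown cell and one code cell
md_report = pd.DataFrame([{'text': '# Title', 'n_words': 1, 'cell_type': 'md'}])
code_report = pd.DataFrame([{'text': 'import pandas', 'n_code_lines': 1, 'cell_type': 'code'}])

# Combining the reports gives the union of the columns, padded with NaN
combined = pd.concat([md_report, code_report], sort=False, ignore_index=True)
print(combined)
```

One side effect is that integer counts such as `n_words` become floats once NaN values are present in the column.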

In [223]:

def process_notebook(nb, fn=''):
    """Process all the markdown and code cells in a notebook."""
    cell_reports = pd.DataFrame()
    
    for i, cell in enumerate(nb.cells):
        if cell['cell_type']=='markdown':
            _metrics = process_notebook_md_doc( nlp( cell['source'] ))
            _metrics['cell_count'] = i
            _metrics['cell_type'] = 'md'
            cell_reports = cell_reports.append(_metrics, sort=False)
        elif cell['cell_type']=='code':
            _metrics = process_notebook_code_text(cell['source'] )
            _metrics['cell_count'] = i
            _metrics['cell_type'] = 'code'
            cell_reports = cell_reports.append(_metrics, sort=False)
        
    cell_reports['filename'] = fn
    cell_reports.reset_index(drop=True, inplace=True)
    return cell_reports

We should now be able to generate a report that includes statistics from code as well as markdown cells.

In [224]:

report = process_notebook(nb)
report.head(5)

Out[224]:

	text	n_sents	n_words	n_chars	n_syllables	n_unique_words	n_long_words	n_monosyllable_words	n_polysyllable_words	flesch_kincaid_grade_level	...	reading_time_s	reading_time_mins	mean_sentence_length	median_sentence_length	stdev_sentence_length	keyterms	acronyms	cell_count	cell_type
0	# Test Notebook for Notebook Profiler\n\nThis ...	4.0	40.0	203.0	63.0	27.0	15.0	25.0	6.0	6.895000	...	25.0	1	11.00	9.5	4.966555	[(notebook profiler, 0.08196495093971548), (te...	{}	0	md
1	## Markdown Cells With Code Blocks\n\nThis cel...	4.0	30.0	123.0	38.0	22.0	5.0	23.0	1.0	2.281667	...	18.0	1	8.25	5.5	8.261356	[(single code block, 0.09825762538579677), (Ma...	{}	1	md
2	This cell contains two code blocks.\n\nHere's ...	8.0	49.0	173.0	58.0	23.0	6.0	41.0	1.0	0.766097	...	28.0	1	6.50	6.0	4.105745	[(code block, 0.052250174985765105), (import p...	{}	2	md
3	# This is a code cell\nimport pandas\n\n#Creat...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	4.0	1	NaN	NaN	NaN	NaN	NaN	3	code
4	# This is a code cell with a magic...\n\n%matp...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	5.0	1	NaN	NaN	NaN	NaN	NaN	4	code

5 rows × 40 columns

Let's just check what columns we are potentially reporting on:

In [225]:

report.columns

Out[225]:

Index(['text', 'n_sents', 'n_words', 'n_chars', 'n_syllables',
       'n_unique_words', 'n_long_words', 'n_monosyllable_words',
       'n_polysyllable_words', 'flesch_kincaid_grade_level',
       'flesch_reading_ease', 'smog_index', 'gunning_fog_index',
       'coleman_liau_index', 'automated_readability_index', 'lix',
       'gulpease_index', 'wiener_sachtextformel', 'n_headers', 'n_paras',
       'n_screen_lines', 's_lengths', 's_mean', 's_median', 's_stdev',
       'n_code_blocks', 'n_total_code_lines', 'n_code_lines',
       'n_blank_code_lines', 'n_single_line_comment_code_lines',
       'reading_time_s', 'reading_time_mins', 'mean_sentence_length',
       'median_sentence_length', 'stdev_sentence_length', 'keyterms',
       'acronyms', 'cell_count', 'cell_type', 'filename'],
      dtype='object')

And let's see if our directory processor now also includes code cell statistics:

In [226]:

ddf2 = nb_multidir_profiler('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks')
ddf2['cell_type'].value_counts()

Out[226]:

md      160
code    119
Name: cell_type, dtype: int64

Let's also check to see how the code cells are reported:

In [229]:

code_cols = [c for c in ddf2.columns if 'code' in c]
ddf2[ddf2['cell_type']=='code'][code_cols].sum()

Out[229]:

n_code_blocks                         0.0
n_total_code_lines                  390.0
n_code_lines                        228.0
n_blank_code_lines                   25.0
n_single_line_comment_code_lines    137.0
dtype: float64

Generating Reports Across Multiple Directories¶

We are now in a position to start generating rich reports for notebooks across several directories.

Let's grab data for notebooks across an example set of directories:

In [231]:

ddf3 = nb_multidir_profiler('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/')

And just quickly test we can generate a report that summarises the notebooks in each directory:

In [232]:

big_feedstock = notebook_report_feedstock_md_test(ddf3)
report_txt=''
for d in big_feedstock:
    if 'tm351/Part ' in d:
        report_txt = report_txt + '\n\n' + report_template_simple_md.format(**big_feedstock[d])
    
print(report_txt[:500])



In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 01 Notebooks` there were 5 notebooks.
The total markdown wordcount for the notebooks in the directory was 3033.0 words,
with an estimated total reading time of 143 minutes.



In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.
The total markdown wordcount for the notebooks in the directory was 5573.0 words,
with an estimated total reading time of 288 minut

Let's update the report template and the report feedstock function.

First, what shall we report on?

In [210]:

ddf3.columns

Out[210]:

Index(['filename', 'text', 'n_sents', 'n_words', 'n_chars', 'n_syllables',
       'n_unique_words', 'n_long_words', 'n_monosyllable_words',
       'n_polysyllable_words', 'flesch_kincaid_grade_level',
       'flesch_reading_ease', 'smog_index', 'gunning_fog_index',
       'coleman_liau_index', 'automated_readability_index', 'lix',
       'gulpease_index', 'wiener_sachtextformel', 'n_headers', 'n_paras',
       'n_screen_lines', 's_lengths', 's_mean', 's_median', 's_stdev',
       'n_code_blocks', 'n_total_code_lines', 'n_code_lines',
       'n_blank_code_lines', 'n_single_line_comment_code_lines',
       'reading_time_s', 'reading_time_mins', 'mean_sentence_length',
       'median_sentence_length', 'stdev_sentence_length', 'keyterms',
       'acronyms', 'cell_count', 'cell_type', 'path'],
      dtype='object')

Let's make a start on a complete report template...

In [304]:

report_template_full = '''
In directory `{path}` there were {nb_count} notebooks.

- total markdown wordcount {n_words} words across {n_md_cells} markdown cells
- total code line count of {n_total_code_lines} lines of code across {n_code_cells} code cells
  - {n_code_lines} code lines, {n_single_line_comment_code_lines} comment lines and {n_blank_code_lines} blank lines

Estimated total reading time of {reading_time_mins} minutes.

'''

Now let's add those extra requirements to the feedstock generator:

In [300]:

def notebook_report_feedstock(ddf):
    """Create a feedstock dict for report generation. Keyed by directory path."""
    ddf_dict = ddf.groupby(['path'])[['n_words', 'reading_time_mins', 'reading_time_s',
                                     'n_code_lines', 'n_single_line_comment_code_lines',
                                     'n_total_code_lines','n_blank_code_lines']].sum().to_dict(orient='index')
    
    notebook_counts_by_dir = ddf.groupby(['path'])['filename'].nunique().to_dict()
    notebook_counts_by_dir = {k:{'nb_count':notebook_counts_by_dir[k]} for k in notebook_counts_by_dir}
        
    report_dict = always_merger.merge(ddf_dict, notebook_counts_by_dir )
    
    code_cell_counts = ddf[ddf['cell_type']=='code'].groupby(['path']).size().to_dict()
    md_cell_counts = ddf[ddf['cell_type']=='md'].groupby(['path']).size().to_dict()
    
    for k in report_dict:
        report_dict[k]['path'] = k
        report_dict[k]['n_code_cells'] = code_cell_counts[k] if k in code_cell_counts else 'NA'
        report_dict[k]['n_md_cells'] = md_cell_counts[k] if  k in md_cell_counts else 'NA'
    
    return report_dict

Create a wrapper function for generating the report text:

In [301]:

def reporter(df, template, path_filter=''):
    """Generate report text for each directory whose path contains path_filter."""
    feedstock = notebook_report_feedstock(df)
    report_txt=''
    for d in feedstock:
        if path_filter in d:
            report_txt = report_txt + '\n\n' + template.format(**feedstock[d])
    return report_txt

We can now use the reporter() function to generate a report based on filtered paths from a report dataframe and a template:

In [302]:

print(reporter(ddf2, report_template_full, 'tm351/Part '))



In directory `../Documents/GitHub/tm351-undercertainty/notebooks/tm351/Part 02 Notebooks` there were 6 notebooks.

- total markdown wordcount 5573.0 words across 160
- total code line count of 390 lines of code across 119 code cells
  - 228 code lines, 137 comment lines and 25 blank lines

Estimated total reading time of 288 minutes.
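
The word count in the report above is rendered as a float (5573.0), most likely because the `n_words` column contains NaN values for code cells and is therefore summed as a float. One possible tidy-up, sketched here on a single template line rather than the full template, is to add a format specification to the field:

```python
# A single line of the template with a ".0f" format spec on the word count
template_line = 'total markdown wordcount {n_words:.0f} words across {n_md_cells} markdown cells'

line = template_line.format(n_words=5573.0, n_md_cells=160)
print(line)  # prints: total markdown wordcount 5573 words across 160 markdown cells
```

The same specification should not be applied blindly to fields such as `n_code_cells`, which the feedstock function may set to the string 'NA'; formatting a string with `.0f` raises a ValueError.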

In [305]:

print(reporter(ddf3, report_template_full, 'tm351/Part ').replace('../Documents/GitHub/tm351-undercertainty/notebooks/tm351/',''))



In directory `Part 01 Notebooks` there were 5 notebooks.

- total markdown wordcount 3033.0 words across 65 markdown cells
- total code line count of 571 lines of code across 65 code cells
  - 327 code lines, 160 comment lines and 84 blank lines

Estimated total reading time of 143 minutes.




In directory `Part 02 Notebooks` there were 6 notebooks.

- total markdown wordcount 5573.0 words across 160 markdown cells
- total code line count of 390 lines of code across 119 code cells
  - 228 code lines, 137 comment lines and 25 blank lines

Estimated total reading time of 288 minutes.




In directory `Part 03 Notebooks` there were 4 notebooks.

- total markdown wordcount 11027.0 words across 230 markdown cells
- total code line count of 808 lines of code across 181 code cells
  - 606 code lines, 131 comment lines and 72 blank lines

Estimated total reading time of 444 minutes.




In directory `Part 04 Notebooks` there were 8 notebooks.

- total markdown wordcount 11992.0 words across 232 markdown cells
- total code line count of 917 lines of code across 259 code cells
  - 595 code lines, 260 comment lines and 64 blank lines

Estimated total reading time of 518 minutes.




In directory `Part 05 Notebooks` there were 3 notebooks.

- total markdown wordcount 8499.0 words across 105 markdown cells
- total code line count of 978 lines of code across 84 code cells
  - 510 code lines, 322 comment lines and 147 blank lines

Estimated total reading time of 231 minutes.




In directory `Part 07 Notebooks` there were 2 notebooks.

- total markdown wordcount 6024.0 words across 106 markdown cells
- total code line count of 0 lines of code across NA code cells
  - 0 code lines, 0 comment lines and 0 blank lines

Estimated total reading time of 127 minutes.




In directory `Part 08 Notebooks` there were 3 notebooks.

- total markdown wordcount 12612.0 words across 383 markdown cells
- total code line count of 770 lines of code across 155 code cells
  - 563 code lines, 59 comment lines and 163 blank lines

Estimated total reading time of 552 minutes.




In directory `Part 09 Notebooks` there were 3 notebooks.

- total markdown wordcount 9856.0 words across 254 markdown cells
- total code line count of 502 lines of code across 110 code cells
  - 359 code lines, 48 comment lines and 105 blank lines

Estimated total reading time of 384 minutes.




In directory `Part 10 Notebooks` there were 5 notebooks.

- total markdown wordcount 11511.0 words across 303 markdown cells
- total code line count of 802 lines of code across 170 code cells
  - 616 code lines, 66 comment lines and 145 blank lines

Estimated total reading time of 506 minutes.




In directory `Part 11 Notebooks` there were 6 notebooks.

- total markdown wordcount 17442.0 words across 437 markdown cells
- total code line count of 1586 lines of code across 250 code cells
  - 1357 code lines, 86 comment lines and 154 blank lines

Estimated total reading time of 733 minutes.




In directory `Part 12 Notebooks` there were 2 notebooks.

- total markdown wordcount 6570.0 words across 242 markdown cells
- total code line count of 657 lines of code across 160 code cells
  - 570 code lines, 30 comment lines and 53 blank lines

Estimated total reading time of 413 minutes.




In directory `Part 12 Notebooks/optional_part_12` there were 3 notebooks.

- total markdown wordcount 846.0 words across 21 markdown cells
- total code line count of 51 lines of code across 14 code cells
  - 37 code lines, 5 comment lines and 9 blank lines

Estimated total reading time of 39 minutes.




In directory `Part 14 Notebooks` there were 8 notebooks.

- total markdown wordcount 7077.0 words across 148 markdown cells
- total code line count of 825 lines of code across 197 code cells
  - 641 code lines, 105 comment lines and 78 blank lines

Estimated total reading time of 359 minutes.




In directory `Part 15 Notebooks` there were 10 notebooks.

- total markdown wordcount 4434.0 words across 121 markdown cells
- total code line count of 1314 lines of code across 208 code cells
  - 1077 code lines, 108 comment lines and 138 blank lines

Estimated total reading time of 336 minutes.




In directory `Part 16 Notebooks` there were 6 notebooks.

- total markdown wordcount 2214.0 words across 62 markdown cells
- total code line count of 527 lines of code across 123 code cells
  - 454 code lines, 51 comment lines and 22 blank lines

Estimated total reading time of 189 minutes.




In directory `Part 20 Notebooks` there were 2 notebooks.

- total markdown wordcount 2219.0 words across 59 markdown cells
- total code line count of 208 lines of code across 24 code cells
  - 124 code lines, 24 comment lines and 46 blank lines

Estimated total reading time of 84 minutes.




In directory `Part 21 Notebooks` there were 3 notebooks.

- total markdown wordcount 2200.0 words across 64 markdown cells
- total code line count of 426 lines of code across 45 code cells
  - 273 code lines, 44 comment lines and 109 blank lines

Estimated total reading time of 110 minutes.




In directory `Part 22 Notebooks` there were 4 notebooks.

- total markdown wordcount 5431.0 words across 174 markdown cells
- total code line count of 528 lines of code across 100 code cells
  - 355 code lines, 58 comment lines and 109 blank lines

Estimated total reading time of 279 minutes.




In directory `Part 23 Notebooks` there were 3 notebooks.

- total markdown wordcount 7645.0 words across 187 markdown cells
- total code line count of 576 lines of code across 109 code cells
  - 384 code lines, 79 comment lines and 138 blank lines

Estimated total reading time of 312 minutes.




In directory `Part 25 Notebooks` there were 3 notebooks.

- total markdown wordcount 7447.0 words across 119 markdown cells
- total code line count of 890 lines of code across 64 code cells
  - 563 code lines, 181 comment lines and 144 blank lines

Estimated total reading time of 220 minutes.




In directory `Part 26 Notebooks` there were 3 notebooks.

- total markdown wordcount 3993.0 words across 82 markdown cells
- total code line count of 828 lines of code across 45 code cells
  - 535 code lines, 130 comment lines and 153 blank lines

Estimated total reading time of 141 minutes.

Visualising Notebook Structure¶

To provide a glanceable, macroscopic way of comparing the size and structure of multiple notebooks, we can generate a simple visualisation based on screen line counts and colour codes for different cell types or cell state.

Reports that include cell index and a simple line count (for example, reporting the number of code lines or screen lines for markdown) can be rendered directly as linear visualisations showing the overall structure of a notebook.

For example:

markdown: header;
markdown: paragraph;
markdown: code block;
markdown: blank line;
code: code;
code: comment;
code: magic;
code: blank line;
other: other cells.

To profile within a cell requires access to cell internals, or generating a cell profile during cell processing.

However, it's easy enough to generate a view over the code and markdown cells.

Let's start by exploring a simple representation:

In [59]:

import matplotlib.pyplot as plt


fig, ax = plt.subplots()
ax.axis('off')

#Simple representation of lines per cell and cell colour based on cell type
n_c = [(1,'r'),(2,'pink'), (1,'cornflowerblue'), (2,'pink')]

x=0
y=0

for _n_c in n_c:
    _y = y + _n_c[0]
    plt.plot([x,x], [y,_y], _n_c[1], linewidth=5)
    y = _y   #may want to add a gap when moving from one cell to next
plt.gca().invert_yaxis()

We can get the list of cell size and colour tuples from a notebook's report data frame:

In [60]:

VIS_COLOUR_MAP  = {'md':'cornflowerblue','code':'pink'}

def cell_attrib(cell, colour='cell_type', size='n_screen_lines'):
    _colour = VIS_COLOUR_MAP[ cell[colour] ]
    return (cell[size], _colour)

report.apply(cell_attrib, axis=1).to_list()

Out[60]:

[(4, 'cornflowerblue'),
 (3, 'cornflowerblue'),
 (8, 'cornflowerblue'),
 (5, 'pink'),
 (8, 'pink'),
 (1, 'pink'),
 (3, 'pink'),
 (2, 'pink')]

Let's create a function to visualise a notebook based on its list of cell size and colour tuples; we'll also allow it to handle multiple lists:

In [92]:

import math

def nb_vis(cell_map, w=20, gap_boost=1, **kwargs):
    """Visualise notebook gross cell structure."""
    
    def get_gap(cell_map):
        """Automatically set the gap value based on overall length"""
        
        def get_overall_length(cell_map):
            """Get overall length of a notebook."""
            overall_len = 0
            #len(cell_map) gives the number of cells if that's useful too?
            for n_lines, _colour in cell_map:
                overall_len = overall_len + n_lines
            return overall_len

        max_overall_len = 0
        
        if isinstance(cell_map,dict):
            for k in cell_map:
                _overall_len = get_overall_length(cell_map[k])
                max_overall_len = _overall_len if _overall_len > max_overall_len else max_overall_len
        else:
            max_overall_len = get_overall_length(cell_map)

        #Set the gap at 1% of the overall length (rounded up)
        return math.ceil(max_overall_len * 0.01)
        
        
    def plotter(cell_map, x, y, label='', header_gap = 0.2,
                linewidth = 5,
                orientation ='v', gap_colour = 'lightgrey'):
        """Plot visualisation of gross cell structure for a single notebook."""
        
        if orientation =='v':
            plt.text(x, y, label)
            y = y + header_gap
        else:
            plt.text(y, x, label)
            x = x + header_gap
            
        for _cell_map in cell_map:
            _y = y + gap if gap_colour else y
            __y = _y + _cell_map[0] + 1 #Make tiny cells slightly bigger
            
            if orientation =='v':
                X = _X = __X = x
                Y = y
                _Y =_y
                __Y = __y
            else:
                X = y
                _X = _y
                __X = __y
                Y = _Y = __Y = x
            
            #Add a coloured bar between cells
            if y > 0:
                if gap_colour:
                    plt.plot([X,_X],[Y,_Y], gap_colour, linewidth=linewidth)

            
            plt.plot([_X,__X], [_Y,__Y], _cell_map[1], linewidth=linewidth)

            y = __y

    x=0
    y=0
    
    if isinstance(cell_map,list):
        gap = get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w, 1))
        plotter(cell_map, x, y, **kwargs)
    elif isinstance(cell_map,dict):
        gap = get_gap(cell_map) * gap_boost
        fig, ax = plt.subplots(figsize=(w,len(cell_map)))
        for k in cell_map:
            plotter(cell_map[k], x, y, k, **kwargs)
            x = x + 1

    ax.axis('off')
    plt.gca().invert_yaxis()

We can now easily create a simple visualisation of the gross cell structure of the notebook:

In [93]:

cell_mapping = report.apply(cell_attrib, axis=1).to_list()
nb_vis(cell_mapping, orientation='h')

We can also visualise multiple notebooks, labelling each with the notebook name and plotting them against the same length axis so that we can compare notebook sizes and structures directly.

In [94]:

nb_vis({'a':cell_mapping, 'b':cell_mapping[:3],
       'c':cell_mapping+cell_mapping, 'd':cell_mapping,}, orientation='h')

In [88]:

def cell_attribs(cells, colour='cell_type', size='n_screen_lines'):
    return cells.apply(cell_attrib, axis=1, args=(colour,size)).to_list()

zz = ddf.groupby(['filename'])[['cell_type', 'n_screen_lines']].apply(cell_attribs)
nb_vis(zz.to_dict(), orientation='h', gap_boost=1)
#[['n_total_code_lines','n_words','reading_time_mins', 'reading_time_s' ]].sum()

We can also see how they look based on reading time.

In [79]:

zz = ddf.groupby(['filename'])[['cell_type', 'reading_time_s']].apply(cell_attribs,'cell_type','reading_time_s')
nb_vis(zz.to_dict(), orientation='h', gap_boost=2)

Visualising Intra-Cell Structure¶

Within a cell, we might also want to distinguish, for example, paragraphs and code blocks in markdown cells; and comment lines, empty lines, code lines, magic lines / blocks and shell command lines in code cells.
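As a minimal sketch of how code cell lines might be assigned to such categories, the following classifies each line by its leading characters and collapses consecutive lines of the same category into `(n_lines, category)` runs. The function names `classify_code_line` and `profile_code_cell` are hypothetical, and the approach is deliberately naive: it does not recognise that everything after a `%%` cell magic belongs to that magic, and it will miscount lines inside multi-line strings.

```python
def classify_code_line(line):
    """Return a coarse category for a single line of a code cell."""
    stripped = line.strip()
    if not stripped:
        return 'blank'
    if stripped.startswith('#'):
        return 'comment'
    if stripped.startswith('%'):
        return 'magic'
    if stripped.startswith('!'):
        return 'shell'
    return 'code'


def profile_code_cell(source):
    """Collapse a code cell into runs of (n_lines, category) tuples."""
    runs = []
    for line in source.splitlines():
        category = classify_code_line(line)
        if runs and runs[-1][1] == category:
            # Extend the current run rather than starting a new one
            runs[-1] = (runs[-1][0] + 1, category)
        else:
            runs.append((1, category))
    return runs


example = ("%matplotlib inline\n# Load data\nimport pandas as pd\n\n"
           "!ls data\ndf = pd.DataFrame()\ndf.head()")
profile = profile_code_cell(example)
```

Mapping each category to a colour would give a list of size and colour tuples of the same shape as the cell level ones used above.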

Supporting this level of detail may be tricky. A multi-column format is probably best, showing, for example, an approximate "screen's worth" of content in one column, then the next "scroll" down in the next column along.
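One way the multi-column idea might be approached is to split a list of cell size and colour tuples into columns of a fixed height, breaking a cell across columns where it straddles a screen boundary. This is only a sketch: the function name `split_into_screens` and the default of 40 lines per screen are assumptions, not values taken from the profiler.

```python
def split_into_screens(cell_map, screen_lines=40):
    """Split (n_lines, colour) tuples into columns of at most screen_lines lines."""
    if screen_lines < 1:
        raise ValueError("screen_lines must be a positive integer")

    columns = [[]]
    remaining = screen_lines
    for n_lines, colour in cell_map:
        while n_lines > 0:
            if remaining == 0:
                # Current column is full: start the next "scroll" down
                columns.append([])
                remaining = screen_lines
            take = min(n_lines, remaining)
            columns[-1].append((take, colour))
            n_lines -= take
            remaining -= take
    return columns


cell_map = [(4, 'cornflowerblue'), (3, 'cornflowerblue'),
            (8, 'cornflowerblue'), (5, 'pink')]
screens = split_into_screens(cell_map, screen_lines=10)
```

Each column is itself a list of size and colour tuples, so the columns could in principle be plotted side by side in the same way as multiple notebooks.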

BELOW HERE - NOTES AND TO DO¶

In [66]:

#Maintainability index
from radon.metrics import mi_visit

#If True, then count multiline strings as comment lines as well.
#This is not always safe because Python multiline strings are not always docstrings.

multi = True
mi_visit(c,multi)

Out[66]:

100.0

In [67]:

from radon.metrics import mi_parameters
mi_parameters(c, multi)

"""
the Halstead Volume
the Cyclomatic Complexity
the number of LLOC (Logical Lines of Code)
the percent of lines of comment
"""

Out[67]:

'\nthe Halstead Volume\nthe Cyclomatic Complexity\nthe number of LLOC (Logical Lines of Code)\nthe percent of lines of comment\n'

In [68]:

from radon.complexity import cc_visit

#Doesn't like %% or % magic
cc_visit(c)

Out[68]:

[]
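Since the complexity visitor does not cope with `%` or `%%` magics, one possible workaround is to comment out magic and shell lines before passing the source on. The sketch below uses only the standard library; `strip_magics` is a hypothetical helper, not part of radon. Note that commenting lines out will inflate any comment line count, and that a block whose body is just a magic line will no longer parse.

```python
def strip_magics(source):
    """Return cell source with IPython magics and shell escapes commented out."""
    lines = source.splitlines()

    # A cell magic (%%) hands the whole cell to something other than Python,
    # so none of the cell can safely be treated as Python code.
    if lines and lines[0].lstrip().startswith('%%'):
        return '\n'.join('#' + line for line in lines)

    cleaned = []
    for line in lines:
        if line.lstrip().startswith(('%', '!')):
            # Comment out rather than delete so that line numbers are preserved
            cleaned.append('#' + line)
        else:
            cleaned.append(line)
    return '\n'.join(cleaned)


cleaned = strip_magics("%matplotlib inline\nx = 1\n!ls")
```

The cleaned string could then be passed to `cc_visit` or `mi_visit` in place of the raw cell source.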

In [69]:

from radon.metrics import h_visit
h_visit(c)

Out[69]:

Halstead(total=HalsteadReport(h1=0, h2=0, N1=0, N2=0, vocabulary=0, length=0, calculated_length=0, volume=0, difficulty=0, effort=0, time=0.0, bugs=0.0), functions=[])

Checking Notebook Metrics Evolution Over Time¶

The wily package uses radon to produce code quality reports across a git repository history and generate charts showing the evolution of metrics over the lifetime of a repository. This suggests various corollaries:

could we generate wily style measures over the recent history of a notebook code cell?
could we generate wily style temporal measures over all the reports (markdown text, as well as code) generated from a notebook across several commits to a git repository?
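As a rough sketch of the second idea, assume we have already retrieved the JSON text of a notebook at several commits (retrieving it from git is not shown). A metric can then be applied to each version in turn to give a simple time series. The names `code_line_count` and `metric_history` are hypothetical, and the metric here is a plain non-blank code line count rather than anything wily or radon would report.

```python
import json


def code_line_count(nb):
    """Count non-blank lines across the code cells of a notebook dict."""
    total = 0
    for cell in nb.get('cells', []):
        if cell.get('cell_type') != 'code':
            continue
        source = cell.get('source', '')
        # Notebook JSON may store source as a string or as a list of lines
        if isinstance(source, list):
            source = ''.join(source)
        total += sum(1 for line in source.splitlines() if line.strip())
    return total


def metric_history(versions, metric=code_line_count):
    """Apply a metric to each (label, notebook_json_text) pair, in order."""
    return [(label, metric(json.loads(text))) for label, text in versions]


# Two hand-made notebook versions standing in for successive commits
v1 = json.dumps({'cells': [{'cell_type': 'code', 'source': 'x = 1\n'}]})
v2 = json.dumps({'cells': [
    {'cell_type': 'markdown', 'source': '# Title'},
    {'cell_type': 'code', 'source': ['x = 1\n', '\n', 'y = 2\n']},
]})
history = metric_history([('commit-1', v1), ('commit-2', v2)])
```

Any function that takes a notebook dict and returns a number could be swapped in as the `metric`, including ones based on the markdown reports.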

Other Cell Analysis¶

As a placeholder, should we also at least report on a count of cells that are not code or markdown cells?

Also a count of empty cells?
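A minimal sketch of such a count, working directly on a notebook loaded as a dict, might look like the following. The function name `cell_census` is hypothetical, and a cell is treated as empty if its source contains nothing but whitespace, which is an assumption about what "empty" should mean.

```python
from collections import Counter


def cell_census(nb):
    """Count code, markdown, other and empty cells in a notebook dict."""
    counts = Counter()
    for cell in nb.get('cells', []):
        cell_type = cell.get('cell_type', 'unknown')
        kind = cell_type if cell_type in ('code', 'markdown') else 'other'
        counts[kind] += 1

        source = cell.get('source', '')
        # Notebook JSON may store source as a string or as a list of lines
        if isinstance(source, list):
            source = ''.join(source)
        if not source.strip():
            # Empty cells are counted in addition to their cell type
            counts['empty'] += 1
    return dict(counts)


nb = {'cells': [
    {'cell_type': 'markdown', 'source': '# Title'},
    {'cell_type': 'code', 'source': ''},
    {'cell_type': 'raw', 'source': ['some raw text']},
    {'cell_type': 'code', 'source': 'x = 1'},
]}
census = cell_census(nb)
```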

Is this moving towards some sort of notebook linter?

In [ ]: