
`BibTeX` Record Generator for archive.org

A simple script to generate a BibTeX record from an archive.org identifier.

Proof of Concept

Given an archive.org identifier, we can look up its metadata as:

https://archive.org/metadata/IDENTIFIER/metadata

In [1]:

url_ = 'https://archive.org/metadata/{uid}/metadata'

The archive.org metadata schema is given here.

Relevant fields include:

title
creator
publisher
date (replaces the deprecated year)
volume
description
issn / isbn
subject (subject / topic tags)
rights / possible-copyright-status

We can then attempt to map fields onto appropriate fields in a BibTeX book entry record. Appropriate fields might include:

author
editor
title
publisher
address
year
volume / number
note
issn / isbn

We can map many of the items directly.

We may want to parse the volume into a volume and part (which is to say, a volume and number). For now, we just map the volume literally.
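As a rough sketch of what that later parsing step might look like, the following helper splits a volume string into a volume and a number. The function name `split_volume` is hypothetical, and the accepted forms (`4`, `1, pt. 2`, `v. 3 no. 1`) are assumptions about how the field tends to be written rather than anything the metadata schema guarantees:

```python
import re

# Assumed forms: "4", "1, pt. 2", "v. 3 no. 1", "vol. 2, part 1"
_VOLUME_PATTERN = re.compile(
    r'^\s*(?:v(?:ol)?\.?\s*)?(\d+)\s*'
    r'(?:[,;]?\s*(?:pt|part|no|number)\.?\s*(\d+))?\s*$',
    flags=re.IGNORECASE)

def split_volume(volume):
    """Split a volume string into (volume, number).

    Strings that do not fit the assumed forms are returned
    unchanged as the volume, with no number.
    """
    m = _VOLUME_PATTERN.match(volume)
    if not m:
        return volume, None
    return m.group(1), m.group(2)

print(split_volume('1, pt. 2'))
```

Anything the pattern does not recognise falls back to the literal mapping, so the helper should never lose information.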

In [2]:

example_id = 'dli.granth.84831'

Get the archive.org metadata:

In [3]:

import requests

r = requests.get(url_.format(uid=example_id))
result = r.json()['result']
result

Out[3]:

{'identifier': 'dli.granth.84831',
 'collection': ['digitallibraryindia', 'JaiGyan'],
 'creator': 'Groome, Francis Hindes',
 'date': '1899',
 'language': 'eng',
 'mediatype': 'texts',
 'publisher': 'Hurst and Blackett, Limited (London)',
 'scanner': 'Internet Archive Python library 1.9.0',
 'subject': ['Book', ' Customs', ' Etiquette and Folklore'],
 'title': 'Gypsy Folk-Tales',
 'uploader': 'carl@media.org',
 'publicdate': '2020-07-18 01:48:07',
 'addeddate': '2020-07-18 01:48:07',
 'identifier-access': 'http://archive.org/details/dli.granth.84831',
 'identifier-ark': 'ark:/13960/t00093j2k',
 'ppi': '300',
 'ocr': 'ABBYY FineReader 11.0 (Extended OCR)',
 'page_number_confidence': '76.84',
 'notes': '<p>This item is part of a library of books, audio, video, and other materials from and about India is curated and maintained by Public Resource. The purpose of this library is to assist the students and the lifelong learners of India in their pursuit of an education so that they may better their status and their opportunities and to secure for themselves and for others justice, social, economic and political.</p> <p>This library has been posted for non-commercial purposes and facilitates fair dealing usage of academic and research materials for private use including research, for criticism and review of the work or of other works and reproduction by teachers and students in the course of instruction. Many of these materials are either unavailable or inaccessible in libraries in India, especially in some of the poorer states and this collection seeks to fill a major gap that exists in access to knowledge.</p> <p>For other collections we curate and more information, please visit the <a href="https://archive.org/details/JaiGyan?and%5B%5D=mediatype%3Acollection" rel="nofollow">Bharat Ek Khoj</a> page. Jai Gyan!</p>',
 'noarchivetorrent': 'true',
 'curation': '[curator]carl@media.org[/curator][date]20220414194920[/date][state]un-dark[/state][comment]Undarkened by Carl[/comment]'}

In [4]:

bib_data = {}

bib_map = {"date": "year",
           "description": "note",
           "creator": "author"}

for k in ['title', 'publisher', 'description',
          'volume', 'issn', 'isbn', "date", "creator"]:
    if k in result:
        k_ = bib_map[k] if k in bib_map else k 
        bib_data[k_] = result[k]
    
bib_data

Out[4]:

{'title': 'Gypsy Folk-Tales',
 'publisher': 'Hurst and Blackett, Limited (London)',
 'year': '1899',
 'author': 'Groome, Francis Hindes'}

For now, we're naively mapping the creator onto the author, although we might later want to try to improve author vs. editor resolution.

Note that the creator metadata may be presented as a list of creators, often with birth/death dates, so we need to potentially tidy that up.

In [6]:

import re

_example = ['Gregory, Lady, 1852-1932',
             'Finn, MacCumaill, 3rd cent',
             'Yeats, W. B. (William Butler), 1865-1939']

for c in _example:
    # Raw string avoids invalid-escape warnings for \( and \)
    print(re.sub(r' \(?[0-9]+-[0-9]+\)?', '', c))

Gregory, Lady,
Finn, MacCumaill, 3rd cent
Yeats, W. B. (William Butler),

Create an identifier for the book (we may need to elaborate this to make sure it generates a unique identifier):

In [10]:

# Keep only alphanumeric characters (the character class needs the range 0-9)
bib_id = f"{re.sub('[^0-9a-zA-Z]', '', result['creator']).lower()[:7]}{result['date']}"
bib_data["bib_id"] = bib_id

bib_id

Out[10]:

'groomef1899'
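An identifier built from a creator prefix and a year will collide whenever one creator has two items dated to the same year, for example two volumes of a series. One possible way of elaborating the scheme is to append a letter suffix to any repeated identifier. The helper `make_unique_ids` below is a hypothetical sketch. It assumes no more than 26 repeats of any one identifier and does not check whether a suffixed form clashes with an identifier that already exists:

```python
from collections import Counter
from string import ascii_lowercase

def make_unique_ids(ids):
    """Append 'a', 'b', ... to each identifier that occurs more than once."""
    totals = Counter(ids)   # how often each identifier occurs overall
    seen = Counter()        # how many of each we have emitted so far
    unique = []
    for bib_id in ids:
        if totals[bib_id] == 1:
            unique.append(bib_id)
        else:
            # Raises IndexError beyond 26 repeats of one identifier
            suffix = ascii_lowercase[seen[bib_id]]
            seen[bib_id] += 1
            unique.append(f"{bib_id}{suffix}")
    return unique

print(make_unique_ids(['ossiani1853', 'groomef1899', 'ossiani1853']))
```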

Use a heuristic to generate the publisher address...

In [11]:

import parse

addr = parse.parse('{publisher} ({address})', bib_data['publisher'])
if addr:
    bib_data['publisher'] = addr['publisher']
    bib_data['address'] = addr['address']

bib_data

Out[11]:

{'title': 'Gypsy Folk-Tales',
 'publisher': 'Hurst and Blackett, Limited',
 'year': '1899',
 'author': 'Groome, Francis Hindes',
 'address': 'London',
 'bib_id': 'groomef1899'}

We now need to render the data via an appropriate BibTeX template:

In [12]:

from jinja2 import Template

tm = Template("""@book{ {{bib_id}},
  title     = "{{title}}",
  author    = "{{author}}",
  year      = "{{year}}",
  {% if volume %}volume = "{{volume}}",{% endif %}
  {% if publisher %}publisher = "{{publisher}}",{% endif %}
  {% if address %}address = "{{address}}",{% endif %}
  {% if isbn %}isbn = "{{isbn}}",{% endif %}
  {% if issn %}issn = "{{issn}}",{% endif %}
}
""")

print(tm.render(**bib_data))

@book{ groomef1899,
  title     = "Gypsy Folk-Tales",
  author    = "Groome, Francis Hindes",
  year      = "1899",
  
  publisher = "Hurst and Blackett, Limited",
  address = "London",
  
  
  
}
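The template inserts field values verbatim, so a value containing a character that LaTeX treats specially (an `&` in a publisher name, say) will typically cause an error when the bibliography is typeset. A minimal sketch of an escaping step follows. The helper name `escape_bibtex` is hypothetical, and it assumes the archive.org values are not already escaped (an existing `\&` would be escaped a second time):

```python
# Characters LaTeX treats specially in running text, with their escaped forms
_SPECIALS = {'&': r'\&', '%': r'\%', '#': r'\#', '_': r'\_', '$': r'\$'}

def escape_bibtex(value):
    """Backslash-escape LaTeX special characters in a field value."""
    return ''.join(_SPECIALS.get(ch, ch) for ch in str(value))

print(escape_bibtex('Maunsel & company'))
```

It could be applied to every field except the record identifier before rendering, for instance with `{k: v if k == 'bib_id' else escape_bibtex(v) for k, v in bib_data.items()}`.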

Check that the record parses correctly, and then export it in a well-formatted way:

In [13]:

#%pip install bibtexparser
import bibtexparser

tex_ = bibtexparser.loads(tm.render(**bib_data))
print(bibtexparser.dumps(tex_))

@book{groomef1899,
 address = {London},
 author = {Groome, Francis Hindes},
 publisher = {Hurst and Blackett, Limited},
 title = {Gypsy Folk-Tales},
 year = {1899}
}

We can extract archive.org identifiers from a file with the following simple pattern matcher:

In [26]:

with open("irish-legends-finn-oisin.md") as f:
    # Raw string; \s already covers \n, so the character classes are simplified
    urls = re.findall(r'https?://archive\.org/details/(\S*)\s+', f.read())

# Find the unique archive.org identifiers
ids_ = list({u.split("/")[0] for u in urls})
ids_[:3]

Out[26]:

['riujournalschoo01acadgoog', 'bub_gb_dE7pMtIozskC', 'popularstudiesin00lond']
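The simple matcher relies on whitespace following each URL, so it would miss a URL at the very end of the file and would pick up trailing punctuation such as the closing bracket of a Markdown link. A stricter variant is sketched below. The function name `extract_ids` is hypothetical, and the set of characters allowed in an identifier (letters, digits, `.`, `_` and `-`) is an assumption rather than something taken from the archive.org documentation:

```python
import re

# Assumption: identifiers contain only letters, digits, '.', '_' and '-'
ARCHIVE_ID = re.compile(
    r'https?://(?:www\.)?archive\.org/details/([A-Za-z0-9._-]+)')

def extract_ids(text):
    """Return unique archive.org identifiers in order of first appearance."""
    # Strip a full stop that ends a sentence rather than the identifier
    found = (m.rstrip('.') for m in ARCHIVE_ID.findall(text))
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(found))

sample = ("See [this](https://archive.org/details/dli.granth.84831) and "
          "https://archive.org/details/bub_gb_dE7pMtIozskC/page/n5")
print(extract_ids(sample))
```

Unlike the set-based approach, this keeps the identifiers in document order, which makes the resulting record list reproducible between runs.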

Generate a BibTeX Record Collection

Let's now put the pieces together to extract a list of archive.org identifiers from a text file, look up the metadata associated with each one, and then generate a full list of BibTeX records for them.

The following functions are derived from the sketches shown above:

In [33]:

# Cache requests
import requests_cache
requests_cache.install_cache('.archive_org_metadata')

def get_metadata(uid):
    """Get metadata given an archive.org identifier."""
    r = requests.get(url_.format(uid=uid))
    result = r.json()['result']
    return result

def generate_bib_record(uid):
    """Generate a bibliographic data record
       from archive.org metadata."""

    metadata = get_metadata(uid)
    bib_data = {}

    bib_map = {"date": "year",
               "creator": "author"}

    # Handle a list of creators
    if 'creator' in metadata:
        _creators = metadata['creator'] if isinstance(metadata['creator'], list) \
                    else [metadata['creator']]
        _clean_creators = []
        for _c in _creators:
            # Strip birth/death date ranges such as " 1852-1932"
            _clean_creators.append(re.sub(r' \(?[0-9]+-[0-9]+\)?', '', _c))
        metadata['creator'] = ", ".join(_clean_creators)
    if 'creator' in metadata and 'date' in metadata:
        # Create id
        record_id = re.sub('[^0-9a-zA-Z]', '',
                    metadata['creator']).lower()[:7]
        bib_id = f"{record_id}{metadata['date']}"
    else:
        bib_id = uid
        
    for k in ['creator', 'title', 'publisher',
              'volume', 'issn', 'isbn', "date"]:
        if k in metadata:
            k_ = bib_map[k] if k in bib_map else k 
            bib_data[k_] = metadata[k]

    bib_data["bib_id"] = bib_id

    # Try to find publisher address using simple heuristics
    if 'publisher' in bib_data:
        addr = parse.parse('{publisher} ({address})',
                           bib_data['publisher'])
        if not addr:
            addr = parse.parse('{address} : {publisher}',
                           bib_data['publisher'])
        if addr:
            bib_data['publisher'] = addr['publisher']
            bib_data['address'] = addr['address']
        
    _bibtex = bibtexparser.loads(tm.render(**bib_data))
    bibtex_ = bibtexparser.dumps(_bibtex)
    return bibtex_

We can now iterate through the identifiers and generate our list of BibTeX records.

We can also add a progress bar to help keep track of how far along we are (making the archive.org requests might take some time...).

In [34]:

from tqdm.notebook import tqdm

records = []

# For the tqdm notebook progress bar, make sure jupyterlab_widgets is up to date
for uid in tqdm(ids_):
    records.append(generate_bib_record(uid))

records[:10]

  0%|          | 0/69 [00:00<?, ?it/s]

Out[34]:

['@book{schoolo1904,\n author = {School of Irish Learning (Dublin ,  Ireland),  Royal Irish Academy},\n publisher = {Royal Irish Academy},\n title = {Ériu: The Journal of the School of Irish Learning, Dublin},\n volume = {1, pt. 2},\n year = {1904}\n}\n',
 "@book{johnoma1866,\n author = {JOHN O'MAHONY},\n title = {FORAS FEASA AR EIRINN DO REIR AN ATHAR, SEATHRUN CEITING, OLLAMH RE DIADHACHTA.THE HISTORY OF IRELAND, FROM THE GAELIEST PERIOD TO THE ENGLISH INBASION.},\n year = {1866}\n}\n",
 '@book{popularstudiesin00lond,\n address = {London},\n author = {},\n publisher = {D. Nutt},\n title = {Popular studies in mythology, romance and folklore},\n year = {1899}\n}\n',
 '@book{ossiani1853,\n address = {Dublin},\n author = {Ossianic Society},\n publisher = {Printed under the direction of the Council},\n title = {Transactions of the Ossianic Society},\n volume = {4},\n year = {1853}\n}\n',
 '@book{hydedou1890,\n address = {London},\n author = {Hyde, Douglas,, Nutt, Alfred,},\n publisher = {Nutt},\n title = {Beside the fire : a collection of Irish Gaelic folk stories},\n year = {1890}\n}\n',
 '@book{youngel1910,\n address = {Dublin},\n author = {Young, Ella,, Gonne, Maud,, ill},\n publisher = {Maunsel & company},\n title = {Celtic wonder-tales},\n year = {1910}\n}\n',
 "@book{barryoc1890,\n author = {Barry O'Connor},\n publisher = {P. J. Kenedy},\n title = {Turf-fire Stories and Fairy Tales of Ireland},\n year = {1890}\n}\n",
 '@book{wildela1888,\n address = {London},\n author = {Wilde, Lady,.},\n publisher = {Ward and Downey},\n title = {Ancient legends, mystic charms, and superstitions of Ireland : With sketches of the Irish past},\n year = {1888}\n}\n',
 '@book{ossiani1853,\n address = {Dublin},\n author = {Ossianic Society},\n publisher = {Printed under the direction of the Council},\n title = {Transactions of the Ossianic Society},\n volume = {1},\n year = {1853}\n}\n',
 '@book{gregory1904,\n address = {London},\n author = {Gregory, Lady,, Finn, MacCumaill, 3rd cent, Yeats, W. B. (William Butler),},\n publisher = {J. Murray},\n title = {Gods and fighting men : the story of the Tuatha de Danaan and of the Fiana of Ireland},\n year = {1904}\n}\n']
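To make the collection usable from LaTeX, the record strings can be saved into a single `.bib` file. The following is one possible sketch: the function name `write_bib_file` and the filename `archive_org.bib` are hypothetical, and the sketch assumes each item in the list is a complete BibTeX entry string like those shown above:

```python
from pathlib import Path

def write_bib_file(records, path):
    """Write BibTeX record strings to path, separated by blank lines."""
    # Normalise surrounding whitespace so each record ends with one newline
    text = "\n".join(r.strip() + "\n" for r in records)
    Path(path).write_text(text, encoding="utf-8")
    return len(records)

# Usage, assuming `records` is the list generated above:
# write_bib_file(records, "archive_org.bib")
```

Writing as UTF-8 keeps accented titles such as `Ériu` intact, although whether they typeset correctly then depends on the LaTeX toolchain in use.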
