BibTeX Record Generator for archive.org
Simple script to generate a BibTeX record from an archive.org identifier.
Given an archive.org identifier, we can look up its metadata as:
https://archive.org/metadata/IDENTIFIER/metadata
url_ = 'https://archive.org/metadata/{uid}/metadata'
The archive.org
metadata schema is given here.
Relevant fields include:
title
creator
publisher
date (replaces the deprecated year)
volume
description
issn / isbn
subject (subject / topic tags)
rights / possible-copyright-status
We can then attempt to map these fields onto appropriate fields in a BibTeX book entry record. Appropriate fields might include:
author; editor
title
publisher
address
year
volume / number
note
issn / isbn
We can map many of the items directly.
We may later want to parse the volume into a volume and part (which is to say, volume and number). For now, just map the volume literally.
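As a rough sketch of what parsing the volume into a volume and number might look like: the helper name `split_volume` is hypothetical, and the pattern assumes values shaped like `4` or `1, pt. 2`, which will not cover every archive.org record. Anything unrecognised is passed through literally.

```python
import re

def split_volume(volume):
    """Split a volume string into a (volume, number) pair.

    Hypothetical helper: assumes values such as '4' or '1, pt. 2',
    where 'pt', 'part' or 'no' introduces the part number.
    """
    m = re.fullmatch(
        r'\s*(\d+)\s*(?:[,;:]?\s*(?:pt|part|no)\.?\s*(\d+))?\s*',
        volume, flags=re.IGNORECASE)
    if not m:
        # Unrecognised format: keep the literal value as the volume
        return volume, None
    return m.group(1), m.group(2)

print(split_volume('1, pt. 2'))  # ('1', '2')
print(split_volume('4'))         # ('4', None)
```

The second element could then be mapped to the BibTeX number field when it is not None.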
example_id = 'dli.granth.84831'
Get the archive.org
metadata:
import requests
r = requests.get(url_.format(uid=example_id))
result = r.json()['result']
result
{'identifier': 'dli.granth.84831', 'collection': ['digitallibraryindia', 'JaiGyan'], 'creator': 'Groome, Francis Hindes', 'date': '1899', 'language': 'eng', 'mediatype': 'texts', 'publisher': 'Hurst and Blackett, Limited (London)', 'scanner': 'Internet Archive Python library 1.9.0', 'subject': ['Book', ' Customs', ' Etiquette and Folklore'], 'title': 'Gypsy Folk-Tales', 'uploader': 'carl@media.org', 'publicdate': '2020-07-18 01:48:07', 'addeddate': '2020-07-18 01:48:07', 'identifier-access': 'http://archive.org/details/dli.granth.84831', 'identifier-ark': 'ark:/13960/t00093j2k', 'ppi': '300', 'ocr': 'ABBYY FineReader 11.0 (Extended OCR)', 'page_number_confidence': '76.84', 'notes': '<p>This item is part of a library of books, audio, video, and other materials from and about India is curated and maintained by Public Resource. The purpose of this library is to assist the students and the lifelong learners of India in their pursuit of an education so that they may better their status and their opportunities and to secure for themselves and for others justice, social, economic and political.</p> <p>This library has been posted for non-commercial purposes and facilitates fair dealing usage of academic and research materials for private use including research, for criticism and review of the work or of other works and reproduction by teachers and students in the course of instruction. Many of these materials are either unavailable or inaccessible in libraries in India, especially in some of the poorer states and this collection seeks to fill a major gap that exists in access to knowledge.</p> <p>For other collections we curate and more information, please visit the <a href="https://archive.org/details/JaiGyan?and%5B%5D=mediatype%3Acollection" rel="nofollow">Bharat Ek Khoj</a> page. Jai Gyan!</p>', 'noarchivetorrent': 'true', 'curation': '[curator]carl@media.org[/curator][date]20220414194920[/date][state]un-dark[/state][comment]Undarkened by Carl[/comment]'}
bib_data = {}
# Map archive.org field names onto BibTeX field names
bib_map = {"date": "year",
           "description": "note",
           "creator": "author"}
for k in ['title', 'publisher', 'description',
          'volume', 'issn', 'isbn', "date", "creator"]:
    if k in result:
        # Fields without a mapping keep their archive.org name
        k_ = bib_map[k] if k in bib_map else k
        bib_data[k_] = result[k]
bib_data
{'title': 'Gypsy Folk-Tales', 'publisher': 'Hurst and Blackett, Limited (London)', 'year': '1899', 'author': 'Groome, Francis Hindes'}
For now, we're naively mapping the creator on to the author, although we might later want to try to improve author vs. editor resolution.
Note that the creator metadata may be presented as a list of creators, often with birth/death dates, so we may need to tidy that up.
import re
_example = ['Gregory, Lady, 1852-1932',
            'Finn, MacCumaill, 3rd cent',
            'Yeats, W. B. (William Butler), 1865-1939']
for c in _example:
    # Strip date ranges such as " 1852-1932" (raw string avoids invalid-escape warnings)
    print(re.sub(r' \(?[0-9]+-[0-9]+\)?', '', c))
Gregory, Lady,
Finn, MacCumaill, 3rd cent
Yeats, W. B. (William Butler),
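The simple substitution leaves trailing commas behind. A slightly fuller tidy-up might look like the following sketch, which assumes four-digit life dates and joins multiple creators with " and ", the separator BibTeX expects between names. The function name `clean_creators` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def clean_creators(creators):
    """Tidy archive.org creator values into a BibTeX author string."""
    # A single creator may arrive as a plain string rather than a list
    if isinstance(creators, str):
        creators = [creators]
    cleaned = []
    for c in creators:
        # Drop life dates such as ', 1852-1932' or ' (1865-1939)'
        c = re.sub(r',?\s*\(?\d{4}-(?:\d{4})?\)?', '', c)
        # Drop any trailing comma and surrounding whitespace
        cleaned.append(c.strip().rstrip(',').strip())
    # BibTeX separates multiple names with ' and '
    return ' and '.join(cleaned)

example_creators = ['Gregory, Lady, 1852-1932',
                    'Finn, MacCumaill, 3rd cent',
                    'Yeats, W. B. (William Butler), 1865-1939']
print(clean_creators(example_creators))
# Gregory, Lady and Finn, MacCumaill, 3rd cent and Yeats, W. B. (William Butler)
```

Qualifiers such as "3rd cent" or "ill" are left untouched here; handling those would need further rules.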
Create an identifier for the book (we may need to elaborate this to make sure it generates a unique identifier):
bib_id = f"{re.sub('[^0-9a-zA-Z]', '', result['creator']).lower()[:7]}{result['date']}"
bib_data["bib_id"] = bib_id
bib_id
'groomef1899'
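Two items by the same creator in the same year would receive the same key under this scheme. One possible way to keep keys unique, sketched here with a hypothetical helper `unique_bib_id` and an assumed numeric-suffix convention, is to track the keys already issued:

```python
def unique_bib_id(base_id, seen):
    """Return base_id, or base_id with a numeric suffix if already used.

    `seen` is a set of the keys issued so far; it is updated in place.
    """
    candidate = base_id
    n = 1
    while candidate in seen:
        # Try base_2, base_3, ... until an unused key is found
        n += 1
        candidate = f"{base_id}_{n}"
    seen.add(candidate)
    return candidate

seen_ids = set()
print(unique_bib_id('groomef1899', seen_ids))  # groomef1899
print(unique_bib_id('groomef1899', seen_ids))  # groomef1899_2
```

The choice of an underscore and a counter is arbitrary; letter suffixes (a, b, c) would work equally well.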
Use a heuristic to generate the publisher address
...
import parse
# Split values of the form "Publisher (Place)" into publisher and address
addr = parse.parse('{publisher} ({address})', bib_data['publisher'])
if addr:
    bib_data['publisher'] = addr['publisher']
    bib_data['address'] = addr['address']
bib_data
{'title': 'Gypsy Folk-Tales', 'publisher': 'Hurst and Blackett, Limited', 'year': '1899', 'author': 'Groome, Francis Hindes', 'address': 'London', 'bib_id': 'groomef1899'}
We now need to render the data via an appropriate BibTeX template:
from jinja2 import Template
tm = Template("""@book{ {{bib_id}},
title = "{{title}}",
author = "{{author}}",
year = "{{year}}",
{% if volume %}volume = "{{volume}}",{% endif %}
{% if publisher %}publisher = "{{publisher}}",{% endif %}
{% if address %}address = "{{address}}",{% endif %}
{% if isbn %}isbn = "{{isbn}}",{% endif %}
{% if issn %}issn = "{{issn}}",{% endif %}
}
""")
print(tm.render(**bib_data))
@book{ groomef1899, title = "Gypsy Folk-Tales", author = "Groome, Francis Hindes", year = "1899", publisher = "Hurst and Blackett, Limited", address = "London", }
Check that the record parses correctly, and then export it in a well-formatted way:
#%pip install bibtexparser
import bibtexparser
tex_ = bibtexparser.loads(tm.render(**bib_data))
print(bibtexparser.dumps(tex_))
@book{groomef1899, address = {London}, author = {Groome, Francis Hindes}, publisher = {Hurst and Blackett, Limited}, title = {Gypsy Folk-Tales}, year = {1899} }
We can extract archive.org identifiers from a file with the following simple pattern matcher:
with open("irish-legends-finn-oisin.md") as f:
    # Capture everything after /details/ up to the next whitespace
    urls = re.findall(r'https?://archive.org/details/([^\s\n]*)[\s\n]+', f.read())
# Find the unique archive.org identifiers
ids_ = list({u.split("/")[0] for u in urls})
ids_[:3]
['riujournalschoo01acadgoog', 'bub_gb_dE7pMtIozskC', 'popularstudiesin00lond']
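The pattern above only matches URLs that are followed by whitespace, and the set does not preserve the order in which identifiers appear. A variant that relaxes both points might look like the sketch below. It assumes identifiers contain only letters, digits, '.', '_' and '-'; the function name `extract_ids` and the sample text are made up for illustration.

```python
import re

def extract_ids(text):
    """Return unique archive.org identifiers in order of first appearance."""
    # Assumption: identifiers use only letters, digits, '.', '_' and '-'
    found = re.findall(
        r'https?://archive\.org/details/([A-Za-z0-9._-]+)', text)
    # A trailing '.' is more likely sentence punctuation than part of the id
    found = [f.rstrip('.') for f in found]
    # dict.fromkeys de-duplicates while preserving insertion order
    return list(dict.fromkeys(found))

sample = """See https://archive.org/details/popularstudiesin00lond/page/n5/mode/2up
and [this scan](https://archive.org/details/bub_gb_dE7pMtIozskC), then
https://archive.org/details/popularstudiesin00lond again."""

print(extract_ids(sample))
# ['popularstudiesin00lond', 'bub_gb_dE7pMtIozskC']
```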
Let's now put the pieces together to extract a list of archive.org identifiers from a text file, look up the metadata associated with each one, and then generate a full list of BibTeX records for them.
The following functions repackage the sketches shown above:
# Cache requests
import requests_cache
requests_cache.install_cache('.archive_org_metadata')
def get_metadata(uid):
    """Get metadata given an archive.org identifier."""
    r = requests.get(url_.format(uid=uid))
    result = r.json()['result']
    return result

def generate_bib_record(uid):
    """Generate a bibliographic data record
    from archive.org metadata."""
    metadata = get_metadata(uid)
    bib_data = {}
    bib_map = {"date": "year",
               "creator": "author"}
    # Handle a list of creators
    if 'creator' in metadata:
        _creators = metadata['creator'] if isinstance(metadata['creator'], list) \
            else [metadata['creator']]
        _clean_creators = []
        for _c in _creators:
            # Strip birth/death date ranges such as " 1852-1932"
            _clean_creators.append(re.sub(r' \(?[0-9]+-[0-9]+\)?', '', _c))
        metadata['creator'] = ", ".join(_clean_creators)
    if 'creator' in metadata and 'date' in metadata:
        # Create id from the first seven alphanumeric characters plus the date
        record_id = re.sub('[^0-9a-zA-Z]', '',
                           metadata['creator']).lower()[:7]
        bib_id = f"{record_id}{metadata['date']}"
    else:
        # Fall back to the archive.org identifier
        bib_id = uid
    for k in ['creator', 'title', 'publisher',
              'volume', 'issn', 'isbn', "date"]:
        if k in metadata:
            k_ = bib_map[k] if k in bib_map else k
            bib_data[k_] = metadata[k]
    bib_data["bib_id"] = bib_id
    # Try to find publisher address using simple heuristics
    if 'publisher' in bib_data:
        # "Publisher (Place)"
        addr = parse.parse('{publisher} ({address})',
                           bib_data['publisher'])
        if not addr:
            # "Place : Publisher"
            addr = parse.parse('{address} : {publisher}',
                               bib_data['publisher'])
        if addr:
            bib_data['publisher'] = addr['publisher']
            bib_data['address'] = addr['address']
    # Round-trip through bibtexparser to validate and tidy the record
    _bibtex = bibtexparser.loads(tm.render(**bib_data))
    bibtex_ = bibtexparser.dumps(_bibtex)
    return bibtex_
We can now iterate through the identifiers and generate our list of BibTeX records.
We can also add a progress bar to help keep track of how far along we are (making the archive.org requests might take some time...).
from tqdm.notebook import tqdm
records = []
# For tqdm, ensure to update jupyterlab_widgets
for uid in tqdm(ids_):
    records.append(generate_bib_record(uid))
records[:10]
['@book{schoolo1904,\n author = {School of Irish Learning (Dublin , Ireland), Royal Irish Academy},\n publisher = {Royal Irish Academy},\n title = {Ériu: The Journal of the School of Irish Learning, Dublin},\n volume = {1, pt. 2},\n year = {1904}\n}\n', "@book{johnoma1866,\n author = {JOHN O'MAHONY},\n title = {FORAS FEASA AR EIRINN DO REIR AN ATHAR, SEATHRUN CEITING, OLLAMH RE DIADHACHTA.THE HISTORY OF IRELAND, FROM THE GAELIEST PERIOD TO THE ENGLISH INBASION.},\n year = {1866}\n}\n", '@book{popularstudiesin00lond,\n address = {London},\n author = {},\n publisher = {D. Nutt},\n title = {Popular studies in mythology, romance and folklore},\n year = {1899}\n}\n', '@book{ossiani1853,\n address = {Dublin},\n author = {Ossianic Society},\n publisher = {Printed under the direction of the Council},\n title = {Transactions of the Ossianic Society},\n volume = {4},\n year = {1853}\n}\n', '@book{hydedou1890,\n address = {London},\n author = {Hyde, Douglas,, Nutt, Alfred,},\n publisher = {Nutt},\n title = {Beside the fire : a collection of Irish Gaelic folk stories},\n year = {1890}\n}\n', '@book{youngel1910,\n address = {Dublin},\n author = {Young, Ella,, Gonne, Maud,, ill},\n publisher = {Maunsel & company},\n title = {Celtic wonder-tales},\n year = {1910}\n}\n', "@book{barryoc1890,\n author = {Barry O'Connor},\n publisher = {P. J. Kenedy},\n title = {Turf-fire Stories and Fairy Tales of Ireland},\n year = {1890}\n}\n", '@book{wildela1888,\n address = {London},\n author = {Wilde, Lady,.},\n publisher = {Ward and Downey},\n title = {Ancient legends, mystic charms, and superstitions of Ireland : With sketches of the Irish past},\n year = {1888}\n}\n', '@book{ossiani1853,\n address = {Dublin},\n author = {Ossianic Society},\n publisher = {Printed under the direction of the Council},\n title = {Transactions of the Ossianic Society},\n volume = {1},\n year = {1853}\n}\n', '@book{gregory1904,\n address = {London},\n author = {Gregory, Lady,, Finn, MacCumaill, 3rd cent, Yeats, W. 
B. (William Butler),},\n publisher = {J. Murray},\n title = {Gods and fighting men : the story of the Tuatha de Danaan and of the Fiana of Ireland},\n year = {1904}\n}\n']