In the late 19th and early 20th century, Andrew Lang published various collections of fairy tales, starting with The Blue Fairy Book and then progressing though various other colours to The Olive Fairy Book.
This notebook represents a playful aside in trying to build various searchable contexts over the stories.
To begin with, let's start by ingesting the stories into a database and building a full text search over them...
We can download the raw text for each of Lang's Fairy Books from the Sacred Texts website:
import requests
url = "https://www.sacred-texts.com/neu/lfb/index.htm"
html = requests.get(url)
html.text[:1000]
'<HTML>\r\n<HEAD>\r\n<link rel="stylesheet" href="../../css/ista.css"><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">\r\n<link rel="alternate" type="application/rss+xml" title="RSS" href="http://sacred-texts.com/rss/new.xml">\r\n\r\n<META name="description"\r\ncontent="Sacred Texts: Lang Fairy Books">\r\n<META name="keywords"\r\ncontent="Colored Fairy Books Fairy Tales Tale Folklore Folk lore Children Literature">\r\n<TITLE>Sacred-Texts: Lang Fairy Books</TITLE></HEAD>\r\n<BODY>\r\n<table width="800" border="0" align="center" cellpadding="0" cellspacing="0"><tr> \r\n<td height="131" width="200" align="left" valign="top"> \r\n<div align="left"><a href="../../cdshop/index.htm"><img src="../../img/cdad.gif" width="206" height="136" border="0"></a></div>\r\n</td>\r\n<td height="131" width="600" colspan="3"><div align="left"><img src="../../img/menu.jpg" width="600" height="134" usemap="#Map" border="0"><map name="Map"><area shape="rect" coords="33,5,552,78" href="../../index.htm" alt="sacred-texts.co'
By inspection of the HTML, we see the books are in span
tag with a ista-content
class. Digging further, we then notice the links are in c_t
classed span
elements. We can extract them using beautiful soup:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html.content, "html.parser")
# Find the spans containing the links
items_ = soup.find("span", class_="ista-content").find_all("span", class_="c_t")
# And then reduce those to just the links
items_ = [item.find("a")for item in items_]
We can then extract the relative links and generate full links for each book page:
base_url = url.replace("index.htm", "")
links = [(link.text, f"{base_url}{link.get('href')}") for link in items_]
links[:3]
[('The Blue Fairy Book', 'https://www.sacred-texts.com/neu/lfb/bl/index.htm'), ('The Brown Fairy Book', 'https://www.sacred-texts.com/neu/lfb/br/index.htm'), ('The Crimson Fairy Book', 'https://www.sacred-texts.com/neu/lfb/cr/index.htm')]
We could now load each of those pages and then scrape the download link. But, we notice that the download links have a regular pattern: https://www.sacred-texts.com/neu/lfb/bl/blfb.txt.gz
which we can derive from the book pages:
download_links = []
for (_title, _url) in links:
# We need to get the "short" colour name of the book
# which can be found in the URL path...
book_path = _url.split("/")[-2]
zip_fn = f"{book_path}fb.txt.gz"
zip_url = _url.replace("index.htm", zip_fn)
download_links.append((_title, zip_url))
download_links[:3]
[('The Blue Fairy Book', 'https://www.sacred-texts.com/neu/lfb/bl/blfb.txt.gz'), ('The Brown Fairy Book', 'https://www.sacred-texts.com/neu/lfb/br/brfb.txt.gz'), ('The Crimson Fairy Book', 'https://www.sacred-texts.com/neu/lfb/cr/crfb.txt.gz')]
Now we can download and unzip the files...
import urllib
for (_, url) in download_links:
# Create a file name to save file to as the file downloaded from the URL
zip_file = url.split("/")[-1]
urllib.request.urlretrieve(url, zip_file)
!ls
blfb.txt.gz sacred-texts-brfb-brown.txt brfb.txt.gz sacred-texts-crfb-crimson.txt build-db.ipynb sacred-texts-gnfb-green.txt crfb.txt.gz sacred-texts-gyfb-grey.txt demo.db sacred-texts-lifb-lilac.txt gnfb.txt.gz sacred-texts-olfb-olive.txt gyfb.txt.gz sacred-texts-orfb-orange.txt lifb.txt.gz sacred-texts-pifb-pink.txt olfb.txt.gz sacred-texts-refb-red.txt orfb.txt.gz sacred-texts-vifb-violet.txt pifb.txt.gz sacred-texts-yefb-yellow.txt refb.txt.gz vifb.txt.gz sacred-texts-blfb-blue.txt yefb.txt.gz
The following function will read in the contents of a local gzip file:
import gzip
def gzip_txt(fn):
"""Open gzip file and extract text."""
with gzip.open(fn,'rb') as f:
txt = f.read().decode('UTF-8').replace("\r", "")
return txt
gzip_txt('gnfb.txt.gz')[:1000]
'\nThe Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com\n\nTHE GREEN FAIRY BOOK\n\nBy Various\n\nEdited by Andrew Lang\n\nLondon, New York: Longmans, Green, and Co.\n\n[1892]\n\nTo\n\nStella Margaret Alleyne\n\nthe\n\nGreen Fairy Book\n\nis dedicated\n\nThe Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com\n\nContents\n\n[*To the Friendly Reader]\n\n[*The Blue Bird]\n\n[*The Half-Chick]\n\n[*The Story of Caliph Stork]\n\n[*The Enchanted Watch]\n\n[*Rosanella]\n\n[*Sylvain and Jocosa]\n\n[*Fairy Gifts]\n\n[*Prince Narcissus and the Princess Potentilla]\n\n[*Prince Featherhead and the Princess Celandine]\n\n[*The Three Little Pigs]\n\n[*Heart of Ice]\n\n[*The Enchanted Ring]\n\n[*The Snuff-box]\n\n[*The Golden Blackbird]\n\n[*The Little Soldier]\n\n[*The Magic Swan]\n\n[*The Dirty Shepherdess]\n\n[*The Enchanted Snake]\n\n[*The Biter Bit]\n\n[*King Kojata]\n\n[*Prince Fickle and Fair Helena]\n\n[*Puddocky]\n\n[*The Story of Hok Lee and the Dwarfs]\n\n[*The Story of the Three Bears]\n\n[*Prince Vivien and the Princess Placida]\n\n[*Little One'
!ls
blfb.txt.gz sacred-texts-brfb-brown.txt brfb.txt.gz sacred-texts-crfb-crimson.txt build-db.ipynb sacred-texts-gnfb-green.txt crfb.txt.gz sacred-texts-gyfb-grey.txt demo.db sacred-texts-lifb-lilac.txt gnfb.txt.gz sacred-texts-olfb-olive.txt gyfb.txt.gz sacred-texts-orfb-orange.txt lifb.txt.gz sacred-texts-pifb-pink.txt olfb.txt.gz sacred-texts-refb-red.txt orfb.txt.gz sacred-texts-vifb-violet.txt pifb.txt.gz sacred-texts-yefb-yellow.txt refb.txt.gz vifb.txt.gz sacred-texts-blfb-blue.txt yefb.txt.gz
Select one of the books and read in the book text:
txt = gzip_txt('blfb.txt.gz')
# Preview the first 1500 characters
txt[:1500]
'\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nCONTENTS\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRETTEL]\n\n[*SNOW-WHITE AND ROSE-RED]\n\n[*THE GOOSE-GIRL]\n\n[*TOADS AND DIAMONDS]\n\n[*PRINCE DARLING]\n\n[*BLUE BEARD]\n\n[*TRUSTY JOHN]\n\n[*THE BRAVE LITTLE TAILOR]\n\n[*A VOYAGE TO LILLIPUT]\n\n[*THE PRINCESS ON THE GLASS HILL]\n\n[*THE STORY OF PRINCE AHMED AND THE FAIRY PARIBANOU]\n\n[*THE HISTORY OF JACK THE GIANT-KILLER]\n\n[*THE BLACK BULL OF NORROWAY]\n\n[*THE RED ETIN]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\n[f01]\n\nTHE BRONZE RING\n\nOnce upon a time in a certain country there lived a king whose palace was surrounded by a spacious garden. But, t'
Having got the contents, let's now extract all the stories.
Within each book, the stories are delimited by a pattern [fNN]
(for digits N
). We can use this pattern to split out the stories.
import re
# Split on the patter: [fNN]
stories = re.split("\[f\d{2}\]", txt)
# Strip whitespace at start and end
stories = [s.strip("\n") for s in stories]
The contents appear in the first "story chunk" (index 0
) in the text:
stories[0]
'The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nCONTENTS\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRETTEL]\n\n[*SNOW-WHITE AND ROSE-RED]\n\n[*THE GOOSE-GIRL]\n\n[*TOADS AND DIAMONDS]\n\n[*PRINCE DARLING]\n\n[*BLUE BEARD]\n\n[*TRUSTY JOHN]\n\n[*THE BRAVE LITTLE TAILOR]\n\n[*A VOYAGE TO LILLIPUT]\n\n[*THE PRINCESS ON THE GLASS HILL]\n\n[*THE STORY OF PRINCE AHMED AND THE FAIRY PARIBANOU]\n\n[*THE HISTORY OF JACK THE GIANT-KILLER]\n\n[*THE BLACK BULL OF NORROWAY]\n\n[*THE RED ETIN]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com'
Let's pull out the book name:
# The name appears before the first comma
book = stories[0].split(",")[0]
book
'The Blue Fairy Book'
Alternatively, we could extract against a pattern:
#The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com
metadata = parse.parse("{title}, by Andrew Lang, [{date}]{}, at sacred-texts.com", stories[0])
metadata["title"], metadata["date"]
('The Red Fairy Book', '1890')
There are plenty of cribs to help us pull out the contents, although it may not be obviously clear with the early content items whether they are stories or not...
# There is a Contents header, but it may be cased...
# So split in a case insensitive way
boilerplate = re.split('(Contents|CONTENTS)', stories[0])
boilerplate
['The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\nTHE BLUE FAIRY BOOK\n\nby Andrew Lang\n\nLondon, New York: Longmans, Green\n\n[1889]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com\n\n', 'CONTENTS', '\n\n[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRETTEL]\n\n[*SNOW-WHITE AND ROSE-RED]\n\n[*THE GOOSE-GIRL]\n\n[*TOADS AND DIAMONDS]\n\n[*PRINCE DARLING]\n\n[*BLUE BEARD]\n\n[*TRUSTY JOHN]\n\n[*THE BRAVE LITTLE TAILOR]\n\n[*A VOYAGE TO LILLIPUT]\n\n[*THE PRINCESS ON THE GLASS HILL]\n\n[*THE STORY OF PRINCE AHMED AND THE FAIRY PARIBANOU]\n\n[*THE HISTORY OF JACK THE GIANT-KILLER]\n\n[*THE BLACK BULL OF NORROWAY]\n\n[*THE RED ETIN]\n\nThe Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com']
# The name of the book repeats at the end of the content block
# So snip it out...
contents_ = boilerplate[-1].split(book)[0].strip("\n")
contents_
'[*THE BRONZE RING]\n\n[*PRINCE HYACINTH AND THE DEAR LITTLE PRINCESS]\n\n[*EAST OF THE SUN AND WEST OF THE MOON]\n\n[*THE YELLOW DWARF]\n\n[*LITTLE RED RIDING-HOOD]\n\n[*THE SLEEPING BEAUTY IN THE WOOD]\n\n[*CINDERELLA; OR, THE LITTLE GLASS SLIPPER]\n\n[*ALADDIN AND THE WONDERFUL LAMP]\n\n[*THE TALE OF A YOUTH WHO SET OUT TO LEARN WHAT FEAR WAS]\n\n[*RUMPELSTILTZKIN]\n\n[*BEAUTY AND THE BEAST]\n\n[*THE MASTER-MAID]\n\n[*WHY THE SEA IS SALT]\n\n[*THE MASTER CAT; OR, PUSS IN BOOTS]\n\n[*FELICIA AND THE POT OF PINKS]\n\n[*THE WHITE CAT]\n\n[*THE WATER-LILY. THE GOLD-SPINNERS]\n\n[*THE TERRIBLE HEAD]\n\n[*THE STORY OF PRETTY GOLDILOCKS]\n\n[*THE HISTORY OF WHITTINGTON]\n\n[*THE WONDERFUL SHEEP]\n\n[*LITTLE THUMB]\n\n[*THE FORTY THIEVES]\n\n[*HANSEL AND GRETTEL]\n\n[*SNOW-WHITE AND ROSE-RED]\n\n[*THE GOOSE-GIRL]\n\n[*TOADS AND DIAMONDS]\n\n[*PRINCE DARLING]\n\n[*BLUE BEARD]\n\n[*TRUSTY JOHN]\n\n[*THE BRAVE LITTLE TAILOR]\n\n[*A VOYAGE TO LILLIPUT]\n\n[*THE PRINCESS ON THE GLASS HILL]\n\n[*THE STORY OF PRINCE AHMED AND THE FAIRY PARIBANOU]\n\n[*THE HISTORY OF JACK THE GIANT-KILLER]\n\n[*THE BLACK BULL OF NORROWAY]\n\n[*THE RED ETIN]'
We can parse out titles from the contents list based on a pattern:
import parse
contents = parse.findall("[*{}]", contents_)
# The title text available as item.fixed[0]
# Also convert the title to title case
titles = [item.fixed[0].title() for item in contents]
titles
['The Bronze Ring', 'Prince Hyacinth And The Dear Little Princess', 'East Of The Sun And West Of The Moon', 'The Yellow Dwarf', 'Little Red Riding-Hood', 'The Sleeping Beauty In The Wood', 'Cinderella; Or, The Little Glass Slipper', 'Aladdin And The Wonderful Lamp', 'The Tale Of A Youth Who Set Out To Learn What Fear Was', 'Rumpelstiltzkin', 'Beauty And The Beast', 'The Master-Maid', 'Why The Sea Is Salt', 'The Master Cat; Or, Puss In Boots', 'Felicia And The Pot Of Pinks', 'The White Cat', 'The Water-Lily. The Gold-Spinners', 'The Terrible Head', 'The Story Of Pretty Goldilocks', 'The History Of Whittington', 'The Wonderful Sheep', 'Little Thumb', 'The Forty Thieves', 'Hansel And Grettel', 'Snow-White And Rose-Red', 'The Goose-Girl', 'Toads And Diamonds', 'Prince Darling', 'Blue Beard', 'Trusty John', 'The Brave Little Tailor', 'A Voyage To Lilliput', 'The Princess On The Glass Hill', 'The Story Of Prince Ahmed And The Fairy Paribanou', 'The History Of Jack The Giant-Killer', 'The Black Bull Of Norroway', 'The Red Etin']
There seems to be work in progress adding page numbers to books using a pattern of the form [p. ix]
, [p. 1]
, [p. 11]
and so on.
For now, let's create a regular expression substitution to remove those...
example = """[f01]
[p. ix]
THE YELLOW FAIRY BOOK
THE CAT AND THE MOUSE IN PARTNERSHIP
A cat had made acquaintance with a mouse, and had spoken so much of the great love and friendship she felt for her, that at last the Mouse consented to live in the same house with her, and to go shares in the housekeeping. 'But we must provide for the winter or else we shall suffer hunger,' said the Cat. 'You, little Mouse, cannot venture everywhere in case you run at last into a trap.' This good counsel was followed, and a little pot of fat was bought. But they did not know where to put it. At length, after long consultation, the Cat said, 'I know of no place where it could be better put than in the church. No one will trouble to take it away from there. We will hide it in a corner, and we won't touch it till we are in want.' So the little pot was placed in safety; but it was not long before the Cat had a great longing for it, and said to the Mouse, 'I wanted to tell you, little Mouse, that my cousin has a little son, white with brown spots, and she wants me to be godmother to it. Let me go out to-day, and do you take care of the house alone.'
[p. 1]
'Yes, go certainly,' replied the Mouse, 'and when you eat anything good, think of me; I should very much like a drop of the red christening wine.'
But it was all untrue. The Cat had no cousin, and had not been asked to be godmother. She went straight to the church, slunk to the little pot of fat, began to lick it, and licked the top off. Then she took a walk on the roofs of the town, looked at the view, stretched
[P. 22]
herself out in the sun, and licked her lips whenever she thought of the little pot of fat. As soon as it was evening she went home again.
"""
# Example of regex to remove page numbers
re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', example)
"[f01]THE YELLOW FAIRY BOOK\n\nTHE CAT AND THE MOUSE IN PARTNERSHIP\n\nA cat had made acquaintance with a mouse, and had spoken so much of the great love and friendship she felt for her, that at last the Mouse consented to live in the same house with her, and to go shares in the housekeeping. 'But we must provide for the winter or else we shall suffer hunger,' said the Cat. 'You, little Mouse, cannot venture everywhere in case you run at last into a trap.' This good counsel was followed, and a little pot of fat was bought. But they did not know where to put it. At length, after long consultation, the Cat said, 'I know of no place where it could be better put than in the church. No one will trouble to take it away from there. We will hide it in a corner, and we won't touch it till we are in want.' So the little pot was placed in safety; but it was not long before the Cat had a great longing for it, and said to the Mouse, 'I wanted to tell you, little Mouse, that my cousin has a little son, white with brown spots, and she wants me to be godmother to it. Let me go out to-day, and do you take care of the house alone.''Yes, go certainly,' replied the Mouse, 'and when you eat anything good, think of me; I should very much like a drop of the red christening wine.'\n\nBut it was all untrue. The Cat had no cousin, and had not been asked to be godmother. She went straight to the church, slunk to the little pot of fat, began to lick it, and licked the top off. Then she took a walk on the roofs of the town, looked at the view, stretchedherself out in the sun, and licked her lips whenever she thought of the little pot of fat. As soon as it was evening she went home again.\n"
Let's create a function to parse the book for us by pulling together all the previous fragments:
def parse_book(txt):
"""Parse book from text."""
# Get story chunks
stories = re.split("\[f\d{2}\]", txt)
stories = [s.strip("\n") for s in stories]
# Get book name
book = stories[0].split(",")[0]
# Process contents
boilerplate = re.split('(Contents|CONTENTS)', stories[0])
# The name of the book repeats at the end of the content block
# So snip it out...
contents_ = boilerplate[-1].split(book)[0].strip("\n")
# Get titles from contents
titles = [item.fixed[0].title() for item in contents]
return book, stories, titles
Let's create a simple database structure and configure it for full text search:
from sqlite_utils import Database
# While developing the script, recreate database each time...
db_name = "demo.db"
db = Database(db_name, recreate=True)
# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...
db["books"].create({
"book": str,
"title": str,
"text": str,
"last_line": str, # sometimes contains provenance
"provenance": str, # attempt at provenance
}, pk=("book", "title"))
db["books"].enable_fts(["title", "text"], create_triggers=True)
<Table books (book, title, text, last_line, provenance)>
Let's now create a function that can populate our database based on the contents of one of the books:
def extract_book_stories(db_tbl, book, stories, titles=None, quiet=False):
book_items = []
# The titles are from the contents list
# We will actually grab titles from the story
# but the titles grabbed from the contents can be passed in
# if we want to write a check against them.
# Note: there may be punctation differnces in the title in the contents
# and the actual title in the text
for story in stories[1:]:
# Remove the page numbers for now...
story = re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', story).strip("\n")
# Other cleaning
story = re.sub(r'\[\*\d+\s*\]', '', story)
# Get the title from the start of the story text
story_ = story.split("\n\n")
title_ = story_[0].strip()
# Force the title case variant of the title
title = title_.title().replace("'S", "'s")
# Optionally display the titles and the book
if not quiet:
print(f"{title} :: {book}")
# Reassemble the story
text = "\n\n".join(story_[1:])
# Clean out the name of the book if it is in the text
#The Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com
name_ignorecase = re.compile(f"{book}, by Andrew Lang, \[\d*\], at sacred-texts.com", re.IGNORECASE)
text = name_ignorecase.sub('', text).strip()
last_line = text.split("\n")[-1]
provenance_1 = parse.parse('[{}] {provenance}', last_line)
provenance_2 = parse.parse('[{provenance}]', last_line)
provenance_3 = parse.parse('({provenance})', last_line)
provenance_4 = {"provenance":last_line} if len(last_line.split())<7 else {} # Heuristic
provenance_ = provenance_1 or provenance_2 or provenance_3 or provenance_4
provenance = provenance_["provenance"] if provenance_ else ""
book_items.append({"book": book,
"title": title,
"text": text,
"last_line": last_line,
"provenance": provenance})
db_tbl.upsert_all(book_items, pk=("book", "title" ))
We can add the data for a particular book by passing in the titles and stories:
book, stories, titles = parse_book(txt)
extract_book_stories(db["books"], book, stories)
Preface :: The Red Fairy Book The Twelve Dancing Princesses :: The Red Fairy Book The Princess Mayblossom :: The Red Fairy Book Soria Moria Castle :: The Red Fairy Book The Death Of Koshchei The Deathless :: The Red Fairy Book The Black Thief And Knight Of The Glen. :: The Red Fairy Book The Master Thief :: The Red Fairy Book Brother And Sister :: The Red Fairy Book Princess Rosette :: The Red Fairy Book The Enchanted Pig :: The Red Fairy Book The Norka :: The Red Fairy Book The Wonderful Birch :: The Red Fairy Book Jack And The Beanstalk :: The Red Fairy Book The Little Good Mouse :: The Red Fairy Book Graciosa And Percinet :: The Red Fairy Book The Three Princesses Of Whiteland :: The Red Fairy Book The Voice Of Death :: The Red Fairy Book The Six Sillies :: The Red Fairy Book Kari Woodengown :: The Red Fairy Book Drakestail :: The Red Fairy Book The Ratcatcher :: The Red Fairy Book The True History Of Little Golden Hood :: The Red Fairy Book The Golden Branch :: The Red Fairy Book The Three Dwarfs :: The Red Fairy Book Dapplegrim :: The Red Fairy Book The Enchanted Canary :: The Red Fairy Book The Twelve Brothers :: The Red Fairy Book Rapunzel :: The Red Fairy Book The Nettle Spinner :: The Red Fairy Book Farmer Weatherbeard :: The Red Fairy Book Mother Holle :: The Red Fairy Book Minnikin :: The Red Fairy Book Bushy Bride :: The Red Fairy Book Snowdrop :: The Red Fairy Book The Golden Goose :: The Red Fairy Book The Seven Foals :: The Red Fairy Book The Marvellous Musician :: The Red Fairy Book The Story Of Sigurd :: The Red Fairy Book
We can run a full text search over the stories. For example, if we are looking for a story with a king and three sons:
q = 'king "three sons"'
for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
print(story)
{'title': 'The Black Thief And Knight Of The Glen.', 'book': 'The Red Fairy Book'} {'title': 'The Golden Goose', 'book': 'The Red Fairy Book'} {'title': 'The Norka', 'book': 'The Red Fairy Book'} {'title': 'The Seven Foals', 'book': 'The Red Fairy Book'} {'title': 'The Story Of Sigurd', 'book': 'The Red Fairy Book'}
We can also churn through all the books and add them to the database:
import os
for fn in [fn for fn in os.listdir() if fn.endswith(".gz")]:
# Read in book from gzip file
txt = gzip_txt(fn)
# Parse book
book, stories, titles = parse_book(txt)
# Extract stories and add them to the database
extract_book_stories(db["books"], book, stories, quiet=True)
How many stories do we now have with a king and three sons?
print(f"Search on: {q}\n")
for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
print(story)
Search on: king "three sons" {'title': 'The Three Brothers', 'book': 'The Pink Fairy Book'} {'title': 'The Princess Who Was Hidden Underground', 'book': 'The Violet Fairy Book'} {'title': 'The Black Thief And Knight Of The Glen.', 'book': 'The Red Fairy Book'} {'title': 'Blockhead-Hans', 'book': 'The Yellow Fairy Book'} {'title': 'The Golden Lion', 'book': 'The Pink Fairy Book'} {'title': 'The Golden Goose', 'book': 'The Red Fairy Book'} {'title': 'The Master Cat; Or, Puss In Boots', 'book': 'The Blue Fairy Book'} {'title': 'The Enchanted Watch', 'book': 'The Green Fairy Book'} {'title': 'The Story Of The Fair Circassians', 'book': 'The Grey Fairy Book'} {'title': 'The Norka', 'book': 'The Red Fairy Book'} {'title': 'Tritill, Litill, And The Birds', 'book': 'The Crimson Fairy Book'} {'title': 'The Seven Foals', 'book': 'The Red Fairy Book'} {'title': 'The Witch And Her Servants', 'book': 'The Yellow Fairy Book'} {'title': 'The Flying Ship', 'book': 'The Yellow Fairy Book'} {'title': "How The Hermit Helped To Win The King's Daughter", 'book': 'The Pink Fairy Book'} {'title': 'The Prince And The Dragon', 'book': 'The Crimson Fairy Book'} {'title': 'The Story Of Sigurd', 'book': 'The Red Fairy Book'} {'title': "The Magician's Horse", 'book': 'The Grey Fairy Book'} {'title': "The Bird 'Grip'", 'book': 'The Pink Fairy Book'} {'title': 'The Three Treasures Of The Giants', 'book': 'The Orange Fairy Book'} {'title': 'The Golden Mermaid', 'book': 'The Green Fairy Book'} {'title': 'Pinkel The Thief', 'book': 'The Orange Fairy Book'} {'title': 'Jesper Who Herded The Hares', 'book': 'The Violet Fairy Book'} {'title': "Ian, The Soldier's Son", 'book': 'The Orange Fairy Book'} {'title': 'The Story Of Ciccu', 'book': 'The Pink Fairy Book'} {'title': 'The White Cat', 'book': 'The Blue Fairy Book'} {'title': 'The Story Of Prince Ahmed And The Fairy Paribanou', 'book': 'The Blue Fairy Book'}
How about Jack stories?
for story in db["books"].search("Jack", columns=["title", "book"]):
print(story)
{'title': 'The History Of Jack The Giant-Killer', 'book': 'The Blue Fairy Book'} {'title': 'Jack And The Beanstalk', 'book': 'The Red Fairy Book'} {'title': 'Jack My Hedgehog', 'book': 'The Green Fairy Book'} {'title': 'The Three Treasures Of The Giants', 'book': 'The Orange Fairy Book'} {'title': 'Farmer Weatherbeard', 'book': 'The Red Fairy Book'} {'title': 'The Shirt-Collar', 'book': 'The Pink Fairy Book'} {'title': 'To The Friendly Reader', 'book': 'The Green Fairy Book'} {'title': 'Preface', 'book': 'The Red Fairy Book'} {'title': 'Preface', 'book': 'The Orange Fairy Book'} {'title': 'The Princess Mayblossom', 'book': 'The Red Fairy Book'} {'title': 'Tale Of A Tortoise And Of A Mischievous Monkey', 'book': 'The Brown Fairy Book'}
Ah... so maybe Preface is something we could also catch and exclude... And perhaps To The Friendly Reader as a special exception.
Or Hans?
for story in db["books"].search("Hans", columns=["title", "book"]):
print(story)
{'title': "Hans, The Mermaid's Son", 'book': 'The Pink Fairy Book'} {'title': 'The Headless Dwarfs', 'book': 'The Violet Fairy Book'} {'title': 'Blockhead-Hans', 'book': 'The Yellow Fairy Book'} {'title': 'The Underground Workers', 'book': 'The Violet Fairy Book'} {'title': 'The Magic Book', 'book': 'The Orange Fairy Book'} {'title': 'Preface', 'book': 'The Yellow Fairy Book'} {'title': 'The Shirt-Collar', 'book': 'The Pink Fairy Book'} {'title': 'The Goblin And The Grocer', 'book': 'The Pink Fairy Book'} {'title': 'The Flying Trunk', 'book': 'The Pink Fairy Book'} {'title': 'The Snow-Man', 'book': 'The Pink Fairy Book'} {'title': 'The Fir-Tree', 'book': 'The Pink Fairy Book'} {'title': 'The Daughter Of Buk Ettemsuch', 'book': 'The Grey Fairy Book'} {'title': 'The Story Of Halfman', 'book': 'The Violet Fairy Book'} {'title': 'Udea And Her Seven Brothers', 'book': 'The Grey Fairy Book'} {'title': 'The Ugly Duckling', 'book': 'The Orange Fairy Book'} {'title': 'The Snow-Queen', 'book': 'The Pink Fairy Book'}
I seem to recall there may have been some sources at the end of some texts? A quick text for that is to see if there is any mention of Grimm
:
for story in db["books"].search("Grimm", columns=["title", "book"]):
print(story)
{'title': 'Preface', 'book': 'The Crimson Fairy Book'} {'title': 'The Three Brothers', 'book': 'The Pink Fairy Book'} {'title': 'To The Friendly Reader', 'book': 'The Green Fairy Book'} {'title': 'Jorinde And Joringel', 'book': 'The Green Fairy Book'} {'title': 'Spindle, Shuttle, And Needle', 'book': 'The Green Fairy Book'} {'title': 'The Marvellous Musician', 'book': 'The Red Fairy Book'} {'title': 'Rumpelstiltzkin', 'book': 'The Blue Fairy Book'} {'title': 'The Twelve Huntsmen', 'book': 'The Green Fairy Book'} {'title': 'The Riddle', 'book': 'The Green Fairy Book'} {'title': 'Mother Holle', 'book': 'The Red Fairy Book'} {'title': 'The Story Of A Clever Tailor', 'book': 'The Green Fairy Book'} {'title': 'The Three Snake-Leaves', 'book': 'The Green Fairy Book'} {'title': 'The War Of The Wolf And The Fox', 'book': 'The Green Fairy Book'} {'title': 'Rapunzel', 'book': 'The Red Fairy Book'} {'title': 'The White Snake', 'book': 'The Green Fairy Book'} {'title': 'The House In The Wood', 'book': 'The Pink Fairy Book'} {'title': 'The Golden Goose', 'book': 'The Red Fairy Book'} {'title': 'The Three Dogs', 'book': 'The Green Fairy Book'} {'title': 'The Twelve Brothers', 'book': 'The Red Fairy Book'} {'title': 'The Iron Stove', 'book': 'The Yellow Fairy Book'} {'title': 'The Golden Lads', 'book': 'The Green Fairy Book'} {'title': 'The Three Dwarfs', 'book': 'The Red Fairy Book'} {'title': 'Jack My Hedgehog', 'book': 'The Green Fairy Book'} {'title': 'The Crystal Coffin', 'book': 'The Green Fairy Book'} {'title': 'Allerleirauh; Or, The Many-Furred Creature', 'book': 'The Green Fairy Book'} {'title': 'The Goose-Girl', 'book': 'The Blue Fairy Book'} {'title': 'The Three Musicians', 'book': 'The Green Fairy Book'} {'title': 'Snow-White And Rose-Red', 'book': 'The Blue Fairy Book'} {'title': 'Brother And Sister', 'book': 'The Red Fairy Book'} {'title': 'Little One-Eye, Little Two-Eyes, And Little Three-Eyes', 'book': 'The Green Fairy Book'} {'title': 'Hansel And Grettel', 'book': 'The Blue Fairy Book'} {'title': 'The Story Of The Fisherman And His Wife', 'book': 'The Green Fairy Book'} {'title': 'Trusty John', 'book': 'The Blue Fairy Book'} {'title': 'Snowdrop', 'book': 'The Red Fairy Book'} {'title': 'The Golden Mermaid', 'book': 'The Green Fairy Book'} {'title': 'The Tale Of A Youth Who Set Out To Learn What Fear Was', 'book': 'The Blue Fairy Book'}
Okay, so let's check the end of one of those:
for row in db.query('SELECT last_line FROM books WHERE text LIKE "%Grimm%"'):
print(row["last_line"][-200:])
[6] Grimm. [19] Grimm. [22] Grimm. [23] Grimm. [26] Grimm. [29] Grimm. [30] Grimm. [32] Grimm. ut them up in the cellar, but in the morning they shall be led forth into the forest and shall serve a charcoal burner until they have improved, and will never again suffer poor animals to go hungry.' ill and died the two others were so deeply grieved that they were also taken ill and died too. And so, because they had all been so clever, and so fond of each other, they were all laid in one grave. and Mrs. Skovgaard-Pedersen has done 'The Green Knight' from the Danish. I must especially thank Monsieur Macler for permitting us to use some of his Contes Armeniens (Paris: Ernest Leroux, Editeur). [1] Grimm. [1] Grimm. [1] Grimm. [1] Grimm. [1] Grimm. [1] Grimm. ^32:1 Grimm. ffer in colour; language, religion, and almost everything else; but they all love a nursery tale. The stories have mainly been adapted or translated by Mrs. Lang, a few by Miss Lang and Miss Blackley. ill not be dull. So good-bye, and when you have read a fairy book, lend it to other children who have none, or tell them the stories in your own way, which is a very pleasant mode of passing the time. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm. Grimm.
How about some stories that don't reference Grimm?
# This query was used to help iterate the regular expressions used to extract the provenance
for row in db.query('SELECT last_line, provenance FROM books WHERE text NOT LIKE "%Grimm%" LIMIT 10'):
print(row["provenance"],"::", row["last_line"][-200:])
A. L. :: A. L. :: and dancing in summer by the light of the moon. La Princesse Printaniere. Par Mme. d'Aulnoy. :: [1] La Princesse Printaniere. Par Mme. d'Aulnoy. From P. C. Asbjornsen. :: [2] From P. C. Asbjornsen. Ralston. :: [3] Ralston. The Hibernian Tales. :: [4] The Hibernian Tales. From P. C. Asbjornsen. :: [5] From P. C. Asbjornsen. Madame d'Aulnoy. :: [7] Madame d'Aulnoy. Rumanische Marchen ubersetzt von Nite Kremnitz. :: [8] Rumanische Marchen ubersetzt von Nite Kremnitz. :: Then the King was wroth with those sons, and punished them as he thought best. And afterwards three weddings were celebrated.
for row in db.query('SELECT DISTINCT provenance, COUNT(*) AS num FROM books GROUP BY provenance ORDER BY num DESC LIMIT 10'):
print(row["num"], row["provenance"])
126 31 Grimm. 5 Lapplandische Mahrchen. 5 Japanische Marchen. 5 From 'West Highland Tales.' 5 Ehstnische Marchen. 5 Charles Perrault. 4 Volksmarchen der Serben. 4 Madame d'Aulnoy. 4 From Ungarische Mahrchen.
Hmm.. it seemed like there were more mentions of Grimm than that?
So what entities can we find in the stories...?!
Let's load in the spacy
natural language processing toolkit:
#%pip install --upgrade spacy
import spacy
nlp = spacy.load("en_core_web_sm")
And set up a database connection so we can easily run pandas mediated queries:
import pandas as pd
import sqlite3
conn = sqlite3.connect(db_name)
Get a dataframe of data frm the database:
q = "SELECT * FROM books"
df = pd.read_sql(q, conn)
df.head()
book | title | text | last_line | provenance | |
---|---|---|---|---|---|
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. |
1 | The Red Fairy Book | The Twelve Dancing Princesses | I\n\nONCE upon a time there lived in the villa... | and dancing in summer by the light of the moon. | |
2 | The Red Fairy Book | The Princess Mayblossom | ONCE upon a time there lived a King and Queen ... | [1] La Princesse Printaniere. Par Mme. d'Aulnoy. | La Princesse Printaniere. Par Mme. d'Aulnoy. |
3 | The Red Fairy Book | Soria Moria Castle | THERE was once upon a time a couple of folks w... | [2] From P. C. Asbjornsen. | From P. C. Asbjornsen. |
4 | The Red Fairy Book | The Death Of Koshchei The Deathless | IN a certain kingdom there lived a Prince Ivan... | [3] Ralston. | Ralston. |
Now let's have a go at extracting some entities:
# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}
# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)
df.head(10)
book | title | text | last_line | provenance | entities | |
---|---|---|---|---|---|---|
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | {PERSON :: Miss Farquharson, ORG :: La Traditi... |
1 | The Red Fairy Book | The Twelve Dancing Princesses | I\n\nONCE upon a time there lived in the villa... | and dancing in summer by the light of the moon. | {ORDINAL :: fourth, ORG :: Beloeil, TIME :: al... | |
2 | The Red Fairy Book | The Princess Mayblossom | ONCE upon a time there lived a King and Queen ... | [1] La Princesse Printaniere. Par Mme. d'Aulnoy. | La Princesse Printaniere. Par Mme. d'Aulnoy. | {QUANTITY :: only twenty feet, ORDINAL :: four... |
3 | The Red Fairy Book | Soria Moria Castle | THERE was once upon a time a couple of folks w... | [2] From P. C. Asbjornsen. | From P. C. Asbjornsen. | {QUANTITY :: six miles, CARDINAL :: one, TIME ... |
4 | The Red Fairy Book | The Death Of Koshchei The Deathless | IN a certain kingdom there lived a Prince Ivan... | [3] Ralston. | Ralston. | {PERSON :: Prince Ivan, ORG :: the Water of Li... |
5 | The Red Fairy Book | The Black Thief And Knight Of The Glen. | IN times of yore there was a King and a Queen ... | [4] The Hibernian Tales. | The Hibernian Tales. | {PERSON :: Knight, WORK_OF_ART :: The Knight o... |
6 | The Red Fairy Book | The Master Thief | THERE was once upon a time a husbandman who ha... | [5] From P. C. Asbjornsen. | From P. C. Asbjornsen. | {CARDINAL :: one, TIME :: evening, ORG :: Prie... |
7 | The Red Fairy Book | Brother And Sister | BROTHER took sister by the hand and said: `Loo... | [6] Grimm. | Grimm. | {TIME :: one night, TIME :: Next morning, TIME... |
8 | The Red Fairy Book | Princess Rosette | ONCE upon a time there lived a King and Queen ... | [7] Madame d'Aulnoy. | Madame d'Aulnoy. | {PERSON :: Frisk, CARDINAL :: one, TIME :: eve... |
9 | The Red Fairy Book | The Enchanted Pig | ONCE upon a time there lived a King who had th... | [8] Rumanische Marchen ubersetzt von Nite Krem... | Rumanische Marchen ubersetzt von Nite Kremnitz. | {ORG :: Sun, LOC :: North, CARDINAL :: one, DA... |
We should probably just do this once and add an appropriate table of entities to the database...
We can explode these out into a long format dataframe:
from pandas import Series
# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities":"entity"}, inplace=True)
# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)
df_long.head()
book | title | text | last_line | provenance | entity | entity_typ | entity_value | |
---|---|---|---|---|---|---|---|---|
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | PERSON :: Miss Farquharson | PERSON | Miss Farquharson |
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | ORG :: La Tradition | ORG | La Tradition |
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | PERSON :: M. Henri Carnoy | PERSON | M. Henri Carnoy |
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | GPE :: Madame d'Aulnoy | GPE | Madame d'Aulnoy |
0 | The Red Fairy Book | Preface | IN a second gleaning of the fields of Fairy La... | A. L. | A. L. | NORP :: German | NORP | German |
And explore...
df_long["entity_typ"].value_counts()
DATE 2865 CARDINAL 2330 TIME 1783 PERSON 1467 ORG 996 ORDINAL 881 GPE 550 NORP 300 WORK_OF_ART 244 QUANTITY 204 LOC 160 FAC 88 PRODUCT 57 MONEY 45 EVENT 20 LANGUAGE 17 LAW 13 PERCENT 1 Name: entity_typ, dtype: int64
What sort of money has been identified in the stories?
df_long[df_long["entity_typ"]=="MONEY"]["entity_value"].value_counts().head(10)
a penny 10 a hundred dollars 5 three hundred dollars 4 two hundred dollars 3 ten dollars 2 the hundred dollars 2 two thousand dollars 1 a thousand pounds 1 twelve hundred pounds 1 fifty pounds 1 Name: entity_value, dtype: int64
Dollars? Really??? What about gold coins?! Do I need to train a new classifier?! Or was the original text really like that... Or has the text been got at? (Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive? Hmmm.. that could be interesting for when we go on strike...)
What about other quantities?
df_long[df_long["entity_typ"]=="QUANTITY"]["entity_value"].value_counts().head(10)
a mile 9 a few miles 7 several miles 7 twenty miles 6 three miles 4 a hundred miles 4 one foot 4 a few feet 3 half a mile 3 two miles 3 Name: entity_value, dtype: int64
What people have been identified?
df_long[df_long["entity_typ"]=="PERSON"]["entity_value"].value_counts().head(10)
Queen 40 Fairy 28 Majesty 22 King 15 Lang 12 Thou 11 Peter 9 Jack 9 Campbell 8 Hush 8 Name: entity_value, dtype: int64
How about geo-political entities (GPEs)?
df_long[df_long["entity_typ"]=="GPE"]["entity_value"].value_counts().head(10)
Prince 52 Paris 10 France 9 Truly 9 Japan 8 Denmark 6 Kletke 6 Greece 6 Ireland 6 Finland 5 Name: entity_value, dtype: int64
When did things happen?
df_long[df_long["entity_typ"]=="DATE"]["entity_value"].value_counts().head(10)
one day 168 One day 130 three days 78 the day 67 the next day 58 next day 55 a few days 50 The next day 43 Next day 41 all day 35 Name: entity_value, dtype: int64
And how about time considerations?
df_long[df_long["entity_typ"]=="TIME"]["entity_value"].value_counts().head(10)
the night 91 evening 84 a few minutes 79 night 77 one morning 58 next morning 49 all night 41 one night 38 morning 37 midnight 36 Name: entity_value, dtype: int64
How were things organised?
df_long[df_long["entity_typ"]=="ORG"]["entity_value"].value_counts().head(10)
Court 41 Prince 22 Grimm 20 Princess 16 Palace 10 Fairies 8 State 8 Time 8 Troll 7 Giant 7 Name: entity_value, dtype: int64
What's a NORP
? (Ah... Nationalities Or Religious or Political groups.)
df_long[df_long["entity_typ"]=="NORP"]["entity_value"].value_counts().head(10)
German 24 Danish 18 French 13 Russian 11 Indian 10 Christian 9 Italian 8 Portuguese 7 Spanish 7 Persian 6 Name: entity_value, dtype: int64
Have other people generated data sets that can be linked in?