#!/usr/bin/env python
# coding: utf-8

# # A Full Text Searchable Database of Lang's Fairy Books
#
# In the late 19th and early 20th century, Andrew Lang published various collections of fairy tales, starting with *The Blue Fairy Book* and then progressing through various other colours to *The Olive Fairy Book*.
#
# This notebook represents a playful aside in trying to build various searchable contexts over the stories.
#
# To begin with, let's ingest the stories into a database and build a full text search over them...

# ## Obtain Source Texts
#
# We can download the raw text for each of Lang's Fairy Books from the Sacred Texts website:

# In[105]:

import requests

url = "https://www.sacred-texts.com/neu/lfb/index.htm"

html = requests.get(url)

html.text[:1000]

# By inspection of the HTML, we see the books are listed in a `span` tag with an `ista-content` class. Digging further, we then notice the links are in `c_t` classed `span` elements. We can extract them using Beautiful Soup:

# In[131]:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.content, "html.parser")

# Find the spans containing the links
items_ = soup.find("span", class_="ista-content").find_all("span", class_="c_t")

# And then reduce those to just the links
items_ = [item.find("a") for item in items_]

# We can then extract the relative links and generate full links for each book page:

# In[134]:

base_url = url.replace("index.htm", "")

links = [(link.text, f"{base_url}{link.get('href')}") for link in items_]

links[:3]

# We could now load each of those pages and then scrape the download link. But we notice that the download links have a regular pattern, for example `https://www.sacred-texts.com/neu/lfb/bl/blfb.txt.gz`, which we can derive from the book page URLs:

# In[139]:

download_links = []

for (_title, _url) in links:
    # We need to get the "short" colour name of the book,
    # which can be found in the URL path...
    book_path = _url.split("/")[-2]
    zip_fn = f"{book_path}fb.txt.gz"
    zip_url = _url.replace("index.htm", zip_fn)
    download_links.append((_title, zip_url))

download_links[:3]

# Now we can download and unzip the files...

# In[165]:

import urllib.request

for (_, url) in download_links:
    # Create a file name to save the downloaded file to, based on the URL
    zip_file = url.split("/")[-1]
    urllib.request.urlretrieve(url, zip_file)

# In[166]:

get_ipython().system('ls')

# The following function will read in the contents of a local gzip file:

# In[1]:

import gzip

def gzip_txt(fn):
    """Open gzip file and extract text."""
    with gzip.open(fn, 'rb') as f:
        txt = f.read().decode('UTF-8').replace("\r", "")

    return txt

gzip_txt('gnfb.txt.gz')[:1000]

# In[179]:

get_ipython().system('ls')

# Select one of the books and read in the book text:

# In[35]:

txt = gzip_txt('blfb.txt.gz')

# Preview the first 1500 characters
txt[:1500]

# ## Extract Stories
#
# Having got the contents, let's now extract all the stories.
#
# Within each book, the stories are delimited by a pattern `[fNN]` (for digits `N`). We can use this pattern to split out the stories.
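# Before applying it to the book text, here is a minimal illustration of the split on a made-up snippet (the snippet below is invented purely for demonstration):

# In[ ]:

import re

demo = "FRONT MATTER\n\n[f01]\n\nFIRST STORY TEXT\n\n[f02]\n\nSECOND STORY TEXT"

# Chunk 0 is the front matter; each subsequent chunk is one story
re.split(r"\[f\d{2}\]", demo)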
# In[36]:

import re

# Split on the pattern: [fNN]
stories = re.split(r"\[f\d{2}\]", txt)

# Strip whitespace at start and end
stories = [s.strip("\n") for s in stories]

# ## Extract the Contents
#
# The contents appear in the first "story chunk" (index `0`) in the text:

# In[37]:

stories[0]

# Let's pull out the book name:

# In[38]:

# The name appears before the first comma
book = stories[0].split(",")[0]
book

# Alternatively, we could extract it against a pattern:

# In[376]:

import parse

# The Blue Fairy Book, by Andrew Lang, [1889], at sacred-texts.com
metadata = parse.parse("{title}, by Andrew Lang, [{date}]{}, at sacred-texts.com", stories[0])
metadata["title"], metadata["date"]

# There are plenty of cribs to help us pull out the contents, although with the early content items it may not be obvious whether they are stories or not...

# In[39]:

# There is a Contents header, but its case may vary...
# So split in a case insensitive way
boilerplate = re.split('(Contents|CONTENTS)', stories[0])
boilerplate

# In[40]:

# The name of the book repeats at the end of the content block
# So snip it out...
contents_ = boilerplate[-1].split(book)[0].strip("\n")
contents_

# We can parse out titles from the contents list based on a pattern:

# In[41]:

contents = parse.findall("[*{}]", contents_)

# The title text is available as item.fixed[0]
# Also convert the title to title case
titles = [item.fixed[0].title() for item in contents]
titles

# ## Coping With Page Numbers
#
# There seems to be work in progress adding page numbers to books using a pattern of the form `[p. ix]`, `[p. 1]`, `[p. 11]` and so on.
#
# For now, let's create a regular expression substitution to remove those...

# In[94]:

example = """[f01]

[p. ix]

THE YELLOW FAIRY BOOK

THE CAT AND THE MOUSE IN PARTNERSHIP

A cat had made acquaintance with a mouse, and had spoken so much of the great love and friendship she felt for her, that at last the Mouse consented to live in the same house with her, and to go shares in the housekeeping. 'But we must provide for the winter or else we shall suffer hunger,' said the Cat. 'You, little Mouse, cannot venture everywhere in case you run at last into a trap.' This good counsel was followed, and a little pot of fat was bought. But they did not know where to put it. At length, after long consultation, the Cat said, 'I know of no place where it could be better put than in the church. No one will trouble to take it away from there. We will hide it in a corner, and we won't touch it till we are in want.' So the little pot was placed in safety; but it was not long before the Cat had a great longing for it, and said to the Mouse, 'I wanted to tell you, little Mouse, that my cousin has a little son, white with brown spots, and she wants me to be godmother to it. Let me go out to-day, and do you take care of the house alone.'

[p. 1]

'Yes, go certainly,' replied the Mouse, 'and when you eat anything good, think of me; I should very much like a drop of the red christening wine.' But it was all untrue. The Cat had no cousin, and had not been asked to be godmother. She went straight to the church, slunk to the little pot of fat, began to lick it, and licked the top off. Then she took a walk on the roofs of the town, looked at the view, stretched [P. 22] herself out in the sun, and licked her lips whenever she thought of the little pot of fat. As soon as it was evening she went home again.
"""

# Example of regex to remove page numbers
re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', example)
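# Note that the substitution above only removes page markers that sit on their own line, followed by a blank line; an inline marker such as the `[P. 22]` in the example survives. A looser, optional extra pass could catch those too (this is just a sketch, not something used in the pipeline below):

# In[ ]:

# First apply the original substitution, then strip any page markers
# left embedded in running text
cleaned = re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', example)
cleaned = re.sub(r'\[[pP]\.\s*[^\]]*\]\s*', '', cleaned)
cleaned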
# ## Pulling the Parser Together
#
# Let's create a function to parse a book for us by pulling together all the previous fragments:

# In[42]:

def parse_book(txt):
    """Parse book from text."""
    # Get story chunks
    stories = re.split(r"\[f\d{2}\]", txt)
    stories = [s.strip("\n") for s in stories]

    # Get book name
    book = stories[0].split(",")[0]

    # Process contents
    boilerplate = re.split('(Contents|CONTENTS)', stories[0])

    # The name of the book repeats at the end of the content block
    # So snip it out...
    contents_ = boilerplate[-1].split(book)[0].strip("\n")

    # Get titles from contents
    contents = parse.findall("[*{}]", contents_)
    titles = [item.fixed[0].title() for item in contents]

    return book, stories, titles

# ## Create Simple Database Structure
#
# Let's create a simple database structure and configure it for full text search:

# In[298]:

from sqlite_utils import Database

# While developing the script, recreate the database each time...
db_name = "demo.db"
db = Database(db_name, recreate=True)

# This schema has been evolved iteratively as I have identified structure
# that can be usefully mined...
db["books"].create({
    "book": str,
    "title": str,
    "text": str,
    "last_line": str,  # sometimes contains provenance
    "provenance": str,  # attempt at provenance
}, pk=("book", "title"))

db["books"].enable_fts(["title", "text"], create_triggers=True)

# ## Build Database
#
# Let's now create a function that can populate our database based on the contents of one of the books:

# In[299]:

def extract_book_stories(db_tbl, book, stories, titles=None, quiet=False):
    book_items = []

    # The titles are from the contents list.
    # We will actually grab titles from the story text,
    # but the titles grabbed from the contents can be passed in
    # if we want to write a check against them.
    # Note: there may be punctuation differences between the title in the contents
    # and the actual title in the text.
    for story in stories[1:]:
        # Remove the page numbers for now...
        story = re.sub(r'\n*\[[pP]\. [^\]\s]*\]\n\n', '', story).strip("\n")
        # Other cleaning
        story = re.sub(r'\[\*\d+\s*\]', '', story)

        # Get the title from the start of the story text
        story_ = story.split("\n\n")
        title_ = story_[0].strip()

        # Force the title case variant of the title
        title = title_.title().replace("'S", "'s")

        # Optionally display the titles and the book
        if not quiet:
            print(f"{title} :: {book}")

        # Reassemble the story
        text = "\n\n".join(story_[1:])

        # Clean out the name of the book if it is in the text
        # e.g. The Green Fairy Book, by Andrew Lang, [1892], at sacred-texts.com
        name_ignorecase = re.compile(rf"{book}, by Andrew Lang, \[\d*\], at sacred-texts.com",
                                     re.IGNORECASE)
        text = name_ignorecase.sub('', text).strip()

        last_line = text.split("\n")[-1]

        provenance_1 = parse.parse('[{}] {provenance}', last_line)
        provenance_2 = parse.parse('[{provenance}]', last_line)
        provenance_3 = parse.parse('({provenance})', last_line)
        provenance_4 = {"provenance": last_line} if len(last_line.split()) < 7 else {}  # Heuristic

        provenance_ = provenance_1 or provenance_2 or provenance_3 or provenance_4
        provenance = provenance_["provenance"] if provenance_ else ""

        book_items.append({"book": book, "title": title, "text": text,
                           "last_line": last_line, "provenance": provenance})

    db_tbl.upsert_all(book_items, pk=("book", "title"))

# We can add the data for a particular book by passing in the titles and stories:

# In[300]:

book, stories, titles = parse_book(txt)
extract_book_stories(db["books"], book, stories)
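# As a quick sanity check (an illustrative extra query, not part of the original walkthrough), we can see how many stories were loaded from this first book:

# In[ ]:

# Row count for the books table so far (one row per story)
print(db["books"].count)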
# We can run a full text search over the stories. For example, if we are looking for a story with a king and three sons:

# In[301]:

q = 'king "three sons"'

for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
    print(story)

# We can also churn through all the books and add them to the database:

# In[302]:

import os

for fn in [fn for fn in os.listdir() if fn.endswith(".gz")]:
    # Read in book from gzip file
    txt = gzip_txt(fn)
    # Parse book
    book, stories, titles = parse_book(txt)
    # Extract stories and add them to the database
    extract_book_stories(db["books"], book, stories, quiet=True)

# How many stories do we now have with a king and three sons?

# In[303]:

print(f"Search on: {q}\n")

for story in db["books"].search(db.quote_fts(q), columns=["title", "book"]):
    print(story)

# How about Jack stories?

# In[304]:

for story in db["books"].search("Jack", columns=["title", "book"]):
    print(story)

# Ah... so maybe *Preface* is something we could also catch and exclude... And perhaps *To The Friendly Reader* as a special exception.
#
# Or Hans?

# In[305]:

for story in db["books"].search("Hans", columns=["title", "book"]):
    print(story)

# I seem to recall there may have been some sources given at the end of some texts? A quick test for that is to see if there is any mention of `Grimm`:

# In[306]:

for story in db["books"].search("Grimm", columns=["title", "book"]):
    print(story)

# Okay, so let's check the end of one of those:

# In[307]:

for row in db.query('SELECT last_line FROM books WHERE text LIKE "%Grimm%"'):
    print(row["last_line"][-200:])

# How about some stories that don't reference Grimm?

# In[308]:

# This query was used to help iterate the regular expressions used to extract the provenance
for row in db.query('SELECT last_line, provenance FROM books WHERE text NOT LIKE "%Grimm%" LIMIT 10'):
    print(row["provenance"], "::", row["last_line"][-200:])

# In[309]:

for row in db.query('SELECT DISTINCT provenance, COUNT(*) AS num FROM books GROUP BY provenance ORDER BY num DESC LIMIT 10'):
    print(row["num"], row["provenance"])

# Hmm... it seemed like there were more mentions of Grimm than that?

# ## Entity Extraction...
#
# So what entities can we find in the stories...?!
#
# Let's load in the `spacy` natural language processing toolkit:

# In[310]:

#%pip install --upgrade spacy
import spacy

nlp = spacy.load("en_core_web_sm")

# And set up a database connection so we can easily run *pandas* mediated queries:

# In[311]:

import pandas as pd
import sqlite3

conn = sqlite3.connect(db_name)

# Get a dataframe of the data from the database:

# In[312]:

q = "SELECT * FROM books"
df = pd.read_sql(q, conn)
df.head()

# Now let's have a go at extracting some entities:

# In[347]:

# Extract a set of entities, rather than a list...
get_entities = lambda desc: {f"{entity.label_} :: {entity.text}" for entity in nlp(desc).ents}

# The full run takes some time....
df['entities'] = df["text"].apply(get_entities)

df.head(10)

# *We should probably just do this once and add an appropriate table of entities to the database... (a sketch of one way to do that follows below).*
#
# We can explode these out into a long format dataframe:

# In[349]:

from pandas import Series

# Explode the entities one per row...
df_long = df.explode('entities')
df_long.rename(columns={"entities": "entity"}, inplace=True)

# And then separate out entity type and value
df_long[["entity_typ", "entity_value"]] = df_long["entity"].str.split(" :: ").apply(Series)

df_long.head()

# And explore...

# In[350]:

df_long["entity_typ"].value_counts()
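# As hinted at above, we could persist the extracted entities to their own table so the slow `spacy` pass only needs to run once. A minimal sketch, assuming the `df_long` dataframe and the open `conn` connection from above (the `story_entities` table name is simply made up here):

# In[ ]:

# Keep just the columns we need, dropping rows where no entity was found
entity_rows = df_long[["book", "title", "entity_typ", "entity_value"]].dropna()

# Write them to a new table in the same database, replacing any previous run
entity_rows.to_sql("story_entities", conn, if_exists="replace", index=False)

# Quick check that the table is queryable
pd.read_sql("SELECT * FROM story_entities LIMIT 5", conn)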
# What sort of money has been identified in the stories?

# In[355]:

df_long[df_long["entity_typ"]=="MONEY"]["entity_value"].value_counts().head(10)

# Dollars? Really??? What about gold coins?! Do I need to train a new classifier?! Or was the original text really like that... Or has the text been got at? *(Maybe I should do my own digitisation project to extract the text from copies of the original books on the Internet Archive? Hmmm... that could be interesting for when we go on strike...)*
#
# What about other quantities?

# In[356]:

df_long[df_long["entity_typ"]=="QUANTITY"]["entity_value"].value_counts().head(10)

# What people have been identified?

# In[358]:

df_long[df_long["entity_typ"]=="PERSON"]["entity_value"].value_counts().head(10)

# How about geo-political entities (GPEs)?

# In[359]:

df_long[df_long["entity_typ"]=="GPE"]["entity_value"].value_counts().head(10)

# When did things happen?

# In[360]:

df_long[df_long["entity_typ"]=="DATE"]["entity_value"].value_counts().head(10)

# And how about time considerations?

# In[361]:

df_long[df_long["entity_typ"]=="TIME"]["entity_value"].value_counts().head(10)

# How were things organised?

# In[362]:

df_long[df_long["entity_typ"]=="ORG"]["entity_value"].value_counts().head(10)

# What's a `NORP`? (Ah... *Nationalities Or Religious or Political groups*.)

# In[364]:

df_long[df_long["entity_typ"]=="NORP"]["entity_value"].value_counts().head(10)

# ## Other Things to Link In
#
# Have other people generated data sets that can be linked in?
#
# - http://www.mythfolklore.net/andrewlang/indexbib.htm /via @OnlineCrsLady