There was a question on Twitter about studying the interaction between characters in Sherlock Holmes:
https://twitter.com/merisamartinez/status/577795285746900992
I've been playing a lot with IPython recently for doing this kind of thing and thought I'd take a quick pass at showing what's possible. It's worth emphasizing that this example favours simplicity above most other things and has some clear weaknesses, but it could form the basis for further work.
Let's fetch the contents of Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle; this is similar to what's described in Getting Texts.
import urllib.request
url = "http://www.gutenberg.org/cache/epub/1661/pg1661.txt"
string = urllib.request.urlopen(url).read().decode()
len(string)
594916
We have the full string, but we want to isolate the main body of the text from the Project Gutenberg header and footer.
startText = string.find("THE ADVENTURES OF SHERLOCK HOLMES")
endText = string.find("End of the Project Gutenberg EBook of The Adventures of Sherlock Holmes")
filteredText = string[startText:endText]
len(filteredText)
575134
For the sake of simplicity, let's just treat a double newline (here in the Windows format, \r\n\r\n) as a separator between paragraphs.
import re
paragraphs = re.split(r'\r\n\r\n+', filteredText)
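As a quick sanity check, here's the same split applied to a tiny made-up snippet (not taken from the novel):

```python
import re

# a small sample with Windows-style blank lines between paragraphs
sample = "First paragraph.\r\n\r\nSecond paragraph.\r\n\r\nThird."
paras = re.split(r'\r\n\r\n+', sample)
# paras == ['First paragraph.', 'Second paragraph.', 'Third.']
```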
We now have a list of paragraphs. We'll want to read each paragraph to look for people, and we'll use NLTK's part-of-speech tagger and named-entity chunker for that purpose (see Getting NLTK for more information on installing and using this module).
import nltk
def findPeople(tree, people):
    # collect the words of any PERSON chunk, otherwise recurse into subtrees
    if type(tree) is nltk.tree.Tree and tree.label() == "PERSON":
        people.append(" ".join([word for word, pos in tree]))
    elif (type(tree) is nltk.tree.Tree) or (type(tree) is list):
        for branch in tree:
            findPeople(branch, people)
We'll read through each of our paragraphs and build a list of the sets of people found in paragraphs that mention more than one person (after some simple filtering and normalization).
multi_people = []
for paragraph in paragraphs:
    sentences = nltk.sent_tokenize(paragraph)
    tokenizedSentences = [nltk.word_tokenize(sent) for sent in sentences]
    taggedSentences = [nltk.pos_tag(sent) for sent in tokenizedSentences]
    chunkedSentences = [nltk.ne_chunk(sent) for sent in taggedSentences]
    people = []
    findPeople(chunkedSentences, people)
    # collapse any mention of Sherlock Holmes to "Holmes", strip titles and whitespace
    people = ["Holmes" if ("Holmes" in person or "Sherlock" in person) else person for person in people]
    people = [re.sub(r"(Mr\.|Miss)", "", person) for person in people]
    people = [person.strip() for person in people if person.strip()]
    people = set(people)
    if len(people) > 1:
        multi_people.append(people)
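To see what the normalization steps do, here's the same filtering applied to a small hand-made list of names (hypothetical chunker output, not taken from the text):

```python
import re

# hypothetical raw output from the named-entity chunker
raw = ["Mr. Sherlock Holmes", "Miss Mary Sutherland", "Holmes", "Watson"]
# collapse Sherlock Holmes variants, then strip titles and whitespace
names = ["Holmes" if ("Holmes" in n or "Sherlock" in n) else n for n in raw]
names = [re.sub(r"(Mr\.|Miss)", "", n) for n in names]
names = set(n.strip() for n in names if n.strip())
# names == {'Holmes', 'Mary Sutherland', 'Watson'}
```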
Now we'll go through and prepare a set of relationships that can be counted. We'll create a link between every pair of people in each paragraph. So, if people A, B and C are in a paragraph, we'll have three relationships: A–B, A–C and B–C.
Then we can count how many times these relationships repeat across paragraphs.
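This pairing is equivalent to taking all 2-element combinations of each paragraph's set of people; a sketch of the same idea with itertools, on a made-up set:

```python
from itertools import combinations

people = {"Holmes", "Watson", "Lestrade"}
# sort each pair alphabetically so A -- B and B -- A count as the same edge
pairs = sorted(" -- ".join(sorted(pair)) for pair in combinations(people, 2))
# pairs == ['Holmes -- Lestrade', 'Holmes -- Watson', 'Lestrade -- Watson']
```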
from collections import defaultdict
edgesDictionary = defaultdict(int)
for people in multi_people:
    for personA in people:
        for personB in people:
            # the alphabetical comparison counts each pair only once
            if personA < personB:
                edgesDictionary[personA + " -- " + personB] += 1
edgesFreqs = nltk.FreqDist(edgesDictionary)
edgesFreqs.most_common(5) # have a peek
[('Holmes -- Watson', 19), ('Holmes -- Lestrade', 7), ('Holmes -- Hosmer Angel', 7), ('Holmes -- Lord St. Simon', 6), ('Doctor -- Holmes', 6)]
Now that we have relationships and counts, we can graph the results (see Getting Graphical and Topic Modelling for more information on this).
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
G = nx.Graph()
plt.figure(figsize=(10,10))
# create graph edges (node pairs) and keep track of edges for each count
edges = defaultdict(list)
for names, count in edgesFreqs.most_common():
    if count > 1:
        parts = names.split(" -- ")
        G.add_edge(parts[0], parts[1], width=count)
        edges[count].append((parts[0], parts[1]))
    else:
        # most_common() is sorted by count, so we can stop at the first singleton
        break
# draw labels (nx.draw(G) doesn't really work)
pos = nx.spring_layout(G)
nx.draw_networkx_labels(G, pos)
# draw edges with different widths
for count, edgelist in edges.items():
    nx.draw_networkx_edges(G, pos, edgelist=edgelist, width=count, alpha=0.1)
plt.axis('off')
plt.show()
Fun, though of course we'd probably prefer an interactive graph, which is beyond the scope of this notebook :).