Full repo here: https://github.com/arnicas/NLP-in-Python
%matplotlib inline
import itertools
import math
import matplotlib.pyplot as plt
# also, down below, we use pattern, numpy, and scipy
Term Frequency: Number of appearances of a word in a document (the counts we saw already)
Document Frequency: Number of documents that contain a word in a set of docs
TF-IDF is Term Frequency times Inverse Document Frequency (i.e., it divides by the Document Frequency, inside a log), with some extra fiddles.
Example from Manning, Raghavan, and Schuetze showing IDF of a rare term is high:
TF-IDF for a word and document is usually calculated as:
(Word t's frequency in the doc) * Log( Number of Docs / Number of docs that contain the word t)
In practice it is usually computed with a +1 term or two, to smooth the counts and avoid dividing by zero. You can consider it an information measure for document words (or "features") in a bag-of-words style analysis, where the order of the words doesn't matter, just the set of words. It is a "weight" for a word. Some features of TF-IDF:
See the discussion in Manning, Raghavan, and Schuetze, and even more math in Wikipedia. Depending on implementation, TF-IDF may or may not be normalized. Always check to see if the implementation you use cleans stopwords or not and decide if you like that.
Some more python references:
In other languages than Python:
# code example from Building Machine Learning Systems with Python (Richert & Coelho)
# - modified slightly by Lynn
import math
def tfidf(t, d, D):
    tf = float(d.count(t)) / sum(d.count(w) for w in set(d))  # normalized
    # Note this version doesn't use a +1 in the denominator.
    idf = math.log(float(len(D)) / len([doc for doc in D if t in doc]))
    return tf * idf
a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"] # try adding another c to the last doc!
D = [a, abb, abc]
print(tfidf("a", a, D)) # a is in all of them
print(tfidf("a", abc, D)) # a is in all of them
print(tfidf("b", abc, D)) # b occurs only once here, but in 2 docs
print(tfidf("b", abb, D)) # b occurs more frequently in this doc
print(tfidf("c", abc, D)) # c is unique in the doc set
0.0
0.0
0.135155036036
0.270310072072
0.366204096223
What if you change some of those docs, or add another one? Add another c in the last doc, e.g.
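The "+1 term or two" mentioned earlier can be sketched like this — a hypothetical smoothed variant (not the book's code; the function name `tfidf_smooth` and the exact smoothing choice are my own), which adds 1 to the IDF denominator so a term that appears in no document can't cause a division by zero, and normalizes TF by document length:

```python
import math

def tfidf_smooth(t, d, D):
    # TF: term count normalized by document length
    tf = float(d.count(t)) / len(d)
    # DF with add-one smoothing: a term in zero docs still gives a finite IDF
    df = len([doc for doc in D if t in doc])
    idf = math.log(float(len(D)) / (1 + df))
    return tf * idf

a, abb, abc = ["a"], ["a", "b", "b"], ["a", "b", "c"]
D = [a, abb, abc]
print(tfidf_smooth("c", abc, D))  # log(3/2)/3 -- smaller than the unsmoothed score
print(tfidf_smooth("z", abc, D))  # 0.0 -- an unseen term no longer crashes
```

One quirk of this particular smoothing: a term in every document now gets a slightly negative IDF (log(N/(N+1))), which some implementations avoid with a +1 outside the log instead.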
Install: pip install pattern. Read the documentation here for the vector package: http://www.clips.ua.ac.be/pages/pattern-vector
from pattern.vector import Document, Model, TFIDF, TF, LEMMA, PORTER, COSINE, KMEANS, HIERARCHICAL
filelist = !ls data/stories/
filelist
['A_THE BELL.txt', 'A_THE DREAM OF LITTLE TUK.txt', 'A_THE ELDERBUSH.txt', "A_THE EMPEROR'S NEW CLOTHES.txt", 'A_THE FALSE COLLAR.txt', 'A_THE FIR TREE.txt', 'A_THE HAPPY FAMILY.txt', 'A_THE LEAP-FROG.txt', 'A_THE LITTLE MATCH GIRL.txt', 'A_THE NAUGHTY BOY.txt', 'A_THE OLD HOUSE.txt', 'A_THE REAL PRINCESS.txt', 'A_THE RED SHOES.txt', 'A_THE SHADOW.txt', 'A_THE SHOES OF FORTUNE.txt', 'A_THE SNOW QUEEN.txt', 'A_THE STORY OF A MOTHER.txt', 'A_THE SWINEHERD.txt', 'G_BEARSKIN.txt', 'G_BRIAR ROSE.txt', 'G_CATHERINE AND FREDERICK.txt', 'G_CINDERELLA.txt', 'G_DUMMLING AND THE THREE FEATHERS.txt', 'G_FAITHFUL JOHN.txt', 'G_HANSEL AND GRETHEL.txt', 'G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt', 'G_LITTLE RED-CAP.txt', 'G_LITTLE SNOW-WHITE.txt', 'G_MOTHER HOLLE.txt', 'G_OH, IF I COULD BUT SHIVER!.txt', 'G_RAPUNZEL.txt', 'G_RUMPELSTILTSKIN.txt', 'G_SNOW-WHITE AND ROSE-RED.txt', 'G_THE FROG PRINCE.txt', 'G_THE GOLDEN GOOSE.txt', 'G_THE GOOSE-GIRL.txt', 'G_THE LITTLE BROTHER AND SISTER.txt', 'G_THE SIX SWANS.txt', 'G_THE THREE LITTLE MEN IN THE WOOD.txt', 'G_THE TRAVELS OF TOM THUMB.txt', 'G_THE VALIANT LITTLE TAILOR.txt', 'G_THE WATER OF LIFE.txt', 'G_THUMBLING.txt']
# Load in the stories...
def load_texts(filenames, dirpath):
    """ filenames are the leaves, dirpath is the path to them with the / """
    loaded_text = {}
    for filen in filenames:
        with open(dirpath + filen) as handle:
            loaded_text[filen] = handle.read()
    return loaded_text
loaded_text = load_texts(filelist, 'data/stories/')
loaded_text.items()[0]
('A_THE REAL PRINCESS.txt', 'THE REAL PRINCESS\r\n\r\nThere was once a Prince who wished to marry a Princess; but then she\r\nmust be a real Princess. He travelled all over the world in hopes of\r\nfinding such a lady; but there was always something wrong. Princesses he\r\nfound in plenty; but whether they were real Princesses it was impossible\r\nfor him to decide, for now one thing, now another, seemed to him not\r\nquite right about the ladies. At last he returned to his palace quite\r\ncast down, because he wished so much to have a real Princess for his\r\nwife.\r\n\r\nOne evening a fearful tempest arose, it thundered and lightened, and the\r\nrain poured down from the sky in torrents: besides, it was as dark as\r\npitch. All at once there was heard a violent knocking at the door, and\r\nthe old King, the Prince\'s father, went out himself to open it.\r\n\r\nIt was a Princess who was standing outside the door. What with the rain\r\nand the wind, she was in a sad condition; the water trickled down from\r\nher hair, and her clothes clung to her body. She said she was a real\r\nPrincess.\r\n\r\n"Ah! we shall soon see that!" thought the old Queen-mother; however, she\r\nsaid not a word of what she was going to do; but went quietly into the\r\nbedroom, took all the bed-clothes off the bed, and put three little peas\r\non the bedstead. She then laid twenty mattresses one upon another over\r\nthe three peas, and put twenty feather beds over the mattresses.\r\n\r\nUpon this bed the Princess was to pass the night.\r\n\r\nThe next morning she was asked how she had slept. "Oh, very badly\r\nindeed!" she replied. "I have scarcely closed my eyes the whole night\r\nthrough. I do not know what was in my bed, but I had something hard\r\nunder me, and am all over black and blue. It has hurt me so much!"\r\n\r\nNow it was plain that the lady must be a real Princess, since she had\r\nbeen able to feel the three little peas through the twenty mattresses\r\nand twenty feather beds. 
None but a real Princess could have had such a\r\ndelicate sense of feeling.\r\n\r\nThe Prince accordingly made her his wife; being now convinced that he\r\nhad found a real Princess. The three peas were however put into the\r\ncabinet of curiosities, where they are still to be seen, provided they\r\nare not lost.\r\n\r\nWasn\'t this a lady of real delicacy?\r\n\r\n\r\n\r\n\r\n')
def make_pattern_docs(texts):
    """ texts is a dictionary! key is the name of text or filename """
    from pattern.vector import Document
    docs = []
    # Create a pattern.vector Document object for each article, and lemmatize as it goes in
    for key, val in texts.iteritems():
        typestring = key[0]  # will be a G or A, for Grimms or Andersen
        docs.append(Document(val, name=key, type=typestring, stemmer=LEMMA))
    return docs
docs = make_pattern_docs(loaded_text)
docs[1]
Document(id='P46Xguc-2', name='G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt', type='G')
docs[1].keywords() # normalized counts in the document (TF)
[(0.045454545454545636, u'sister'), (0.020661157024793472, u'table'), (0.017906336088154343, u'goat'), (0.017906336088154343, u'tree'), (0.016528925619834777, u'little'), (0.015151515151515213, u'eye'), (0.015151515151515213, u'knight'), (0.015151515151515213, u'mother'), (0.013774104683195648, u'morning'), (0.013774104683195648, u'soon')]
sorted(docs[1].features)[0:10] # the words = features
[u'accompany', u'according', u'admire', u'advice', u'afterward', u'ah', u'air', u'ala', u'alm', u'angry']
# the normalized vector for the word occurrences in this document -
# these scores are the same as the keywords above.
docs[1].vector['sister']
0.045454545454545456
# TF-IDF is a property of the doc set. The "Model" object handles operations across the doc set.
mtfidf = Model(documents=docs, weight=TFIDF)
mtfidf.documents
[Document(id='P46Xguc-1', name='A_THE REAL PRINCESS.txt', type='A'), Document(id='P46Xguc-2', name='G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt', type='G'), Document(id='P46Xguc-3', name='A_THE HAPPY FAMILY.txt', type='A'), Document(id='P46Xguc-4', name='G_THE FROG PRINCE.txt', type='G'), Document(id='P46Xguc-5', name='G_MOTHER HOLLE.txt', type='G'), Document(id='P46Xguc-6', name='G_THE VALIANT LITTLE TAILOR.txt', type='G'), Document(id='P46Xguc-7', name='A_THE SHOES OF FORTUNE.txt', type='A'), Document(id='P46Xguc-8', name='G_OH, IF I COULD BUT SHIVER!.txt', type='G'), Document(id='P46Xguc-9', name='A_THE STORY OF A MOTHER.txt', type='A'), Document(id='P46Xguc-10', name='G_THE TRAVELS OF TOM THUMB.txt', type='G'), Document(id='P46Xguc-11', name='G_HANSEL AND GRETHEL.txt', type='G'), Document(id='P46Xguc-12', name='A_THE SWINEHERD.txt', type='A'), Document(id='P46Xguc-13', name="A_THE EMPEROR'S NEW CLOTHES.txt", type='A'), Document(id='P46Xguc-14', name='G_THUMBLING.txt', type='G'), Document(id='P46Xguc-15', name='G_DUMMLING AND THE THREE FEATHERS.txt', type='G'), Document(id='P46Xguc-16', name='G_RUMPELSTILTSKIN.txt', type='G'), Document(id='P46Xguc-17', name='G_BEARSKIN.txt', type='G'), Document(id='P46Xguc-18', name='G_CINDERELLA.txt', type='G'), Document(id='P46Xguc-19', name='A_THE FIR TREE.txt', type='A'), Document(id='P46Xguc-20', name='G_LITTLE RED-CAP.txt', type='G'), Document(id='P46Xguc-21', name='A_THE LEAP-FROG.txt', type='A'), Document(id='P46Xguc-22', name='G_THE GOLDEN GOOSE.txt', type='G'), Document(id='P46Xguc-23', name='A_THE BELL.txt', type='A'), Document(id='P46Xguc-24', name='A_THE ELDERBUSH.txt', type='A'), Document(id='P46Xguc-25', name='G_THE SIX SWANS.txt', type='G'), Document(id='P46Xguc-26', name='A_THE RED SHOES.txt', type='A'), Document(id='P46Xguc-27', name='G_CATHERINE AND FREDERICK.txt', type='G'), Document(id='P46Xguc-28', name='G_LITTLE SNOW-WHITE.txt', type='G'), Document(id='P46Xguc-29', name='A_THE FALSE COLLAR.txt', 
type='A'), Document(id='P46Xguc-30', name='G_RAPUNZEL.txt', type='G'), Document(id='P46Xguc-31', name='G_THE LITTLE BROTHER AND SISTER.txt', type='G'), Document(id='P46Xguc-32', name='A_THE DREAM OF LITTLE TUK.txt', type='A'), Document(id='P46Xguc-33', name='A_THE SNOW QUEEN.txt', type='A'), Document(id='P46Xguc-34', name='G_THE GOOSE-GIRL.txt', type='G'), Document(id='P46Xguc-35', name='G_SNOW-WHITE AND ROSE-RED.txt', type='G'), Document(id='P46Xguc-36', name='A_THE SHADOW.txt', type='A'), Document(id='P46Xguc-37', name='A_THE OLD HOUSE.txt', type='A'), Document(id='P46Xguc-38', name='G_THE WATER OF LIFE.txt', type='G'), Document(id='P46Xguc-39', name='G_FAITHFUL JOHN.txt', type='G'), Document(id='P46Xguc-40', name='G_THE THREE LITTLE MEN IN THE WOOD.txt', type='G'), Document(id='P46Xguc-41', name='G_BRIAR ROSE.txt', type='G'), Document(id='P46Xguc-42', name='A_THE LITTLE MATCH GIRL.txt', type='A'), Document(id='P46Xguc-43', name='A_THE NAUGHTY BOY.txt', type='A')]
mtfidf.document_frequency('sister')
0.3488372093023256
mtfidf.inverse_document_frequency('sister')
1.0531506229959813
doc1 = mtfidf.document(name='G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt') # or:
# equivalent:
doc1 = mtfidf.documents[1]
doc1.term_frequency('sister') # note this is same as doing it above on the doc object!
0.045454545454545456
doc1.tf_idf('sister')
0.047870482863453696
mtfidf.documents[4].tf('sister')
0.003105590062111801
mtfidf.documents[4].tf_idf('sister')
0.0032706541086831714
Each document is a collection of weighted words, which we'll call a vector. Vectors can be compared to each other, to compute similarity. A common metric is "cosine similarity." Image from Manning, Raghavan and Schuetze:
Reminder: vectors pointing in nearly the same direction have a cosine near 1; vectors at right angles have a cosine near 0. So cosine itself is a similarity (1 = same direction), and 1 - cos behaves as a distance (higher = further apart). Pattern's similarity() returns a score where higher means more similar; subtract it from 1 when you need a distance (as we do for the dendrograms below).
Another metric, perhaps simpler to understand, is Euclidean distance (image from this article):
This is essentially the length of the difference vector — the hypotenuse, if you draw the two vectors as sides of a triangle. Larger numbers = vectors further apart!
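To make both metrics concrete, here is a minimal pure-Python sketch (no library assumed) of cosine similarity and Euclidean distance on two toy vectors:

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    # length of the difference vector -- the "hypotenuse"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1.0, 0.0], [1.0, 1.0]
print(cosine_similarity(u, v))      # ~0.7071 (the vectors are 45 degrees apart)
print(1 - cosine_similarity(u, v))  # cosine distance: higher = further apart
print(euclidean_distance(u, v))     # 1.0
```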
Links:
# Taken from the pattern.vec doc page: http://www.clips.ua.ac.be/pages/pattern-vector
from pattern.vector import Document, Model
d0 = Document('A tiger is a big yellow cat with stripes.', type='tiger')
d1 = Document('A lion is a big yellow cat with manes.', type='lion',)
d2 = Document('An elephant is a big grey animal with a slurf.', type='elephant')
d3 = Document('An elephant is an animal.', type='elephant')
print "Before model, vector for d1:", d1.vector
simple = Model(documents=[d0, d1, d2, d3], weight=TFIDF)
print "After model, vector for d1:", d1.vector # vector now weighted according to document collection!
print
print "Tiger vs lion text similarity:", simple.similarity(d0, d1) # tiger vs. lion, 1-cosine
print "Tiger vs. elephant text similarity:", simple.similarity(d0, d2) # tiger vs. elephant, 1-cosine
print "Elephant 1 vs. Elephant 2 similarity:", simple.similarity(d2, d3)
Before model, vector for d1: {u'lion': 0.25, u'manes': 0.25, u'yellow': 0.25, u'cat': 0.25}
After model, vector for d1: {u'lion': 0.34657382340379694, u'manes': 0.34657382340379694, u'yellow': 0.17328691170189847, u'cat': 0.17328691170189847}

Tiger vs lion text similarity: 0.2
Tiger vs. elephant text similarity: 0.0
Elephant 1 vs. Elephant 2 similarity: 0.4472135955
Notice above that the document vectors changed after the model was created, even outside the model context. Be alert to this (I'm not sure I like it, personally.)
I'm going to save the simple model out for use later on...
# this exports the array of tf-idf, but with some extra stuff we can parse out. Will be large for real data.
simple.export('data/csv/simple_tfidf.tsv')
mtfidf.similarity(docs[1], docs[1]) # similarity to self is 1.
1.0000000000000002
mtfidf.similarity(mtfidf.docs[1], mtfidf.docs[6]) # try some different docs
0.061193610958748514
# check what that was
mtfidf.docs[6]
Document(id='P46Xguc-7', name='A_THE SHOES OF FORTUNE.txt', type='A')
docs[1]
Document(id='P46Xguc-2', name='G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt', type='G')
mtfidf.neighbors(docs[1]) # finds the closest matches in similarity
[(0.13327973883571262, Document(id='P46Xguc-31', name='G_THE LITTLE BROTHER AND SISTER.txt', type='G')), (0.09115800288880449, Document(id='P46Xguc-40', name='G_THE THREE LITTLE MEN IN THE WOOD.txt', type='G')), (0.07103621399884713, Document(id='P46Xguc-25', name='G_THE SIX SWANS.txt', type='G')), (0.06968168677071662, Document(id='P46Xguc-22', name='G_THE GOLDEN GOOSE.txt', type='G')), (0.0647975008434651, Document(id='P46Xguc-18', name='G_CINDERELLA.txt', type='G')), (0.06260321618889808, Document(id='P46Xguc-17', name='G_BEARSKIN.txt', type='G')), (0.061193610958748514, Document(id='P46Xguc-7', name='A_THE SHOES OF FORTUNE.txt', type='A')), (0.0597097979694477, Document(id='P46Xguc-35', name='G_SNOW-WHITE AND ROSE-RED.txt', type='G')), (0.05611577026931497, Document(id='P46Xguc-24', name='A_THE ELDERBUSH.txt', type='A')), (0.05292807171289243, Document(id='P46Xguc-19', name='A_THE FIR TREE.txt', type='A'))]
# Model.search() returns a sorted list of (similarity, Document)-tuples,
# based on a list of query words. A Document is created on-the-fly for the
# given list, using the given optional arguments.
mtfidf.search(['witch','girl','boy'])
[(0.12027675752727798, Document(id='P46Xguc-11', name='G_HANSEL AND GRETHEL.txt', type='G')), (0.10224370709271782, Document(id='P46Xguc-31', name='G_THE LITTLE BROTHER AND SISTER.txt', type='G')), (0.08604756672847662, Document(id='P46Xguc-16', name='G_RUMPELSTILTSKIN.txt', type='G')), (0.07588740938904971, Document(id='P46Xguc-24', name='A_THE ELDERBUSH.txt', type='A')), (0.07152974843469193, Document(id='P46Xguc-37', name='A_THE OLD HOUSE.txt', type='A')), (0.05359044997199823, Document(id='P46Xguc-43', name='A_THE NAUGHTY BOY.txt', type='A')), (0.05015018254228086, Document(id='P46Xguc-40', name='G_THE THREE LITTLE MEN IN THE WOOD.txt', type='G')), (0.039382097289471396, Document(id='P46Xguc-8', name='G_OH, IF I COULD BUT SHIVER!.txt', type='G')), (0.03694252146876888, Document(id='P46Xguc-5', name='G_MOTHER HOLLE.txt', type='G')), (0.03499167423176321, Document(id='P46Xguc-23', name='A_THE BELL.txt', type='A'))]
Try your own searches now!
# You can do hierarchical clustering right inside pattern, without having to use scipy for it.
# k is the number of "clusters" you want to produce
hier = mtfidf.cluster(method=HIERARCHICAL, k=5)
hier.depth
31
# Get a giant listing of the Cluster objects in the tree structure.
# Doesn't seem to be a built in tool to vis them, though!
hier
Cluster([Document(id='P46Xguc-3', name='A_THE HAPPY FAMILY.txt', type='A'), Document(id='P46Xguc-29', name='A_THE FALSE COLLAR.txt', type='A'), Document(id='P46Xguc-27', name='G_CATHERINE AND FREDERICK.txt', type='G'), Document(id='P46Xguc-30', name='G_RAPUNZEL.txt', type='G'), Cluster([Document(id='P46Xguc-36', name='A_THE SHADOW.txt', type='A'), Cluster([Document(id='P46Xguc-21', name='A_THE LEAP-FROG.txt', type='A'), Cluster([Document(id='P46Xguc-11', name='G_HANSEL AND GRETHEL.txt', type='G'), Cluster([Document(id='P46Xguc-43', name='A_THE NAUGHTY BOY.txt', type='A'), Cluster([Cluster([Document(id='P46Xguc-12', name='A_THE SWINEHERD.txt', type='A'), Document(id='P46Xguc-13', name="A_THE EMPEROR'S NEW CLOTHES.txt", type='A')]), Cluster([Document(id='P46Xguc-1', name='A_THE REAL PRINCESS.txt', type='A'), Cluster([Document(id='P46Xguc-15', name='G_DUMMLING AND THE THREE FEATHERS.txt', type='G'), Cluster([Document(id='P46Xguc-5', name='G_MOTHER HOLLE.txt', type='G'), Cluster([Document(id='P46Xguc-39', name='G_FAITHFUL JOHN.txt', type='G'), Cluster([Document(id='P46Xguc-32', name='A_THE DREAM OF LITTLE TUK.txt', type='A'), Cluster([Document(id='P46Xguc-4', name='G_THE FROG PRINCE.txt', type='G'), Cluster([Cluster([Document(id='P46Xguc-6', name='G_THE VALIANT LITTLE TAILOR.txt', type='G'), Document(id='P46Xguc-10', name='G_THE TRAVELS OF TOM THUMB.txt', type='G')]), Cluster([Document(id='P46Xguc-42', name='A_THE LITTLE MATCH GIRL.txt', type='A'), Cluster([Cluster([Document(id='P46Xguc-14', name='G_THUMBLING.txt', type='G'), Document(id='P46Xguc-20', name='G_LITTLE RED-CAP.txt', type='G')]), Cluster([Document(id='P46Xguc-19', name='A_THE FIR TREE.txt', type='A'), Cluster([Document(id='P46Xguc-22', name='G_THE GOLDEN GOOSE.txt', type='G'), Cluster([Document(id='P46Xguc-34', name='G_THE GOOSE-GIRL.txt', type='G'), Cluster([Document(id='P46Xguc-33', name='A_THE SNOW QUEEN.txt', type='A'), Cluster([Document(id='P46Xguc-8', name='G_OH, IF I COULD BUT SHIVER!.txt', 
type='G'), Cluster([Document(id='P46Xguc-41', name='G_BRIAR ROSE.txt', type='G'), Cluster([Document(id='P46Xguc-2', name='G_LITTLE ONE-EYE, TWO-EYES AND THREE-EYES.txt', type='G'), Cluster([Document(id='P46Xguc-9', name='A_THE STORY OF A MOTHER.txt', type='A'), Cluster([Document(id='P46Xguc-24', name='A_THE ELDERBUSH.txt', type='A'), Cluster([Cluster([Document(id='P46Xguc-37', name='A_THE OLD HOUSE.txt', type='A'), Document(id='P46Xguc-17', name='G_BEARSKIN.txt', type='G')]), Cluster([Cluster([Document(id='P46Xguc-31', name='G_THE LITTLE BROTHER AND SISTER.txt', type='G'), Cluster([Document(id='P46Xguc-25', name='G_THE SIX SWANS.txt', type='G'), Cluster([Document(id='P46Xguc-40', name='G_THE THREE LITTLE MEN IN THE WOOD.txt', type='G'), Cluster([Document(id='P46Xguc-38', name='G_THE WATER OF LIFE.txt', type='G'), Cluster([Document(id='P46Xguc-35', name='G_SNOW-WHITE AND ROSE-RED.txt', type='G'), Cluster([Document(id='P46Xguc-28', name='G_LITTLE SNOW-WHITE.txt', type='G'), Document(id='P46Xguc-16', name='G_RUMPELSTILTSKIN.txt', type='G')])])])])])]), Cluster([Document(id='P46Xguc-23', name='A_THE BELL.txt', type='A'), Cluster([Document(id='P46Xguc-7', name='A_THE SHOES OF FORTUNE.txt', type='A'), Cluster([Document(id='P46Xguc-18', name='G_CINDERELLA.txt', type='G'), Document(id='P46Xguc-26', name='A_THE RED SHOES.txt', type='A')])])])])])])])])])])])])])])])])])])])])])])])])])])])])])
# Look at some of the functions on hier...
hier
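Since there's no built-in visualizer, a small recursive walk can at least render the tree as indented text. This is a hedged sketch assuming the Cluster object nests like a plain list of lists (which is how it prints above), with Document leaves shown via str(); the helper name `tree_lines` is my own:

```python
def tree_lines(node, indent=0):
    """Flatten a nested list-of-lists cluster structure into indented lines."""
    lines = []
    if isinstance(node, list):  # a Cluster (or any list) -- recurse into children
        for child in node:
            lines.extend(tree_lines(child, indent + 1))
    else:                       # a leaf, e.g. a Document
        lines.append("  " * indent + str(node))
    return lines

# toy nested structure standing in for hier:
for line in tree_lines([["A_THE BELL.txt", "G_CINDERELLA.txt"], "G_RAPUNZEL.txt"]):
    print(line)
```

On the real model you would call `tree_lines(hier)` — deeper nesting means the documents were merged later in the clustering.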
import csv
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
def read_weka_tfidf(filen):
    """ Read in the Weka file output by pattern's model export and just keep tfidf scores. """
    rows = []
    with open(filen, 'rb') as csvfile:
        spamreader = csv.reader(csvfile, delimiter='\t')
        count = 0
        for row in spamreader:
            # skipping first row, which is the word labels
            if count > 0:
                rows.append(row[:-2])  # skip extra junk in the last 2 cols
            count += 1
    return rows
simplerows = read_weka_tfidf('data/csv/simple_tfidf.tsv')
simplerows
[['0', '0.1733', '0', '0', '0', '0', '0', '0.3466', '0.3466', '0.1733'], ['0', '0.1733', '0', '0', '0.3466', '0.3466', '0', '0', '0', '0.1733'], ['0.1733', '0', '0.1733', '0.3466', '0', '0', '0.3466', '0', '0', '0'], ['0.3466', '0', '0.3466', '0', '0', '0', '0', '0', '0', '0']]
Scipy's pdist computes pairwise distances - see http://docs.scipy.org/doc/scipy/reference/spatial.distance.html. You can use cosine here as well, or a host of other options.
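A note on the shape: pdist returns a "condensed" vector with one entry per pair (i, j) with i < j, not a square matrix (scipy's squareform converts between the two forms). A pure-Python sketch of the cosine version, just to show the pair ordering (the helper names here are my own, not scipy's):

```python
import math
from itertools import combinations

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pdist_cosine(rows):
    # condensed form: pairs in the order (0,1), (0,2), ..., (1,2), ... like scipy's pdist
    return [cosine_dist(rows[i], rows[j]) for i, j in combinations(range(len(rows)), 2)]

rows = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(pdist_cosine(rows))  # 3 rows -> n*(n-1)/2 = 3 pairwise distances
```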
dist = pdist(simplerows, metric='cosine') # look at the manpage and pick a different measure to try
linkage(dist)
array([[ 2.       ,  3.       ,  0.5527864,  2.       ],
       [ 0.       ,  1.       ,  0.8      ,  2.       ],
       [ 4.       ,  5.       ,  1.       ,  4.       ]])
from pylab import rcParams
rcParams['figure.figsize'] = 6, 5
dendrogram(linkage(dist)) # this plotting function has a ton of things you can manipulate if you look at the docs.
{'color_list': ['g', 'b', 'b'], 'dcoord': [[0.0, 0.55278640450004202, 0.55278640450004202, 0.0], [0.0, 0.79999999999999993, 0.79999999999999993, 0.0], [0.55278640450004202, 1.0, 1.0, 0.79999999999999993]], 'icoord': [[5.0, 5.0, 15.0, 15.0], [25.0, 25.0, 35.0, 35.0], [10.0, 10.0, 30.0, 30.0]], 'ivl': ['2', '3', '0', '1'], 'leaves': [2, 3, 0, 1]}
# Reminder:
print "d0", d0.words
print "d1", d1.words
print "d2", d2.words
print "d3", d3.words
# show the distances, which are used to get the hierarchy:
print "d2, d3 distance", 1-simple.similarity(d2,d3)
print "d0, d1 distance", 1-simple.similarity(d0,d1)
d0 {u'tiger': 1, u'stripes': 1, u'yellow': 1, u'cat': 1}
d1 {u'lion': 1, u'manes': 1, u'yellow': 1, u'cat': 1}
d2 {u'slurf': 1, u'grey': 1, u'animal': 1, u'elephant': 1}
d3 {u'animal': 1, u'elephant': 1}
d2, d3 distance 0.5527864045
d0, d1 distance 0.8
mtfidf.export('data/csv/fairy.tsv')
fairyrows = read_weka_tfidf('data/csv/fairy.tsv')
len(fairyrows)
43
len(fairyrows[0]) # words in the vector
5292
def make_dend(data, labels=None, height=6):
    from pylab import rcParams
    dist = pdist(data, metric='cosine')
    link = linkage(dist, method='complete')
    rcParams['figure.figsize'] = 6, height
    rcParams['axes.labelsize'] = 5
    if not labels:
        dend = dendrogram(link, orientation='right')
    else:
        dend = dendrogram(link, orientation='right', labels=labels)
    return dist
# if you want to label by doc names
names = [doc.name for doc in docs]
dist = make_dend(fairyrows, height=15)
1-mtfidf.similarity(docs[12], docs[11])
0.7700672032114109
1-mtfidf.similarity(docs[25], docs[17])
0.7895033309691184
# Code borrowed from: http://nbviewer.ipython.org/github/OxanaSachenkova/hclust-python/blob/master/hclust.ipynb
def make_heatmap_matrix(dist, method='complete'):
    """ Pass in the distance matrix; method options are complete or single """
    # Compute and plot the first dendrogram.
    fig = plt.figure(figsize=(10, 10))
    # axes rect is [x, y, width, height]
    ax1 = fig.add_axes([0.05, 0.1, 0.2, 0.6])
    Y = linkage(dist, method=method)
    Z1 = dendrogram(Y, orientation='right')
    ax1.set_xticks([])
    # Compute and plot the second dendrogram.
    ax2 = fig.add_axes([0.3, 0.71, 0.6, 0.2])
    Z2 = dendrogram(Y)
    ax2.set_xticks([])
    ax2.set_yticks([])
    # Compute and plot the heatmap, reordering rows/cols to match the dendrogram leaves.
    axmatrix = fig.add_axes([0.3, 0.1, 0.6, 0.6])
    idx1 = Z1['leaves']
    idx2 = Z2['leaves']
    D = squareform(dist)
    D = D[idx1, :]
    D = D[:, idx2]
    im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
    axmatrix.set_xticks([])
    axmatrix.set_yticks([])
    # Plot the colorbar.
    axcolor = fig.add_axes([0.91, 0.1, 0.02, 0.6])
    plt.colorbar(im, cax=axcolor)
make_heatmap_matrix(dist, method='complete')
books = !ls data/books
books
['anderson.txt', 'grimms.txt', 'irishfairy.txt', 'lovecraft.txt', 'mrjames.txt', 'poe.txt']
booktexts = load_texts(books, 'data/books/')
bookdocs = make_pattern_docs(booktexts)
booktfidf = Model(documents=bookdocs, weight=TFIDF)
booktfidf.docs
[Document(id='P46Xguc-49', name='lovecraft.txt', type='l'), Document(id='P46Xguc-50', name='irishfairy.txt', type='i'), Document(id='P46Xguc-51', name='poe.txt', type='p'), Document(id='P46Xguc-52', name='anderson.txt', type='a'), Document(id='P46Xguc-53', name='grimms.txt', type='g'), Document(id='P46Xguc-54', name='mrjames.txt', type='m')]
booknames = [doc.name for doc in booktfidf.docs]
booktfidf.export('data/csv/books_tfidf.tsv')
bookweights = read_weka_tfidf('data/csv/books_tfidf.tsv')
dist = make_dend(bookweights, labels=booknames)
make_heatmap_matrix(dist, method='complete')
kmeans = mtfidf.cluster(method=KMEANS, k=5)
from pattern.vector import centroid
import operator
# For each cluster centroid, look at its features sorted by weight.
# (Note: this ascending sort shows the LOWEST-weighted features first; add
# reverse=True to the sorted() call to see the most important ones instead.)
for i in range(5):
    print i
    print sorted(centroid(kmeans[i]).items(), key=operator.itemgetter(1))[0:10]
    print
0
[(u'angry', 0.00013166062899110716), (u'kingdom', 0.00013166062899110716), (u'directly', 0.00014974707549083753), (u'killed', 0.000159956090569813), (u'lifted', 0.0001711388013424306), (u'filled', 0.0001711388013424306), (u'happen', 0.0001711388013424306), (u'field', 0.00018069944297765), (u'power', 0.00018350071458059924), (u'creature', 0.00018350071458059924)]

1
[(u'pas', 2.976767638983525e-05), (u'kept', 3.171778571726749e-05), (u'thank', 3.8534321488072485e-05), (u'fallen', 4.122829938216276e-05), (u'aside', 4.753553535300498e-05), (u'built', 4.753553535300498e-05), (u'laugh', 4.753553535300498e-05), (u'take', 5.130985012857345e-05), (u'tower', 5.130985012857345e-05), (u'milk', 5.130985012857345e-05)]

2
[(u'stone', 2.7289350720252417e-05), (u'bush', 3.307147399482816e-05), (u'bottom', 3.5326123547017947e-05), (u'coat', 3.5326123547017947e-05), (u'promised', 3.7795812703190714e-05), (u'follow', 3.7795812703190714e-05), (u'fallen', 3.7795812703190714e-05), (u'hundred', 3.7795812703190714e-05), (u'straight', 3.7795812703190714e-05), (u'meant', 3.7795812703190714e-05)]

3
[(u'jumped', 0.00013121993719606768), (u'clothe', 0.00013809336247776646), (u'afraid', 0.00013979242907895495), (u'rich', 0.0001565430829707474), (u'fly', 0.00016324393598442835), (u'charming', 0.00016324393598442835), (u'joy', 0.00016676991539243746), (u'chamber', 0.00016698564596595334), (u'danced', 0.00017416680191162777), (u'dres', 0.00018604063317774515)]

4
[(u'glad', 0.00015192739693528455), (u'quickly', 0.00016198455051392364), (u'sitting', 0.00016593539320938522), (u'standing', 0.00017261670309720806), (u'tear', 0.00017261670309720806), (u'able', 0.0002006765668818562), (u'light', 0.0002031540553618791), (u'led', 0.00020873205745744166), (u'hard', 0.00020873205745744166), (u'warm', 0.00020873205745744166)]
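Getting the top-weighted features out of a centroid is just a key-function sort with reverse=True. A toy sketch with made-up weights standing in for a centroid dict:

```python
import operator

weights = {'witch': 0.004, 'girl': 0.003, 'broom': 0.001}  # hypothetical feature weights
top = sorted(weights.items(), key=operator.itemgetter(1), reverse=True)
print(top[:2])  # [('witch', 0.004), ('girl', 0.003)]
```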
Relevant links: