This is one of a series of notebooks (in progress) documenting my learning, which I hope will also help others learn machine learning. I would love suggestions, corrections, and feedback on these notebooks.
Visit my webpage for more.
Email me: email.ryan.kelly@gmail.com
I'd love for you to share if you liked this post.
Topic modeling enables classification of objects into several categories, unlike other clustering techniques that assign each observation to a single final cluster. For example, this notebook could be about both the Python programming language and machine learning at the same time.
LDA, which shares an acronym but no similarities with linear discriminant analysis, is a topic modeling method, and the foundation behind many other topic modeling algorithms. We focus on understanding the algorithm at a high level.
The problems we are interested in solving are those where the topics are unknown. We want to take a large body of text, discover what topics are present, and classify sub-documents within the corpus.
Our regular tool, scikit-learn, does not have LDA implemented, so we will be using the Gensim library. For data we use a set of news reports from the Associated Press. The data is available here.
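As an aside, the ap.dat file is in the LDA-C format that BleiCorpus reads: each line describes one document as a count of unique terms followed by term_id:count pairs, with the vocabulary file mapping ids to words. A minimal sketch of a parser for that format (the example lines here are made up for illustration):

```python
# Minimal parser for the LDA-C format read by BleiCorpus.
# Each line: "<num_unique_terms> <term_id>:<count> <term_id>:<count> ..."
def parse_ldac_line(line):
    parts = line.split()
    n = int(parts[0])
    pairs = [tuple(int(x) for x in p.split(':')) for p in parts[1:]]
    assert len(pairs) == n  # the leading count must match the pairs listed
    return pairs

# Two toy documents in LDA-C form (hypothetical, not from ap.dat)
example = ["3 0:2 4:1 7:5", "2 1:1 4:3"]
docs = [parse_ldac_line(line) for line in example]
print(docs[0])  # [(0, 2), (4, 1), (7, 5)]
```

Each parsed document is exactly the bag-of-words representation (word id, count) that Gensim's corpus objects yield when iterated.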
from gensim import corpora, models, similarities
corpus = corpora.BleiCorpus('/Users/ryankelly/anaconda/notebooks/Machine Learning/ap.dat',
                            '/Users/ryankelly/anaconda/notebooks/Machine Learning/vocab.txt')
model = models.ldamodel.LdaModel(
    corpus,
    num_topics=100,
    id2word=corpus.id2word,
    iterations=1000,
    alpha=1.0 / len(corpus))
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
Once the model is built, we can explore the topics. We can see the list of topics a document belongs to with model[doc]. The output is a list of (topic_index, topic_weight) tuples.
topics = [model[c] for c in corpus]
print topics[0]
[(4, 0.23314423511090068), (5, 0.018313654408029217), (6, 0.28300444969936761), (55, 0.05306253050587606), (57, 0.15568182595578678), (69, 0.027465586997892498), (78, 0.12406253738676359), (86, 0.016813172574823652), (91, 0.010707110824728267), (95, 0.063657270605180533)]
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.hist([len(t) for t in topics], np.arange(42))
plt.ylabel('Nr of documents')
plt.xlabel('Nr of topics')
Here we can see that most documents belong to about 6-9 topics, and very few documents have 20 or more topics. How many topics each document belongs to is largely governed by the alpha parameter: bigger values of alpha result in more topics per document. Generally alpha is smaller than 1.0; above we set it to 1.0 / len(corpus).
model2 = models.ldamodel.LdaModel(corpus, num_topics=100, id2word=corpus.id2word, alpha=1.0)
topics2 = [model2[c] for c in corpus]
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
plt.hist([[len(t) for t in topics],
          [len(t) for t in topics2]], np.arange(55))
plt.ylabel('Nr of documents')
plt.xlabel('Nr of topics')
plt.text(2,265, r'default alpha')
plt.text(20,186, 'alpha=1.0')
What exactly are these topics? Technically they are multinomial distributions over words, which means that each word in the vocabulary is given a probability. Words with high probability are more associated with that topic than words with lower probability. To summarize the topics we can present a list of the most highly weighted words.
Below we print the first two topics and the highest probability words.
for ti in xrange(2):
    words = model.show_topic(ti, 64)
    tf = sum(f for f, w in words)
    print('\n'.join('{}:{}'.format(w, int(1000. * f / tf)) for f, w in words))
    print('')
i:40 biography:39 iraqi:29 fighters:27 today:26 air:26 kuwait:24 news:21 targets:20 united:20 years:20 ring:18 people:18 animals:18 new:17 force:17 go:16 dont:16 back:16 theyd:16 sgt:16 canal:16 captivity:16 occupants:16 houston:15 col:15 two:15 year:14 think:14 iraq:14 hes:13 say:13 time:13 found:13 commander:13 statement:13 first:13 billion:13 proposal:13 states:13 reach:13 like:12 held:12 officials:12 three:11 household:11 forest:11 west:11 texas:11 wild:10 happened:10 ago:10 nations:10 home:10 reactions:10 havent:10 boat:9 early:9 give:9 know:9 saturday:9 told:9 guy:9 government:9

cents:80 cent:63 lower:62 higher:56 bushel:43 futures:38 corn:34 soybean:33 pound:23 soybeans:20 cattle:20 chicago:18 trading:18 december:17 grain:15 early:15 wheat:15 new:15 mixed:13 march:13 prices:13 livestock:12 delivery:11 board:11 soviet:11 opened:11 million:11 trade:10 february:10 live:10 inc:10 contract:9 settled:9 union:9 meat:9 oats:9 lespinasse:9 gallery:9 sharply:9 basements:8 chart:8 morning:8 pork:8 crop:8 york:8 november:7 purchases:7 shadyside:7 passengers:7 brazilian:7 market:7 barney:7 jim:7 american:6 frozen:6 rain:6 exchange:6 thursday:6 last:6 i:6 harvest:6 conditions:6 harris:5 flights:5
We can see that these are not simply random sets of words; they do share some relation. However, you might note that some of the words should perhaps be removed, like i in the first topic. In fact, all the stop words should be removed, and we would also want to process only the stems of words. This was all covered in a previous lecture on my website.
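As a sketch of that preprocessing step, before building the corpus we would drop stop words and reduce each token to a stem. The stop list and suffix rule below are deliberately tiny stand-ins for a real stop list and a real stemmer (e.g. NLTK's Porter stemmer):

```python
# Illustrative subset of a stop list; a real one has a few hundred entries
stop_words = {'i', 'the', 'a', 'of', 'and', 'to', 'in'}

def preprocess(doc):
    tokens = doc.lower().split()
    # Drop stop words, which carry no topical information
    kept = [t for t in tokens if t not in stop_words]
    # Naive plural stripping as a stand-in for a real stemmer
    return [t[:-1] if t.endswith('s') else t for t in kept]

print(preprocess("The fighters and the targets of Iraq"))
# ['fighter', 'target', 'iraq']
```

Running the tokenized output through something like corpora.Dictionary and doc2bow would then give a cleaner vocabulary than the one that produced the topics above.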
Topics are useful on their own to help discover insights about data. The small sets of words generated above are an effective way to summarize large documents. Yet topics are often a stepping stone, a means to another end. Now that we have an estimate of how much of each document comes from each topic, we can compare the documents in topic space. This means that instead of comparing word for word, we say that two documents are similar if they talk about the same topics.
This tends to be a powerful tool, as two text documents that share only a few words may actually refer to the same topic, just using different terms. For example, one document may refer to the President of the United States, while the other refers to Barack Obama.
To do this, we have to project the documents into topic space; thus we need a vector of topic weights that summarizes each document. Since the number of topics (100) is smaller than the number of possible words, we have reduced the dimensionality of the raw data. It is much faster to compute similarity between 100 topic weights than between vectors over an entire vocabulary with thousands of terms.
# Recall topic structure
print topics[0]
[(4, 0.23308640950051435), (5, 0.018272019956688972), (6, 0.28285256551584864), (55, 0.053032573175587634), (57, 0.15565479689666467), (69, 0.027623579560909248), (78, 0.12412801199967857), (86, 0.016733559808558972), (91, 0.010894691919647625), (95, 0.063639485250666261)]
# Store topic counts in an array and compute all pairwise distances
dense = np.zeros((len(topics), 100), float)
for ti, t in enumerate(topics):
    for tj, v in t:
        dense[ti, tj] = v
Now dense is a matrix of topic weights, and we can use the pdist function from SciPy to compute all pairwise distances.
from scipy.spatial import distance
pairwise = distance.squareform(distance.pdist(dense))
To eliminate the diagonal elements of the matrix, we just set them to a high value.
largest = pairwise.max()
for ti in range(len(topics)):
    pairwise[ti, ti] = largest + 1
Now for each document, we can look up the closest document easily.
def closest_to(doc_id):
    closest = pairwise[doc_id].argmin()
    return [topics[doc_id], topics[closest]]
closest_to(4)
[[(19, 0.019559047176289535), (26, 0.22558409595806497), (29, 0.16385213217097758), (48, 0.03248569993167285), (49, 0.11017411035025179), (81, 0.064558018455667651), (84, 0.12527768608580545), (88, 0.14974455920211141), (90, 0.0302340624513927), (91, 0.077923857552251133)], [(4, 0.083694002270392184), (26, 0.095885171844944286), (28, 0.029771282827320771), (49, 0.10535923834513226), (55, 0.14365187597272466), (74, 0.060767457436357503), (75, 0.053161625067594497), (86, 0.066717654726595244), (88, 0.11676744500542624), (95, 0.14454229765782822), (96, 0.099010828149528787)]]
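One design note: pdist defaults to Euclidean distance. For topic-weight vectors, cosine distance is a common alternative because it compares the mix of topics rather than their absolute magnitudes. A small sketch on a toy 3x4 matrix standing in for dense (the values are made up for illustration):

```python
import numpy as np
from scipy.spatial import distance

# Toy topic-weight matrix: 3 documents x 4 topics (hypothetical values)
toy = np.array([
    [0.7, 0.3, 0.0, 0.0],   # doc 0: mostly topics 0 and 1
    [0.6, 0.4, 0.0, 0.0],   # doc 1: similar mixture to doc 0
    [0.0, 0.0, 0.5, 0.5],   # doc 2: entirely different topics
])

# 'cosine' measures the angle between topic mixtures
pairwise_cos = distance.squareform(distance.pdist(toy, 'cosine'))

# Mask the diagonal so a document is never its own nearest neighbour,
# mirroring the trick used above with Euclidean distances
np.fill_diagonal(pairwise_cos, pairwise_cos.max() + 1)

print(pairwise_cos[0].argmin())  # 1: doc 1 is nearest to doc 0
```

Swapping the metric in the pairwise computation above is a one-word change, so it is easy to compare how the two distance measures rank neighbours on the real corpus.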