Topic Modeling of Twitter Followers

This Python 2 notebook is a companion to the blog post "Segmentation of Twitter Timelines via Topic Modeling", in which we explore a corpus of Twitter timelines belonging to followers of the @alexip account and compare the results obtained with Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA). Below are the LDA results for a set of 245 timelines.

Some of the best topics are:

etc ...

In [6]:
from gensim import corpora, models
import pyLDAvis.gensim

# Load the bag-of-words corpus (Matrix Market format) and its dictionary
corpus = corpora.MmCorpus('data/')
dictionary = corpora.Dictionary.load('data/alexip_followers_v3.dict')

# Load the trained LDA model and prepare the data for the visualization
lda = models.LdaModel.load('data/alexip_followers_v3_t40_p200_a001.lda')
followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

For best results, set the $\lambda$ parameter between 0.5 and 0.6. Lowering $\lambda$ gives more weight to words that are distinctive of a given topic, as opposed to words that are merely frequent in it.
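To make the role of $\lambda$ concrete: LDAvis ranks the terms of a topic by a relevance score, $\lambda \log p(w \mid t) + (1 - \lambda) \log \frac{p(w \mid t)}{p(w)}$, where $p(w \mid t)$ is the probability of word $w$ in topic $t$ and $p(w)$ is its overall probability in the corpus. The sketch below computes that score in plain Python. The function names and the toy probabilities are made up for illustration and are not taken from the notebook or from the model above.

```python
import math


def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic for a given lambda (0 <= lam <= 1)."""
    log_prob = math.log(p_word_given_topic)           # how frequent the word is in the topic
    log_lift = math.log(p_word_given_topic / p_word)  # how specific the word is to the topic
    return lam * log_prob + (1.0 - lam) * log_lift


def rank_terms(topic_probs, corpus_probs, lam):
    """Return the words of a topic sorted by decreasing relevance."""
    return sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )


# Hypothetical numbers: "the" is frequent everywhere, "gensim" is rare
# in the corpus but comparatively common in this topic.
topic_probs = {"the": 0.10, "data": 0.05, "gensim": 0.02}
corpus_probs = {"the": 0.20, "data": 0.02, "gensim": 0.001}

ranking_by_frequency = rank_terms(topic_probs, corpus_probs, 1.0)
ranking_by_lift = rank_terms(topic_probs, corpus_probs, 0.0)
```

With $\lambda = 1$ the ranking follows raw within-topic frequency, so the common word "the" comes first. With $\lambda = 0$ it follows the lift alone, so the topic-specific word "gensim" comes first. Intermediate values trade one off against the other.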

We use the amazing LDAvis package (via pyLDAvis) for this visualization. LDA was carried out with the Gensim package. The data is available as a gzipped JSON file (about 3 MB).
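A gzipped JSON file can be read with the standard library alone. This is only a minimal sketch: it assumes the archive holds a single JSON document, and the helper name and the file name in the usage comment are hypothetical, since the notebook does not describe the layout of the data.

```python
import gzip
import json


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file and return the decoded object."""
    with gzip.open(path, "rb") as handle:
        return json.loads(handle.read().decode("utf-8"))


# Usage (hypothetical file name):
# timelines = load_gzipped_json("data/timelines.json.gz")
```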