Analysis of Twitter stream data with the IPython Notebook

In this example, we use the IPython notebook to mine data from Twitter with the Twython library. Once we have fetched the raw stream for a specific query, we first do some basic word frequency analysis on the results using Python's built-in dictionaries, and then use the excellent NetworkX library, developed at Los Alamos National Laboratory, to look at the results as a network and understand some of its properties.

Using NetworkX, we aim to answer the following questions: for a given query, which words tend to appear together in tweets, and what global pattern of relationships between these words emerges from the entire set of results?

Obviously, the analysis of text corpora of this kind is a complex topic at the intersection of natural language processing, graph theory and statistics, and we do not pretend to cover it exhaustively here. Rather, we want to show how, with a small amount of easy-to-write code, it is possible to do a few non-trivial things based on real-time data from the Twitter stream. Hopefully this will serve as a good starting point; for further reading, you can find in-depth discussions of analysing social network data in Python in the book Mining the Social Web.

Initialization and libraries

We start by loading the pylab plot support and selecting our figure size to be a bit different than the automatic defaults.

In [4]:
%pylab inline
plt.rc('figure', figsize=(8, 5))
import networkx as nx
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

Now we load a local library with some analysis utilities whose code is a bit too long to display inline. The Python module is called text_utils.py and can be downloaded here.

In [5]:
import text_utils as tu  # shorthand for convenience

Finally, we'll need to use the free Twython library to query Twitter's stream:

In [6]:
from twython import Twython
# Create the main Twitter object we'll use later for all queries
twitter = Twython()

Query declaration

Here we define which query we want to perform, as well as which words we want to filter out from our analysis because they appear very commonly and we're not interested in them.

Typically you want to run the query once and, after seeing what comes out, fine-tune the removal list, since which words count as 'noise' is fairly query-specific (and also changes over time, depending on what's happening on Twitter):

In [7]:
query = "big data"
words_to_remove = """with some your just have from it's /via & that they your there this"""

Perform query to Twitter servers

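This is the cell that actually fetches data from Twitter. We request at most the first 30 pages of search results (typically Twitter stops returning results before that). Tweets whose text starts with 'RT ' are set aside as retweets and kept out of the main results list.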

In [8]:
n_pages = 30

results = []
retweets = []
for page in range(1, n_pages+1):
    search = twitter.search(q=query+' lang:en', page=str(page))
    res = search['results']
    if not res:
        print 'Stopping at page:', page
        break
        
    for t in res:
        if t['text'].startswith('RT '):
            retweets.append(t)
        else:
            results.append(t)
            
tweets = [t['text'] for t in results]

# Quick summary
print 'query:   ', query
print 'results: ', len(results)
print 'retweets:', len(retweets)
print 'Variable `tweets` has a list of all the tweet texts'
Stopping at page: 23
query:    big data
results:  196
retweets: 134
Variable `tweets` has a list of all the tweet texts

Text statistics

Let's see what the first 10 tweets look like:

In [9]:
tweets[:10]
Out[9]:
[u'Fairy tales are hard to write. \u201c@samplereality: Big data often reduces complexity in a way similar to bedtime stories nd fairy tales...\u201d',
 u'Forbes on impact of #BigData analytics across a kaleidoscopic plethora of industries http://t.co/PRQD8NU1',
 u'Pre-FAQ Why fast skip for LZF? To suppo fast Range access (over HTTP) on big compressed data',
 u'Big Rogers Problems: @TechCrunch is even reporting about it! Internet and phone data down: http://t.co/7tssVP2G',
 u'CDNs Impact Value Chain for Big Data and Media Content Delivery http://t.co/zGa2Zkxn #photography #dslr',
 u'Big Bets On Big Data - #BigData and #datascience will become the new "hot" discipline for the next generation - http://t.co/s5bRIHix',
 u'Business schools and digital humanities: big data, distant reading...hegemony of cybernetic reason is everywhere today',
 u'Mercy Health shows how #BigData is like watching a million fireflies, and knowing each one. http://t.co/h8u4ugFz #healthcare #CEP #analytics',
 u'White Paper: Marketing With Big Data to Increase ROI http://t.co/pBMeSHSQ',
 u'The Science Behind Social Media, Natural Language and Big Data (Infographic.. http://t.co/LzEQuL67 (via @ArtilleryMarket)']

Now we clean up the tweets by removing the common words defined above, so that we can then compute some basic statistics:

In [10]:
remove = tu.removal_set(words_to_remove, query)
lines = tu.lines_cleanup([tweet['text'].encode('utf-8') for tweet in results], remove=remove)
words = '\n'.join(lines).split()
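The code of text_utils.py is not shown here, so as a rough illustration of what a cleanup step of this kind might do, here is a minimal sketch. The function names `sketch_removal_set` and `sketch_lines_cleanup`, the `min_length` parameter, and the exact filtering rules are all assumptions made for this example, not the actual text_utils implementation:

```python
def sketch_removal_set(words, query):
    """Hypothetical stand-in: stop words plus the query terms, lowercased."""
    return set(words.lower().split()) | set(query.lower().split())


def sketch_lines_cleanup(lines, min_length=4, remove=None):
    """Hypothetical stand-in: lowercase each line, then drop words that are
    shorter than `min_length` or that appear in the `remove` set."""
    remove = remove or set()
    cleaned = []
    for line in lines:
        kept = [w for w in line.lower().split()
                if len(w) >= min_length and w not in remove]
        if kept:  # skip lines that end up empty
            cleaned.append(' '.join(kept))
    return cleaned


# Tiny made-up example (not real Twitter data)
rm = sketch_removal_set("with some", "big data")
sample = ["Big Data with SOME analytics tools", "big data"]
cleaned_sample = sketch_lines_cleanup(sample, remove=rm)
# cleaned_sample == ['analytics tools']
```

Splitting on whitespace only means that tokens such as `#analytics` or `(infographic)` are kept as distinct words, which matches what we see in the frequency tables for this dataset.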

Compute frequency histogram:

In [11]:
wf = tu.word_freq(words)
sorted_wf = tu.sort_freqs(wf)
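Counting word frequencies with a plain dictionary takes only a few lines. The sketch below uses hypothetical names (`count_words`, `sort_by_count`) and is not the text_utils code; it sorts in ascending order of count on the assumption that the most frequent words should come last, which is how `sorted_wf` is sliced later in this notebook:

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it occurs."""
    freqs = {}
    for w in words:
        freqs[w] = freqs.get(w, 0) + 1
    return freqs


def sort_by_count(freqs):
    """Return (word, count) pairs sorted by ascending count, ties by word."""
    return sorted(freqs.items(), key=lambda wc: (wc[1], wc[0]))


# Tiny made-up example
sample_words = "media social media data social media".split()
sample_sorted = sort_by_count(count_words(sample_words))
# sample_sorted == [('data', 1), ('social', 2), ('media', 3)]
```

With this ordering, the n most frequent words are simply the last n entries of the sorted list.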

Let's look at a summary of the word frequencies from this dataset:

In [12]:
tu.summarize_freq_hist(sorted_wf)
Number of unique words: 997

10 least frequent words:
           represent -> 1
               chain -> 1
http://t.co/jbc0dqzg -> 1
             managem -> 1
               leads -> 1
               talks -> 1
                disk -> 1
           reasoning -> 1
   @artillerymarket) -> 1
           retention -> 1

10 most frequent words:
    marketing -> 9
   #analytics -> 13
    analytics -> 14
(infographic) -> 22
      natural -> 29
     language -> 29
       behind -> 30
       social -> 32
      science -> 32
        media -> 32

Now we can plot the histogram of the n_words most frequent words:

In [13]:
n_words = 10
tu.plot_word_histogram(sorted_wf, n_words, "Frequencies for %s most frequent words" % n_words);

Above we trimmed the histogram to show only the n_words most frequent words because the distribution is very sharply peaked; this is what the histogram for the whole word list looks like:

In [14]:
tu.plot_word_histogram(sorted_wf, 1.0, "Frequencies for entire word list");

Co-occurrence graph

An interesting question to ask is: which pairs of words co-occur in the same tweets? We can find these relations and use them to construct a graph, which we can then analyze with NetworkX and plot with Matplotlib.

We limit the graph to at most n_nodes nodes (the most frequent words), just to keep the visualization easier to read.

In [15]:
n_nodes = 10
popular = sorted_wf[-n_nodes:]
pop_words = [wc[0] for wc in popular]
co_occur = tu.co_occurrences(lines, pop_words)
wgraph = tu.co_occurrences_graph(popular, co_occur, cutoff=1)
wgraph = nx.connected_component_subgraphs(wgraph)[0]
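To make the idea of co-occurrence concrete, here is a minimal sketch of how pairs of words appearing in the same tweet could be counted using only the standard library. The names `count_co_occurrences` and `edges_above_cutoff` are hypothetical, and the interpretation of the cutoff (keep pairs seen strictly more than `cutoff` times) is an assumption; the text_utils functions used above may behave differently:

```python
from itertools import combinations


def count_co_occurrences(lines, tracked_words):
    """Count, for every pair of tracked words, how many lines contain both."""
    tracked = set(tracked_words)
    counts = {}
    for line in lines:
        # Unique tracked words in this line, sorted so each pair has one key
        present = sorted(tracked.intersection(line.split()))
        for pair in combinations(present, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def edges_above_cutoff(counts, cutoff=1):
    """Return (word_a, word_b, weight) for pairs seen more than `cutoff` times."""
    return [(a, b, n) for (a, b), n in sorted(counts.items()) if n > cutoff]


# Tiny made-up example
sample_lines = ["social media science", "social media", "science behind"]
sample_counts = count_co_occurrences(
    sample_lines, ["social", "media", "science", "behind"])
sample_edges = edges_above_cutoff(sample_counts, cutoff=1)
# sample_edges == [('media', 'social', 2)]
```

A weighted edge list in this form can be loaded into NetworkX with `nx.Graph().add_weighted_edges_from(...)`, after which the graph can be analyzed as in the following cells.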

An interesting summary of the graph structure can be obtained by ranking nodes based on a centrality measure. NetworkX offers several centrality measures; in this case we look at the eigenvector centrality:

In [16]:
centrality = nx.eigenvector_centrality_numpy(wgraph)
tu.summarize_centrality(centrality)
Graph centrality
         social: 0.389
          media: 0.389
        science: 0.383
       language: 0.383
         behind: 0.383
        natural: 0.383
  (infographic): 0.332
      analytics: 0.00952
     #analytics: 0.00952
      marketing: 0.00712
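Eigenvector centrality scores a node highly when it is connected to other high-scoring nodes; formally, the scores are the entries of the leading eigenvector of the graph's adjacency matrix. As an illustration of the idea only (this is a plain power iteration written for this example, not the NumPy-based routine NetworkX calls above), a minimal sketch on a graph stored as a dictionary of neighbors could look like this. The function name and the dictionary layout are assumptions:

```python
import math


def power_iteration_centrality(adjacency, max_iter=1000, tol=1e-10):
    """Eigenvector centrality by power iteration.

    `adjacency` maps each node to a dict {neighbor: weight} and must be
    symmetric (undirected graph). Scores are scaled to unit Euclidean norm.
    """
    nodes = sorted(adjacency)
    x = dict((n, 1.0 / len(nodes)) for n in nodes)
    for _ in range(max_iter):
        # Multiply by (A + I): keep the old score, add weighted neighbor
        # scores. The identity shift avoids oscillation on bipartite graphs.
        new = dict(x)
        for n in nodes:
            for nbr, w in adjacency[n].items():
                new[n] += w * x[nbr]
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = dict((n, v / norm) for n, v in new.items())
        if sum(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")


# Tiny made-up star graph: 'hub' is linked to 'a' and 'b'
star = {"hub": {"a": 1.0, "b": 1.0}, "a": {"hub": 1.0}, "b": {"hub": 1.0}}
star_scores = power_iteration_centrality(star)
# roughly 0.707 for 'hub' and 0.5 for each of 'a' and 'b'
```

In the star example the hub gets the highest score because both leaves point to it, while each leaf only inherits importance from the hub. The same logic explains the results above: the tightly interlinked words share nearly identical high scores, while words that are only weakly attached to that cluster score close to zero.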

And we can use this measure to provide an interesting view of the structure of our query dataset:

In [17]:
print "Graph visualization for query:", query
tu.plot_graph(wgraph, tu.centrality_layout(wgraph, centrality), plt.figure(figsize=(8,8)),
    title='Centrality and term co-occurrence graph, q="%s"' % query)
Graph visualization for query: big data