Data Mining Twitter with tweepy

Twitter is a worldwide database of human sentiment. About 500 million new tweets go out per day (source). The ability to stream, parse, and understand Twitter data at large scale has huge implications for marketing, humanitarian efforts, the social sciences, and many other ventures. The combination of Twitter and deep-learning methods like sentiment analysis has led to platforms like SocialBro, which mine Twitter data to provide in-depth analytics for businesses. It is even possible to track the entire emotional state of the world at any given time! As Python developers, we have access to easy-to-use tools that communicate directly with Twitter's data, and this puts a world of data right at our fingertips.

To make its data available, Twitter hosts a Representational State Transfer Application Programming Interface (REST API). The API dictates what data Twitter makes available, and REST refers to an architectural design pattern for building scalable web services. These concepts are explained quite well in the video below.

In [1]:
from IPython.display import YouTubeVideo

Many major web services use RESTful APIs, so it's important to get familiar with these concepts. The video shows how one can interact with an API through the browser itself, or through services like apigee. However, we can also access the Twitter API through Python. This lets us integrate Twitter data with other Python resources like NumPy, Matplotlib, and IPython.

There are at least 7 Python interfaces to Twitter's REST API. We will use tweepy, since its documentation is clear and there are interesting applications available to get started.

Installing Tweepy

First you will need to install tweepy. The most straightforward way is through the pip installation tool. Python >= 2.7.9 comes with pip installed; for earlier versions, see this guide for installing pip. The installation can be run from the command line using:

pip install tweepy

or from within a Canopy IPython shell:

!pip install tweepy

If you get this Exception:

TypeError: parse_requirements() got an unexpected keyword argument 'session'

Make sure you upgrade pip to the newest version:

pip install --upgrade pip

Alternatively, you can install tweepy from source by doing the following:

  • Go to the tweepy repo
  • Click Download ZIP
  • Extract to a known directory (e.g. /path/to/Desktop)
  • Open a terminal and cd into that folder
  • Type python setup.py install

Configuring Matplotlib

Let's import plotting functions (via pylab) and change default plot settings

In [2]:
#Import pylab, change some default matplotlib settings
%pylab inline

rcParams["figure.figsize"] = (12, 9) #<--- large default figures

# Plot text elements
rcParams['axes.labelsize'] = 17
rcParams['axes.titlesize'] = 17
rcParams['xtick.labelsize'] = 15
rcParams['ytick.labelsize'] = 15
Populating the interactive namespace from numpy and matplotlib


Twitter uses the OAuth protocol for secure application development. Considering all of the applications that access Twitter (for example, using your Twitter account to log in to a different website), this protocol prevents information like your password from being passed through these intermediate applications. While this is a great security measure for intermediate client access, it adds an extra step for us before we can directly communicate with the API. To access Twitter, you need to create an app through Twitter's application management page, which issues the consumer key and access tokens used below.

If you are in the PSIS programming course, you will be provided with a consumer and access code via email, which is linked to a shared, dummy account and dummy app.

Store your consumer key and consumer secret somewhere you'll remember them. I'm storing mine in Python strings, but for security, not displaying this step:

consumer_key = 'jrCYD....'
consumer_secret = '...' 

Here is a discussion of the difference between the access token and the consumer token, although for our intents and purposes it's not so important:

The consumer key is for your application and client tokens are for end users in your application's context. If you want to call in just the application context, then consumer key is adequate. You'd be rate limited per application and won't be able to access user data that is not public. With the user token context, you'll be rate limited per token/user, this is desirable if you have several users and need to make more calls than application context rate limiting allows. This way you can access private user data. Which to use depends on your scenarios.

Example 1: Read Tweets Appearing on Homepage

With the consumer_key and consumer_secret stored, let's try a Hello World example from tweepy's docs. This accesses the public tweets appearing on the user's feed as if they had logged in to Twitter. For brevity, we'll only print the first three.

In [3]:
import tweepy

consumer_key = 'eDpwuOF0vafv0M2HIuD0bTnqy'
consumer_secret = 'JAJxnmcEUkdBNq5oIiBTs8dw5VU9vzMNci4Ds2DnI16fGXF2Lk'

access_token ='3141915249-1y73cRbo1tMFn0x3UqJKKUDn4JiufavhoByJ4FE'
access_token_secret = 'LYwwsll0GbxiNcQgb7TrkD3vVYX3BZbrlpgoZtzvvf9m3'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for (idx, tweet) in enumerate(public_tweets[0:3]): #First 3 tweets in my public feed
    print 'TWEET %s:\n\n\t%s\n\n' % (idx, tweet.text)

	IBM's CTO for Smarter Energy Research on "smart grids" in the Pacific Northwest #ibmresearch @SmarterPlanet


	Soupy: a wrapper around BeautifulSoup that makes it easier to search through HTML and XML documents |


	Bring #CS to your community! Start, Volunteer, or Find a Club near you: #GWCClubs

When we used tweet.text, we implicitly used a Python class defined by tweepy.

In [4]:

There are many attributes associated with a Status object.

In [5]:

Example 2: Trending Topics for a Location

According to the tweepy API, we can return the top 10 trending topics for a specific location, where the location is given as a WOEID (Yahoo! Where On Earth ID).

The WOEID is a unique identifier, similar to a ZIP code, but extended worldwide. For example, my hometown of Pittsburgh has a WOEID of 2473224. You can search for WOEIDs here:

Let's return the top ten trending topics in Pittsburgh

In [6]:
top10 = api.trends_place(id=2473224)
top10
[{u'as_of': u'2015-04-07T15:17:10Z',
  u'created_at': u'2015-04-07T15:13:53Z',
  u'locations': [{u'name': u'Pittsburgh', u'woeid': 2473224}],
  u'trends': [{u'name': u'#NationalBeerDay',
    u'promoted_content': None,
    u'query': u'%23NationalBeerDay',
    u'url': u''},
   {u'name': u'New Castle',
    u'promoted_content': None,
    u'query': u'%22New+Castle%22',
    u'url': u''},
   {u'name': u'Duke',
    u'promoted_content': None,
    u'query': u'Duke',
    u'url': u''},
   {u'name': u'#OpeningDay',
    u'promoted_content': None,
    u'query': u'%23OpeningDay',
    u'url': u''},
   {u'name': u'Grayson Allen',
    u'promoted_content': None,
    u'query': u'%22Grayson+Allen%22',
    u'url': u''},
   {u'name': u'Rand Paul',
    u'promoted_content': None,
    u'query': u'%22Rand+Paul%22',
    u'url': u''},
   {u'name': u'#BadChoiceFuneralSongs',
    u'promoted_content': None,
    u'query': u'%23BadChoiceFuneralSongs',
    u'url': u''},
   {u'name': u'#WorldHealthDay',
    u'promoted_content': None,
    u'query': u'%23WorldHealthDay',
    u'url': u''},
   {u'name': u'James Best',
    u'promoted_content': None,
    u'query': u'%22James+Best%22',
    u'url': u''},
   {u'name': u'#stlwx',
    u'promoted_content': None,
    u'query': u'%23stlwx',
    u'url': u''}]}]

The result is a JSON object. JSON is a human- and machine-readable standardized data-encoding format.

In Python, decoded JSON objects become nested lists and dictionaries. JSON stands for JavaScript Object Notation, because its design is based on a subset of the JavaScript language; however, JSON is a data-encoding format with implementations in many languages.
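To see this decoding in action, here is a minimal sketch using Python's built-in json module and a made-up miniature payload in the same shape as the trends response (the values below are invented for illustration):

```python
import json

# A miniature, made-up payload shaped like a trends response
raw = ('[{"locations": [{"name": "Pittsburgh", "woeid": 2473224}], '
       '"trends": [{"name": "#NationalBeerDay"}, {"name": "Duke"}]}]')

data = json.loads(raw)                          # JSON text -> nested lists/dicts
names = [t['name'] for t in data[0]['trends']]  # plain dictionary/list access from here on
print(names)
```

Once decoded, there is nothing JSON-specific left: it's ordinary Python lists and dicts.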

Looking at this structure, we see that it's contained in a list; in fact, it's a list of one element. Let's access the top ten trend names:

In [7]:
top10[0]['trends']
[{u'name': u'#NationalBeerDay',
  u'promoted_content': None,
  u'query': u'%23NationalBeerDay',
  u'url': u''},
 {u'name': u'New Castle',
  u'promoted_content': None,
  u'query': u'%22New+Castle%22',
  u'url': u''},
 {u'name': u'Duke',
  u'promoted_content': None,
  u'query': u'Duke',
  u'url': u''},
 {u'name': u'#OpeningDay',
  u'promoted_content': None,
  u'query': u'%23OpeningDay',
  u'url': u''},
 {u'name': u'Grayson Allen',
  u'promoted_content': None,
  u'query': u'%22Grayson+Allen%22',
  u'url': u''},
 {u'name': u'Rand Paul',
  u'promoted_content': None,
  u'query': u'%22Rand+Paul%22',
  u'url': u''},
 {u'name': u'#BadChoiceFuneralSongs',
  u'promoted_content': None,
  u'query': u'%23BadChoiceFuneralSongs',
  u'url': u''},
 {u'name': u'#WorldHealthDay',
  u'promoted_content': None,
  u'query': u'%23WorldHealthDay',
  u'url': u''},
 {u'name': u'James Best',
  u'promoted_content': None,
  u'query': u'%22James+Best%22',
  u'url': u''},
 {u'name': u'#stlwx',
  u'promoted_content': None,
  u'query': u'%23stlwx',
  u'url': u''}]

As you can see, there's a lot of metadata that goes into even a simple tweet. Let's cycle through each of these trends and print the name and URL of each.

In [8]:
for trend in top10[0]['trends']:
    print trend['name'], trend['url']
New Castle
Grayson Allen
Rand Paul
James Best

Example 3: Searching for Tweets

We can mine tweets using either search or stream.

The key difference between stream and search is that stream provides new data as it comes in, while search can be used to query old data. The search API is more powerful for queries, and provides faster access to a wide range of data. Check out 1400DEV for more about search vs. stream.

Before going forward, you can try doing some search query through Twitter's webpage.

Twitter employs a special query language. For example, the query "traffic?" will return tweets that contain the word traffic and are phrased as a question. Check out more examples here.

Search is implemented directly through tweepy.api. Let's search for a single tweet about traffic, phrased as a question.

In [9]:
results ='traffic?', count=1)
print type(results)
<class 'tweepy.models.SearchResults'>

The result is a tweepy.models.SearchResults class (see tweepy's other models here). Rather than just dumping a bunch of JSON data on us, the tweepy API has decoded the JSON and put it into a more Pythonic object. So, for example, we can access the message in the tweet via Python attribute access.

In [10]:
print 'CREATED: %s\n%s\n\n' % (results[0].created_at, results[0].text)
CREATED: 2015-04-07 15:16:51
RT @scontorno: Have an outstanding parking ticket or traffic fine? Pay it April 18 without the late fee.

Let's find 5 tweets that contain the word "beefsteak" near Washington, DC. We can provide this as a geocode, a "latitude,longitude,radius" string, which I looked up for GWU on Google Maps. We can also specify how far back to look in time; in this case, don't show anything prior to 3/25/2015.
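Before calling search, we can assemble that geocode string explicitly. The coordinates below are approximate values for GWU (an assumption for illustration, not the exact values used later):

```python
# Assemble a geocode string in the "latitude,longitude,radius" form Twitter expects
# (lat/lng below are approximate GWU coordinates -- an assumption)
lat, lng, radius = 38.8997, -77.0489, '1mi'
geocode = '%s,%s,%s' % (lat, lng, radius)
print(geocode)
```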

In [11]:
for tweet in'beefsteak since:2015-3-25', count=5, show_user=False,
                          geocode='38.8997,-77.0489,1mi'):  # approximate "lat,long,radius" for GWU
    print tweet.created_at, '\n',  tweet.text, '\n\n'
2015-04-07 01:51:30 
RT @javiermunoz0909: [email protected] @opsoms @GillesCollette Guys here is my #SafeFood contribution:) from @beefsteak by chef @chefjoseandres ht… 

2015-04-07 00:46:10 
RT @javiermunoz0909: [email protected] @opsoms @GillesCollette Guys here is my #SafeFood contribution:) from @beefsteak by chef @chefjoseandres ht… 

2015-04-07 00:04:24 
[email protected] @opsoms @GillesCollette Guys here is my #SafeFood contribution:) from @beefsteak by chef @chefjoseandres 

2015-04-06 22:46:09 
from Chef José Andrés (at @Beefsteak in Washington, DC) 

2015-04-03 19:25:54 
Finally made it over to @chefjoseandres' @beefsteak. Kimchi-wa. Review to follow. 

Example 4: Streaming and Data Mining

This streaming tutorial closely follows Adil Moujahid's great tweepy streaming example.

Twitter offers a Streaming API to make it easier to query streams of tweets. The Streaming API encapsulates some pain points of REST access and ensures that streaming calls don't exceed the rate limit. Think of it as Twitter's suggested means for beginners to stream data. You don't have to use it, but it's recommended and will make life easier. There are three stream types:

  • Public Streams: Streams of public data flowing through Twitter. Suitable for following specific users or topics, and for data mining.

  • User Streams: Single-user streams, containing roughly all of the data corresponding to a single user's view of Twitter.

  • Site Streams: The multi-user version of user streams.

We'll resist the temptation to mess with our friends' Twitter accounts and focus solely on Public Streams. Combining these streams with text filters will let us accumulate content. For example, we could look for tweets containing the text "foxnews". tweepy and Twitter's API will configure the stream and filter to work together nicely; you just provide the content tags you're interested in. Finally, remember that the more obscure the content, the longer it will take to find.

The following snippet will run until `max_tweets` or `max_seconds` is reached. If run in the notebook, it will hold up other cells for the allotted time; therefore, for long runtimes, you may want to run it as an external Python program, which you can terminate at will. I also recommend restarting the notebook kernel before running this cell multiple times.

In [43]:
#Import the necessary methods from tweepy library
import sys
import time
import datetime

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#This is a basic listener that just prints received tweets to stdout.
class StreamParser(StreamListener):
    """ Controls how streaming data is parsed. Pass an outfile, or data will be written to 
    sys.stdout (eg the screen) """

    def __init__(self, outfile=None, max_tweets=5, max_seconds=30):
        self.counter = 0
        self.start_time = time.time()

        # Set upper limits on maximum tweets or seconds before timeout
        self.max_tweets = max_tweets
        self.max_seconds = max_seconds

        if outfile:
            self.stdout = open(outfile, 'w')
        else:
            self.stdout = sys.stdout

    def on_data(self, data):
        """ Data is a string, but formatted for json. Parses it"""
        self.counter += 1

        # time data is all timestamps.
        current_time = time.time()
        run_time = current_time - self.start_time

        # If we want to read time, easiest way is to convert from timestamp using datetime
        formatted_time = datetime.datetime.fromtimestamp(current_time)

        # Technically, might not be the best place to put kill statements, but works well enough
        if self.max_tweets:
            if self.counter > self.max_tweets:
                self._kill_stdout()
                raise SystemExit('Max tweets of %s exceeded.  Killing stream... see %s' \
                             % (self.max_tweets, self.stdout))

        if self.max_seconds:
            if run_time > self.max_seconds:
                self._kill_stdout()
                raise SystemExit('Max time of %s seconds exceeded.  Killing stream... see %s' \
                                 % (self.max_seconds, self.stdout))

        print 'Tweet %s at %s.\nEllapsed: %.2f seconds\n' % \
             (self.counter, formatted_time, run_time)

        # Write the raw tweet to file; returning True keeps the stream alive
        self.stdout.write(data)
        return True

    def _kill_stdout(self):
        """ If self.stdout is a file, close it.  If sys.stdout, pass"""
        if self.stdout is not sys.stdout:
            self.stdout.close()

    def on_error(self, status):
        print status

#This handles Twitter authentication and the connection to the Twitter Streaming API
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Stream 5 tweets, no matter the time it takes!
listener = StreamParser(outfile='test.txt', max_tweets=5, max_seconds=None)
stream = Stream(auth, listener)

#This line filters the Twitter stream to capture data with the keywords: 'obama', 'kenya', 'shabab', 'puppies'
stream.filter(track=['obama', 'kenya', 'shabab', 'puppies'])
Tweet 1 at 2015-04-02 21:19:19.315274.
Ellapsed: 0.79 seconds

Tweet 2 at 2015-04-02 21:19:19.411817.
Ellapsed: 0.88 seconds

Tweet 3 at 2015-04-02 21:19:19.642709.
Ellapsed: 1.11 seconds

Tweet 4 at 2015-04-02 21:19:19.805070.
Ellapsed: 1.27 seconds

Tweet 5 at 2015-04-02 21:19:19.862497.
Ellapsed: 1.33 seconds

An exception has occurred, use %tb to see the full traceback.

SystemExit: Max tweets of 5 exceeded.  Killing stream... see <closed file 'test.txt', mode 'w' at 0x7fa87d0498a0>
To exit: use 'exit', 'quit', or Ctrl-D.

How is Python translated into REST?

stream.filter(), which actually returned the tweets, is a method of the class tweepy.Stream. Stream provides a Python frontend, and on the backend sends HTTP requests to Twitter as described here. We could have avoided Python altogether and just sent HTTP requests directly, but this is laborious.

Because tweepy is open-source, we can look at the source code for the Stream class, here. Specifically, let's try to understand what the filter method is doing. Let's look at the filter source code explicitly:

def filter(self, follow=None, track=None, async=False, locations=None,
            stall_warnings=False, languages=None, encoding='utf8'):

    self.body = {}
    self.session.headers['Content-type'] = "application/x-www-form-urlencoded"
    if self.running:
        raise TweepError('Stream object already connected!')
    self.url = '/%s/statuses/filter.json' % STREAM_VERSION
    if follow:
        self.body['follow'] = u','.join(follow).encode(encoding)
    if track:
        self.body['track'] = u','.join(track).encode(encoding)
    if locations and len(locations) > 0:
        if len(locations) % 4 != 0:
            raise TweepError("Wrong number of locations points, "
                             "it has to be a multiple of 4")
        self.body['locations'] = u','.join(['%.4f' % l for l in locations])
    if stall_warnings:
        self.body['stall_warnings'] = stall_warnings
    if languages:
        self.body['language'] = u','.join(map(str, languages))
    self.session.params = {'delimited': 'length'}
 = ''

Essentially, keywords like track and locations can be used to customize which tweets are streamed. tweepy translates these into a series of HTTP requests and sends them to the Twitter API. For example, we can see how track is interpreted by the Twitter REST API here.
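As a rough sketch of that translation (a simplification of the filter source above, not the complete request tweepy sends), the track keywords end up as a single comma-joined, UTF-8-encoded field of the request body:

```python
# Mimic how filter() builds the 'track' field of the HTTP request body
# (a simplified sketch of the tweepy source shown above)
track = ['obama', 'kenya', 'shabab', 'puppies']

body = {}
body['track'] = u','.join(track).encode('utf8')  # comma-joined, UTF-8-encoded keywords
print(body['track'])
```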

Loading streamed data

While search returned Python objects, stream returns raw JSON data. The search API translated JSON data into more convenient Python objects; to parse the stream data, however, we'll have to work with JSON directly. This is a good exercise, because JSON is widely used and it's important to get familiar with it.

If only one tweet were saved, we could just use json.loads() to read it in right away, but for a file with multiple tweets, we need to read them in one at a time.

Each tweet's JSON object is one long line, so we can read the file line by line until an error is reached, at which point we just stop. Let's load the file streamed_5000.txt, which stores 5000 tweets (about 21 MiB) for the keywords 'obama', 'shabab', 'puppies' and 'kenya'. These keywords were chosen so we can do sentiment analysis later.
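To see that read-until-error pattern in isolation before applying it to the real file, here is a self-contained sketch that writes two fake one-line "tweets" to a temporary file and reads them back:

```python
import json
import tempfile

# Two fake one-line "tweets" standing in for real streamed data
fake_lines = [json.dumps({'text': 'hello'}), json.dumps({'text': 'world'})]

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(fake_lines) + '\n')
    fname = f.name

tweets = []
for line in open(fname, 'r'):
    try:
        tweets.append(json.loads(line))
    except ValueError:       # stop at the first malformed/partial line
        break

print([t['text'] for t in tweets])
```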

In [20]:
import json

tweets = []
for line in open('streamed_5000.txt', 'r'):
    try:
        tweets.append(json.loads(line))
    except ValueError:
        break
In [21]:

The tweet text itself is embedded in the text metadata field:

In [22]:
tweets[0]['text']
u'RT @cnni: 147 killed in university massacre, deadliest terror attack in Kenya since U.S. Embassy bombed in 1998 #Gar\u2026'

Check out all of the metadata you can get from a tweet!

In [23]:
tweets[0].keys()

Within these fields, there's even more information. For example, the user and entities fields provide information about the user, as well as links and images (entities) embedded in the tweet:

In [24]:
tweets[0]['user']
{u'contributors_enabled': False,
 u'created_at': u'Tue Jul 06 02:14:03 +0000 2010',
 u'default_profile': False,
 u'default_profile_image': False,
 u'description': u'vivir y dejar vivir  respetar y tener compasion de mis semejantes.',
 u'favourites_count': 2789,
 u'follow_request_sent': None,
 u'followers_count': 577,
 u'following': None,
 u'friends_count': 726,
 u'geo_enabled': True,
 u'id': 163301786,
 u'id_str': u'163301786',
 u'is_translator': False,
 u'lang': u'en',
 u'listed_count': 2,
 u'location': u'u.s.a.',
 u'name': u'Imelda Villegas',
 u'notifications': None,
 u'profile_background_color': u'131516',
 u'profile_background_image_url': u'',
 u'profile_background_image_url_https': u'',
 u'profile_background_tile': True,
 u'profile_banner_url': u'',
 u'profile_image_url': u'',
 u'profile_image_url_https': u'',
 u'profile_link_color': u'009999',
 u'profile_sidebar_border_color': u'EEEEEE',
 u'profile_sidebar_fill_color': u'EFEFEF',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'imeldareyna46',
 u'statuses_count': 39437,
 u'time_zone': None,
 u'url': u'',
 u'utc_offset': None,
 u'verified': False}
In [25]:
tweets[0]['entities']
{u'hashtags': [{u'indices': [135, 140], u'text': u'GarissaAttack'}],
 u'symbols': [],
 u'trends': [],
 u'urls': [{u'display_url': u'',
   u'expanded_url': u'',
   u'indices': [112, 134],
   u'url': u''}],
 u'user_mentions': [{u'id': 2097571,
   u'id_str': u'2097571',
   u'indices': [3, 8],
   u'name': u'CNN International',
   u'screen_name': u'cnni'}]}
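Pulling items out of entities is plain dictionary work. Here is a small sketch using a trimmed-down version of the entities dict shown above:

```python
# A trimmed-down entities dict, modeled on the output above
entities = {
    'hashtags': [{'indices': [135, 140], 'text': 'GarissaAttack'}],
    'urls': [{'indices': [112, 134], 'url': ''}],
    'user_mentions': [{'id': 2097571, 'name': 'CNN International',
                       'screen_name': 'cnni'}],
}

# Extract the hashtag texts and the mentioned screen names
hashtags = [h['text'] for h in entities['hashtags']]
mentions = [m['screen_name'] for m in entities['user_mentions']]
print(hashtags, mentions)
```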

Check out this infographic on all of the metadata in a tweet, taken from Slaw: Canada's online legal magazine. Is this ethical?

In [26]:
from IPython.display import Image