Machine Learning: Natural Language Processing¶

This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more.

Email me: email.ryan.kelly@gmail.com

This notebook covers or includes:¶

Introduction to language processing
Gathering Twitter data and pre processing tweets

Data Sources:¶

nltk book

TO DO:¶

I just started, give me a break

In Progress:

Data Cleaning
Regression

Coming Soon:

Clustering
Dimension Reduction
Natural Language Processing

Getting Started:¶

Generally, raw text is fairly meaningless for a computer to process. Natural language processing involves transforming text into meaningful numbers so that we can leverage machine learning techniques such as clustering, classification, support vector machines, and many others.

The preferred approach to quantify text information is the bag-of-word technique. For every word in the text body, its occurance is tallied then recorded as a vector in a process known as vectorization.

Let's start at the beginning:

In [ ]:

First we need to load the nltk library, then open the bundled collections of data that they provide for free using nltk.download(). Once you enter this command, dowload all the data they provide for the book, as we will be using some of the content here.

In [8]:

import nltk
nltk.download()

showing info http://nltk.github.com/nltk_data/

Out[8]:

True

Now that we have some text to work with, we can import what we downloaded.

In [9]:

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

In [10]:

text1

Out[10]:

<Text: Moby Dick by Herman Melville 1851>

The most basic commends we can do with text are simple keyword searches. We can do this using the concordance view. I think the chat corpus (text 5) is the most interesting for now, as it contains ~10 000 chat posts from age-specific chat rooms.

The concordance view finds the keyword of interest, then shows surrounding sentence to give context.

Here we search for text containing the word "weather".

In [112]:

text5.concordance("weather")

Displaying 7 of 7 matches:
 for asking -- can you believe this weather -- Yeah it 's fall !!! hi U46 good 
to look out the window to check the weather ... haha www.Wunderground.com : . L
to look out the window to check the weather ... thanks !!! . Advisory //www.wun
end Live Oak , California ( 95953 ) weather ) www.Wunderground.com : . Fairbank
 ( end Fairbanks , Alaska ( 99701 ) weather ) sigh i am 20 boy PART from azerba
to look out the window to check the weather ... www.Wunderground.com : . Aberde
d Aberdeen , South Dakota ( 57401 ) weather ) i 'd never leave you alone U35 ..

While these datasets are interesting enough, lets pull in some Twitter data instead!

Retrieving tweets from Twitter:¶

First you will need to create an application to retrieve your authentication keys which allow you to access the Twitter API. You will also need to pip install twitter.

In [120]:

import twitter
import json
# Load in your application keys
CONSUMER_KEY = 'EwFMhx5EOCp7kAyxf5gkfRNAr'
CONSUMER_SECRET ='eRKvDIWWMWCWOM1aztw6wK4A9nogJBvDZZBT1QVOtUf2PVMDlI'
OAUTH_TOKEN = '1358375179-QXI5GTDTAaS6KytB249aleNEtOQRSMB4vYlvhHA'
OAUTH_TOKEN_SECRET = 'MX20qtIAnibJwYrGYVSBBbTQCIr6Ss2Tg3NZs7cBuaHEf'

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
# Create twitter streaming object
twitter_api = twitter.Twitter(auth=auth)

print twitter_api

<twitter.api.Twitter object at 0x1205cd3d0>

In [151]:

q = '#RaptorsDay' 

count = 100

results = twitter_api.search.tweets(q=q, count=count)

statuses = results['statuses']


# Iterate through 5 more batches of results by following the cursor

for _ in range(15):
    print "Length of statuses", len(statuses)
    try:
        next_results = results['search_metadata']['next_results']
    except KeyError, e: # No more results when next_results doesn't exist
        break
        
    # Create a dictionary from next_results, which has the following form:
    kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
    
    search_results = twitter_api.search.tweets(**kwargs)
    statuses += results['statuses']

# Get just the tweets from the data

status_texts = [ status['text'] 
                 for status in statuses ]

Length of statuses 100
Length of statuses 200
Length of statuses 400
Length of statuses 800
Length of statuses 1600
Length of statuses 3200
Length of statuses 6400
Length of statuses 12800
Length of statuses 25600
Length of statuses 51200
Length of statuses 102400
Length of statuses 204800
Length of statuses 409600
Length of statuses 819200
Length of statuses 1638400

In [155]:

import pandas as pd

tweets = pd.DataFrame(status_texts)

tweets.to_csv("/users/ryankelly/projects/raptorsday.csv", sep = "\t", encoding='utf-8')

In [150]:

Out[150]:

	0
0	beats,instrumentals,lyrics or features for you...
1	@Raptors I didn't know it was #RaptorsDay e...
2	RT @RaptorsDancePak: It's official! May 12th i...
3	#CityOfToronto needs to get some self-respect....
4	RT @Raptors: Pardon the glare, but a closer lo...

5 rows × 1 columns

In [ ]: