This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.
Email me: email.ryan.kelly@gmail.com
I just started, give me a break
In Progress:
Coming Soon:
Generally, raw text is fairly meaningless for a computer to process. Natural language processing involves transforming text into meaningful numbers so that we can leverage machine learning techniques such as clustering, classification, support vector machines, and many others.
The preferred approach to quantify text information is the bag-of-word technique. For every word in the text body, its occurance is tallied then recorded as a vector in a process known as vectorization.
Let's start at the beginning:
First we need to load the nltk library, then open the bundled collections of data that they provide for free using nltk.download()
.
Once you enter this command, dowload all the data they provide for the book, as we will be using some of the content here.
import nltk
nltk.download()
showing info http://nltk.github.com/nltk_data/
True
Now that we have some text to work with, we can import what we downloaded.
from nltk.book import *
*** Introductory Examples for the NLTK Book *** Loading text1, ..., text9 and sent1, ..., sent9 Type the name of the text or sentence to view it. Type: 'texts()' or 'sents()' to list the materials. text1: Moby Dick by Herman Melville 1851 text2: Sense and Sensibility by Jane Austen 1811 text3: The Book of Genesis text4: Inaugural Address Corpus text5: Chat Corpus text6: Monty Python and the Holy Grail text7: Wall Street Journal text8: Personals Corpus text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
<Text: Moby Dick by Herman Melville 1851>
The most basic commends we can do with text are simple keyword searches. We can do this using the concordance view
. I think the chat corpus (text 5) is the most interesting for now, as it contains ~10 000 chat posts from age-specific chat rooms.
The concordance view
finds the keyword of interest, then shows surrounding sentence to give context.
Here we search for text containing the word "weather".
text5.concordance("weather")
Displaying 7 of 7 matches: for asking -- can you believe this weather -- Yeah it 's fall !!! hi U46 good to look out the window to check the weather ... haha www.Wunderground.com : . L to look out the window to check the weather ... thanks !!! . Advisory //www.wun end Live Oak , California ( 95953 ) weather ) www.Wunderground.com : . Fairbank ( end Fairbanks , Alaska ( 99701 ) weather ) sigh i am 20 boy PART from azerba to look out the window to check the weather ... www.Wunderground.com : . Aberde d Aberdeen , South Dakota ( 57401 ) weather ) i 'd never leave you alone U35 ..
While these datasets are interesting enough, lets pull in some Twitter data instead!
First you will need to create an application to retrieve your authentication keys which allow you to access the Twitter API. You will also need to pip install twitter
.
import twitter
import json
# Load in your application keys
CONSUMER_KEY = 'EwFMhx5EOCp7kAyxf5gkfRNAr'
CONSUMER_SECRET ='eRKvDIWWMWCWOM1aztw6wK4A9nogJBvDZZBT1QVOtUf2PVMDlI'
OAUTH_TOKEN = '1358375179-QXI5GTDTAaS6KytB249aleNEtOQRSMB4vYlvhHA'
OAUTH_TOKEN_SECRET = 'MX20qtIAnibJwYrGYVSBBbTQCIr6Ss2Tg3NZs7cBuaHEf'
auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
CONSUMER_KEY, CONSUMER_SECRET)
# Create twitter streaming object
twitter_api = twitter.Twitter(auth=auth)
print twitter_api
<twitter.api.Twitter object at 0x1205cd3d0>
q = '#RaptorsDay'
count = 100
results = twitter_api.search.tweets(q=q, count=count)
statuses = results['statuses']
# Iterate through 5 more batches of results by following the cursor
for _ in range(15):
print "Length of statuses", len(statuses)
try:
next_results = results['search_metadata']['next_results']
except KeyError, e: # No more results when next_results doesn't exist
break
# Create a dictionary from next_results, which has the following form:
kwargs = dict([ kv.split('=') for kv in next_results[1:].split("&") ])
search_results = twitter_api.search.tweets(**kwargs)
statuses += results['statuses']
# Get just the tweets from the data
status_texts = [ status['text']
for status in statuses ]
Length of statuses 100 Length of statuses 200 Length of statuses 400 Length of statuses 800 Length of statuses 1600 Length of statuses 3200 Length of statuses 6400 Length of statuses 12800 Length of statuses 25600 Length of statuses 51200 Length of statuses 102400 Length of statuses 204800 Length of statuses 409600 Length of statuses 819200 Length of statuses 1638400
import pandas as pd
tweets = pd.DataFrame(status_texts)
tweets.to_csv("/users/ryankelly/projects/raptorsday.csv", sep = "\t", encoding='utf-8')
0 | |
---|---|
0 | beats,instrumentals,lyrics or features for you... |
1 | @Raptors I didn't know it was #RaptorsDay e... |
2 | RT @RaptorsDancePak: It's official! May 12th i... |
3 | #CityOfToronto needs to get some self-respect.... |
4 | RT @Raptors: Pardon the glare, but a closer lo... |
5 rows × 1 columns