Here, we're going to go over some basic text cleaning steps in Python.
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]
NLTK makes it easy to convert documents-as-strings into lists of tokens, a process called tokenization.
from nltk.tokenize import word_tokenize
tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print tokenized_docs
[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]
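For comparison, here is a minimal tokenizer built with nothing but the standard library's `re` module. This is a sketch, not a replacement for `word_tokenize` — notice that it treats contractions differently:

```python
import re

def naive_tokenize(doc):
    # Grab runs of word characters (allowing one internal apostrophe),
    # or any single non-space punctuation mark. Much cruder than NLTK.
    return re.findall(r"\w+'?\w*|[^\w\s]", doc)

print(naive_tokenize("They won't be very interesting, I'm afraid."))
# NLTK splits "won't" into "wo"/"n't"; this sketch keeps "won't" whole.
```

NLTK's tokenizer splits contractions on linguistic grounds ("wo" + "n't" ≈ "will" + "not"); a regex like this has no such knowledge.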
Punctuation helps tokenizers find token boundaries, but once tokenization is done, there's usually no reason to keep it around. There are many ways to remove punctuation. Since we have already learned regex, how would we do this?
import re
import string
regex = re.compile('[%s]' % re.escape(string.punctuation)) #see documentation here: http://docs.python.org/2/library/string.html
tokenized_docs_no_punctuation = []
for review in tokenized_docs:
    new_review = []
    for token in review:
        new_token = regex.sub(u'', token)
        if new_token != u'':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)
print tokenized_docs_no_punctuation
[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences'], ['They', 'wo', u'nt', 'be', 'very', 'interesting', 'I', u'm', 'afraid'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', u'learn', 'how', 'basic', 'text', 'cleaning', u'works', 'on', u'very', u'simple', 'data']]
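If you are on Python 3, `str.translate` with `str.maketrans` is a regex-free way to do the same thing. A sketch of the idea — note that, like the regex above, it also deletes the apostrophe inside contractions:

```python
import string

# Map every punctuation character to None, which deletes it.
strip_punct = str.maketrans('', '', string.punctuation)

tokens = ['They', 'wo', "n't", 'be', 'very', 'interesting', ',']
cleaned = [t.translate(strip_punct) for t in tokens]
cleaned = [t for t in cleaned if t]  # drop tokens that were pure punctuation
print(cleaned)
```

Because the translation table is built once, this tends to be faster than running a regex substitution per token.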
Some words are so common that they carry almost no signal on their own; these are called stop words, and NLTK ships lists of them for many languages.
from nltk.corpus import stopwords
tokenized_docs_no_stopwords = []
for doc in tokenized_docs_no_punctuation:
    new_term_vector = []
    for word in doc:
        if word not in stopwords.words('english'):
            new_term_vector.append(word)
    tokenized_docs_no_stopwords.append(new_term_vector)
print tokenized_docs_no_stopwords
[['Here', 'simple', 'basic', 'sentences'], ['They', 'wo', u'nt', 'interesting', 'I', u'm', 'afraid'], ['The', 'point', 'examples', u'learn', 'basic', 'text', 'cleaning', u'works', u'simple', 'data']]
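One practical note: `stopwords.words('english')` returns a list, and the loop above rebuilds it for every single word. Converting it to a `set` once makes each membership test O(1). A sketch with a small hand-picked stopword set standing in for NLTK's (the real English list is much longer, and this version also lowercases before the lookup, which the loop above does not):

```python
# Tiny stand-in for set(stopwords.words('english')).
stopword_set = {'are', 'some', 'very', 'be', 'of', 'these', 'is', 'to', 'how', 'on'}

doc = ['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences']
filtered = [w for w in doc if w.lower() not in stopword_set]
print(filtered)
```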
If you have taken linguistics, you may be familiar with morphology, the study of how words are built from smaller units. The idea here is that many surface forms share a root: if you want to reduce a word to its base form, you can apply a stemmer or a lemmatizer. Here are three popular implementations ready to go right out of the NLTK box; it's up to you to decide which one fits your task.
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
wordnet = WordNetLemmatizer()
preprocessed_docs = []
for doc in tokenized_docs_no_stopwords:
    final_doc = []
    for word in doc:
        final_doc.append(porter.stem(word))
        #final_doc.append(snowball.stem(word))
        #final_doc.append(wordnet.lemmatize(word)) #note that lemmatize() can also take a part of speech as an argument!
    preprocessed_docs.append(final_doc)
print preprocessed_docs
[['Here', 'simpl', 'basic', 'sentenc'], ['They', 'wo', u'nt', 'interest', 'I', u'm', 'afraid'], ['The', 'point', 'exampl', u'learn', 'basic', 'text', 'clean', u'work', u'simpl', 'data']]
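To get a feel for what a stemmer is doing, here is a toy suffix-stripper. This is purely illustrative — the real Porter algorithm applies ordered rule sets with conditions on the measure of the remaining stem — but it reproduces a few of the outputs above:

```python
def toy_stem(word):
    # Strip one of a few common English suffixes, but only if at least
    # three characters of stem would remain.
    for suffix in ('ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in ['cleaning', 'works', 'examples', 'data']])
# -> ['clean', 'work', 'exampl', 'data']
```

Note that stemmers chop rather than analyze: `exampl` is not an English word, which is fine for matching documents but not for display.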
import os
import csv
#os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text2/extra/')
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]
#your code here!!!
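If the file layout is unclear, here is how `csv.DictReader` behaves on a small in-memory example (Python 3 shown, using `io.StringIO`; the `review_id`/`review_text` sample rows are made up for illustration):

```python
import csv
import io

# A fake two-row CSV standing in for amazon/sociology_2010.csv.
sample = io.StringIO(
    "review_id,review_text\n"
    "1,Great book\n"
    "2,Not my thing\n"
)
reader = csv.DictReader(sample, delimiter=',')
reviews = [row['review_text'] for row in reader]
print(reviews)
```

Each row comes back as a dict keyed by the header line, which is why `row['review_text']` works above.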
Recall that HTML entities are an artifact of the pre-Unicode era: browsers know to render them a certain way on the page, but for text analysis we don't need them anymore.
Here's some code that will do this for you (function courtesy of Fredrik Lundh, author of ElementTree).
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
# AUTHOR: Fredrik Lundh
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
test_string = "<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)"
print test_string
print unescape(test_string)
<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)
<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the "Chicken Soup for the Soul" series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)
import nltk
nltk.clean_html(unescape(test_string.decode('utf8'))) #notice that it returns unicode!
u'While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don\'t like the "Chicken Soup for the Soul" series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)'
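On Python 3, `htmlentitydefs` and `unichr` are gone (`html.entities` and `chr` replace them), and `nltk.clean_html` was removed in NLTK 3 in favor of dedicated HTML parsers. The standard library's `html.unescape` covers the entity step on its own — a sketch that also strips tags with a crude regex (a real parser such as BeautifulSoup is safer on messy HTML, and the sample string is shortened for illustration):

```python
import html
import re

test_string = '<p>I never felt manipulated by the &quot;Chicken Soup&quot; authors.'
unescaped = html.unescape(test_string)       # &quot; -> "
no_tags = re.sub(r'<[^>]+>', '', unescaped)  # crude tag stripper
print(no_tags)
```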