# An introduction to NLTK - processing raw text and basic analysis

## Working with raw text

Some corpora come already marked up for use with NLTK, but you'll often want to work with your own texts. So how do we load them in and prepare them for use with NLTK? We'll start with some plain text (.txt) files of speeches and press releases from the Malcolm Fraser archive, held by the University of Melbourne, and look at some of the advantages and disadvantages of using NLTK, as well as common problems of data wrangling. You can check out the Fraser Archive here: http://www.unimelb.edu.au/malcolmfraser/

First of all, let's load in our text.

Via file management, open and inspect one file. What do you see? Are there any potential problems?

In [ ]:
from __future__ import division
import nltk, re, pprint
import os
from nltk import word_tokenize
from nltk.text import Text

In [ ]:
nltk.data.path.append('/home/researcher/nltk_data/')


Run the import statements above; you'll need them to load and process raw text. Now that the texts are in place, let's have a look at what is in the file directory.
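Directory listing is plain standard-library Python. Here is a small sketch; it lists the current directory ('.') as a stand-in for 'UMA_Fraser_Radio_Talks', which may not exist on your machine:

```python
import os

# List the entries in a directory and take the first three.
# Sorting gives a stable, predictable order.
entries = sorted(os.listdir('.'))
first_three = entries[:3]
```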

In [ ]:
#access items in the directory 'UMA_Fraser_Radio_Talks' and view the first 3
os.listdir('UMA_Fraser_Radio_Talks')[:3]


## Basic text analysis

First we'll read in one speech and tokenize it: that is, break it up into words and punctuation marks for analysis.
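To see the basic idea before running NLTK's tokenizer, here is a naive sketch using a plain whitespace split (the sample sentence is invented; NLTK's word_tokenize additionally splits punctuation off into its own tokens):

```python
# A naive tokenizer: split the text on whitespace.
# word_tokenize does this more carefully, e.g. separating full stops
# and commas from the words they follow.
sample = "The wool industry is of great importance to Australia"
tokens = sample.split()
```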

In [ ]:
#open a file and call the content 'speech'
first_file = os.path.join('UMA_Fraser_Radio_Talks', os.listdir('UMA_Fraser_Radio_Talks')[0])
speech = open(first_file).read()
#tokenize the speech and call the result 'vocab'
vocab = word_tokenize(speech)

In [ ]:
#how many tokens are in the speech?
len(vocab)

In [ ]:
#how many unique tokens (types) are there?
len(set(vocab))

In [ ]:
#how many times does 'South' occur?
vocab.count('South')

In [ ]:
#lexical diversity: on average, how many times is each unique word used?
len(vocab)/len(set(vocab))

In [ ]:
#find and sort all words longer than 12 characters
V = set(vocab)
long_words = [word for word in V if len(word) > 12]
sorted(long_words)


To perform more complex operations, such as concordancing, we need to wrap our tokens in an NLTK Text object.
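To see what concordancing does under the hood, here is a minimal sketch in plain Python (the token list is invented; NLTK's Text.concordance adds keyword alignment and nicer printing):

```python
# For each occurrence of the keyword, collect a small window of context
# around it -- two tokens either side here.
tokens = ['the', 'wool', 'price', 'fell', 'and', 'wool', 'exports', 'rose']
keyword = 'wool'
windows = []
for i, token in enumerate(tokens):
    if token == keyword:
        windows.append(tokens[max(0, i - 2):i + 3])
```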

In [ ]:
#wrap the tokens in a Text object, which supports concordancing and collocations
sent_vocab = Text(word_tokenize(speech))

In [ ]:
#show every occurrence of 'wool', with its surrounding context
sent_vocab.concordance('wool')

In [ ]:
#find pairs of words that occur together unusually often
sent_vocab.collocations()

In [ ]:
#build a table of the 15 most common words in the text
from nltk.probability import FreqDist
fdist1 = FreqDist(sent_vocab)
fdist1.tabulate(15)

In [ ]:
#graph the 20 most common words in the text
%matplotlib inline
fdist1.plot(20, cumulative=True)

In [ ]:
#the single most common token
fdist1.max()

In [ ]:
#what percentage of the text does 'Portland' account for?
100.0*fdist1.freq('Portland')

In [ ]:
#the first 20 tokens of the speech
vocab[:20]

In [ ]:
#unique words, ignoring case
len(set(word.lower() for word in vocab))

In [ ]:
#unique words, ignoring case and anything that isn't purely alphabetic
len(set(word.lower() for word in vocab if word.isalpha()))


## Exploring further: splitting up text

We've looked at one file, but the real strength of NLTK is exploring large bodies of text. When we manually inspected the first file, we saw that it contained a metadata section before the body of the text. We can ask Python to show us the start of the file. For analysis, it's useful to split the metadata off, both so that we can interrogate it separately and so that it won't distort our results when we analyse the body text.
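The splitting step itself is just a string split on a divider. A sketch (the divider string and the sample text here are invented; the real Fraser files use their own metadata marker):

```python
# Split a raw file into a metadata part and a body part on a divider line.
# 'END-METADATA' is a made-up marker for illustration only.
raw = "Title: Talk\nDate: 1/6/1975\nEND-METADATA\nFellow Australians, good evening."
divider = "END-METADATA"
metadata, body = raw.split(divider)
```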

In [ ]:
#view the first 100 characters of the first file
speech[:100]

In [ ]:
#open the first file, read it and then split it into two parts, metadata and body
#here we assume the divider string 'end metadata' separates the two sections
data = speech.split('end metadata')

In [ ]:
#view the first part
data[0]

In [ ]:
#split into lines, add '*' to the start of each line
for line in data[0].split('\r\n'):
    print '*', line

In [ ]:
#get rid of any line that starts with '<'
for line in data[0].split('\r\n'):
    if line[0] == '<':
        continue
    print '*', line

In [ ]:
#skip empty lines and any line that starts with '<'
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    print '*', line

In [ ]:
#split the metadata items on ':' so that we can interrogate each one
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':')
    print '*', element

In [ ]:
#actually, only split on the first colon
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    print '*', element


## Build a dictionary and define a function

We've now split up the elements of the metadata, but we want to be able to interrogate it so that we can start to find out something about the collection of files. To do that, we need to build a dictionary.
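The idea in miniature (with made-up metadata lines): split each 'key: value' line on the first colon and store the pair in a dictionary, so fields can be looked up by name:

```python
# Build a dictionary from 'key: value' metadata lines.
# Splitting on the first colon only keeps colons inside the value intact.
lines = ['Title: Radio Talk', 'Date: 12/3/1976']
metadata = {}
for line in lines:
    key, value = line.split(':', 1)
    metadata[key] = value.strip()
```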

In [ ]:
metadata = {}
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    #store each metadata field in the dictionary, stripping stray whitespace
    metadata[element[0]] = element[-1].strip(' ')
metadata

In [ ]:
#look up the date
metadata['Date']


Creating a function means we can perform an operation many times without typing out all the code each time. There are over 700 files in our directory, so by defining a function and running it over every file, we can interrogate the whole collection and learn something about it. A function also guarantees that exactly the same thing happens each time.
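In miniature, wrapping repeated work in a function looks like this (the function name and sample dates are invented):

```python
# Define once, reuse on every input: each call runs exactly the same code.
def year_of(date):
    """Return the year part of a 'day/month/year' date string."""
    return date.split('/')[-1]

# Apply the same operation to a whole list of dates.
years = [year_of(d) for d in ['1/6/1975', '12/3/1976']]
```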

In [ ]:
#open the first file, read it and then split it into two parts, metadata and body
text = open(os.path.join('UMA_Fraser_Radio_Talks', os.listdir('UMA_Fraser_Radio_Talks')[0])).read()
data = text.split('end metadata')

In [ ]:
#define a function that breaks up the metadata for each file and gets rid of the whitespace at the start of each element
def parse_metadata(text):
    metadata = {}
    for line in text.split('\r\n'):
        if not line:
            continue
        if line[0] == '<':
            continue
        element = line.split(':', 1)
        #get rid of the whitespace at the start of each value
        metadata[element[0]] = element[-1].strip(' ')
    return metadata

In [ ]:
parse_metadata(data[0])


## Putting it together: exploring multiple files

Now that we're confident that the function works, let's find out a bit about the corpus. As a start, it would be useful to know which years the texts are from. Are they evenly distributed over time? A graph will tell us!
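The tallying that ConditionalFreqDist does can be sketched with the standard library's Counter (the dates are invented; 'c'-prefixed dates mark approximate years and are skipped here):

```python
from collections import Counter

# Tally speeches per year, skipping approximate ('c'-prefixed) dates.
# The year is the final part of a 'day/month/year' string.
dates = ['1/6/1975', '12/3/1976', '4/4/1975', 'c1976']
per_year = Counter(d.split('/')[-1] for d in dates if not d.startswith('c'))
```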

In [ ]:
#import conditional frequency distribution
from nltk.probability import ConditionalFreqDist
cfdist = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    #split text of file on 'end metadata'
    data = text.split('end metadata')
    date = parse_metadata(data[0])['Date']
    #skip all speeches for which there is no exact date ('c' marks circa dates)
    if date[0] == 'c':
        continue
    #build a frequency distribution by year, that is, the final bit of the 'Date' string after '/'
    cfdist['count'][date.split('/')[-1]] += 1
cfdist.plot()

In [ ]:
cfdistA = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    #split text of file on 'end metadata'
    data = text.split('end metadata')
    date = parse_metadata(data[0])['Date']
    #this time keep circa dates too: 'c1975' becomes '1975'
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    if year:
        cfdistA['count'][year] += 1
cfdistA.plot()

In [ ]:
cfdist2 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    #split text of file on 'end metadata'
    data = text.split('end metadata')
    metadata = parse_metadata(data[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c':
        continue
    #build a frequency distribution by 'Description'
    cfdist2['count'][metadata['Description']] += 1
cfdist2.plot()


Previously, we tokenized the whole text of a file. Let's now tokenize just the body of the file, not the metadata. As an exercise, let's see how often the modal verbs 'must', 'should' and 'will' occur in the text.
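Counting a handful of modal verbs in a token list is just repeated calls to list.count (the tokens here are hand-made):

```python
# Count each modal of interest in a token list.
tokens = ['we', 'must', 'act', 'and', 'we', 'will', 'act', 'as', 'we', 'should']
modal_counts = {m: tokens.count(m) for m in ('must', 'should', 'will')}
```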

In [ ]:
#tokenize the body of the text so that we can start to analyse it
tokens = word_tokenize(data[1])
tokens.count('should')


For each file, tokenize the body, then count how often 'must', 'will' and 'should' occur in it.

In [ ]:
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    #split text of file on 'end metadata'
    data = text.split('end metadata')
    metadata = parse_metadata(data[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c':
        continue
    #tokenise the text of the speech
    tokens = word_tokenize(data[1].decode('ISO-8859-1'))
    #show the date of each speech and how often 'should', 'must' and 'will' are used in it
    print metadata['Date'], ',', tokens.count('should'), ',', tokens.count('must'), ',', tokens.count('will')


And graph that

In [ ]:
cfdist3 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    data = text.split('end metadata')
    date = parse_metadata(data[0])['Date']
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    #leave out 1966
    if year == '1966':
        continue
    tokens = word_tokenize(data[1].decode('ISO-8859-1'))
    cfdist3['should'][year] += tokens.count('should')
    cfdist3['will'][year] += tokens.count('will')
    cfdist3['must'][year] += tokens.count('must')
cfdist3.plot()

In [ ]:
cfdist3 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open(os.path.join('UMA_Fraser_Radio_Talks', filename)).read()
    data = text.split('end metadata')
    date = parse_metadata(data[0])['Date']
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    if year == '1966':
        continue
    tokens = word_tokenize(data[1].decode('ISO-8859-1'))
    if len(tokens) == 0:
        continue
    #normalise by speech length, so that long speeches don't dominate
    cfdist3['should'][year] += tokens.count('should') / len(tokens)
    cfdist3['will'][year] += tokens.count('will') / len(tokens)
    cfdist3['must'][year] += tokens.count('must') / len(tokens)
cfdist3.plot()
