Notebook

File formats¶

In the block below, change to the directory the zip file you just downloaded, which contained this ipython notebook.

In [1]:

import os
#os.chdir('/path/to/extracted/fileformat.zip')

'''
os.chdir expects a string object as an argument
the string can be either a path relative to your current working directory or an absolute path
I'm passing an absolute path below:
'''

os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')

'''
os.listdir also expects a string object
os.getcwd() returns the current working path as a string object
so, you can pass os.getcwd() as an argument to os.listdir...when the code is interpreted, python will handle the returns for you
note that os.listdir() returns a list of strings, where each element in the list is a string representing the filename of the
files in the directory 
'''
os.listdir(os.getcwd())

Out[1]:

['.DS_Store',
 'amazon',
 'example.xls',
 'example.xlsx',
 'fileformats_lecture.ipynb',
 'jstor.zip',
 'syllabus.docx']

CSV¶

Example 1¶

In [2]:

'''
csv is a module in the python standard library
you need to import csv in order to gain access to the methods and objects available to you in the csv module
'''

import csv

'''
this code will print the first line in the file 'sociology_2010.csv'
because we are using csv.reader and the file has field names, the first line are field names
'''

with open('amazon/sociology_2010.csv', 'rb') as csvfile: # this will automatically call csvfile.close() when the code below completes
    amazon_reader = csv.reader(csvfile) #this instantiates a csv.reader object called "amazon_reader".  
    #csv.reader will preloaded with the values from the file argument "csvfile", which is the data from the file "sociology_2010.csv" above.
    for row in amazon_reader: # amazon_reader is an iterable object, so you can write the expression "for VALUE in ITERABLE" and whatever code you write that
        #affects VALUE will affect every iterable element in the iterable object (in this case, the csv.reader)
        print row #prints whatever is the value of row
        break # this just breaks after one iteration of the loop.  the file is big, so i didn't want it to fill the screen

['doi', 'title', 'amazon_id', 'review_text']

Example 2¶

In [3]:

import csv
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',', quotechar='\"') #a DictReader has more methods available to it than a regular csv.reader()
    #though csv.readers also can have delimiters and quotechars specified...this stuff is listed in the python docs online.
    for row in amazon_reader:
        print row #because this is a DictReader, every row in the csv is represented as a dictionary, not as a list (as per the regular csv.reader)
        break

{'review_text': "This book was given to me by my best friend after finishing college. I will always treasure this thoughtful and special gift.<p>_Girlfriends_ is a collection of stories that explore and celebrate female friendship through the eyes, ears, and hearts of everyday women.  Some of the women were friends for a lifetime, others for a short time. However, all understood and/or demonstrated the meaning of &quot;true friendship.&quot; For example, the stories included everything from the thankful musings of a once-ill woman about the extraordinaty kindness of her girlfriends to a giggly account of how two eerily-simiar best friends met as assigned roomates their first day of college. (The latter tale struck very close to home in a wonderfully spooky way.)<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the &quot;Chicken Soup for the Soul&quot; series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.)  Rather, I appreciated the &quot;real&quot; tone of the stories, as they read like good conversation shared over a nice pot of Hazlenut coffee.<p>Some readers have commented on the book's simple language and lack of depth. I don't think the goal here was to explore the psychology of friendship, rather I think it was intended to be a simple and beautiful celebration meant to be enjoyed by &quot;Girlfriends&quot; everywhere.  Enjoy!", 'doi': '10.2307/25677732', 'amazon_id': '1885171080', 'title': 'Invisible Girlfriend'}

In [4]:

print amazon_reader.fieldnames #DictReaders have a member method "fieldnames" which returns a list of strings corresponding to the first row of the file

['doi', 'title', 'amazon_id', 'review_text']

In [15]:

with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    for row in amazon_reader:
        print row['title'] #because they are dicts, you can refer to the value by the key; in this case, the "title" field
        break
    #print [row['title'] for row in amazon_reader]
    

Invisible Girlfriend

Exercises on your own time¶

Create a list of the text from all Amazon reviews
Create a list of dictionaries from all Amazon reviews, where each dictionary is a mapping of unique ID to text body (hint, {doi: review_text})

In [79]:

#solution 1
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_reviews = [row['review_text'] for row in amazon_reader]

print len(amazon_reviews)

#solution 2
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
    amazon_reader = csv.DictReader(csvfile, delimiter=',')
    amazon_review_dicts = [{row['doi']: row['review_text']} for row in amazon_reader]

    
    
#doc_tf_vectors = [tf(term, text) for term, text in amazon_reviews]

381
0.366204 <= #1
0.000000 <= TEXT
0.000000 <= DOCUMENT

Microsoft Excel¶

openpyxl: working with `.xlsx` files¶

You should be able to install openpyxl if it is not on your machine already using either pip or easy_install. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for "openpyxl" and install. Otherwise, enter one of the following in the shell (I prefer pip).

sudo pip install openpyxl

sudo easy_install openpyxl

Example 3¶

In [71]:

from openpyxl import load_workbook
import csv, sys

wb = load_workbook('example.xlsx')
print type(wb)
#print dir(wb)
# create object from xlsx workbook
wb.get_sheet_names() # print out all sheet names
print type(wb.get_sheet_names())
print type(wb.get_sheet_names()[0])
print type(wb.worksheets)
for sheet in wb.worksheets: 
    print type(sheet)
#	print sheet.title # another way of printing all sheet names

<class 'openpyxl.workbook.Workbook'>
<type 'list'>
<type 'str'>
<type 'list'>
<class 'openpyxl.worksheet.Worksheet'>
<class 'openpyxl.worksheet.Worksheet'>

In [73]:

sheet1 = wb.worksheets[0]
print sheet1# grab the first sheet (also look up .get_sheet_by_name())
print sheet1.cell('A1').value # print the text value of cell A1

<Worksheet "First">
This is the first cell, A1, sheet "First"

In [8]:

row2 = sheet1.rows[1] # create tuple of Cell objects
for cell in row2:
	print cell.value # print all the text values of every cell in the row

THis is the first column, second row A2, sheet "First"
THis is the second column, second row B2,  sheet "First"
THis is the third column, second row C2,  sheet "First"

Exercises on your own time:

.get_sheet_names() is a member method of Workbook objects. What are some other methods for Workbook objects? (hint, what does dir(wb) return?)
Take one of the methods that you discovered in question 1. What kind of objects do they return?
You can learn a lot more about openpyxl by reading the documentation

In [35]:

print dir(wb)

print wb.encoding
print type(wb.encoding)
print wb.properties
print dir(wb.properties)
print wb.properties.creator

['_Workbook__optimized_read', '_Workbook__optimized_write', '_Workbook__thread_local_data', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_active_sheet_index', '_local_data', '_named_ranges', '_set_optimized_read', 'add_named_range', 'add_sheet', 'create_named_range', 'create_sheet', 'encoding', 'excel_base_date', 'get_active_sheet', 'get_index', 'get_named_range', 'get_named_ranges', 'get_sheet_by_name', 'get_sheet_names', 'loaded_theme', 'properties', 'remove_named_range', 'remove_sheet', 'save', 'security', 'strings_table_builder', 'style', 'worksheets']
utf-8
<type 'str'>
<openpyxl.workbook.DocumentProperties object at 0x10e4b9510>
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'category', 'company', 'created', 'creator', 'description', 'excel_base_date', 'keywords', 'last_modified_by', 'modified', 'subject', 'title']
Unknown

xlutils: working with `.xls` files¶

You should be able to install xlutils if it is not on your machine already using either pip or easy_install. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for "xlutils" and install. Otherwise, enter one of the following in the shell (I prefer pip).

sudo pip install xlutils

sudo easy_install xlutils

Example 4¶

In [36]:

from xlrd import open_workbook

wb = open_workbook('example.xls') # create object from xls workbook
print wb.sheets()
for sheet in wb.sheets():
    print sheet.name # printing all sheet names

[<xlrd.sheet.Sheet object at 0x10e4b92d0>, <xlrd.sheet.Sheet object at 0x101067850>]
First
Second

In [10]:

sheet1 = wb.sheets()[0] # grab the first sheet, also look up .sheet_by_index(NUM)
print sheet1.cell(1,1)# print the text value of cell A1
row2 = sheet1.row(1) # create tuple of Cell objects

for cell in row2:
    print cell.value # print all the text values of every cell in the row

text:u'THis is the second column, second row B2,  sheet "First"'
THis is the first column, second row A2, sheet "First"
THis is the second column, second row B2,  sheet "First"
THis is the third column, second row C2,  sheet "First"

Exercises on your own time:

.sheets() is actually a method that returns a list. What are other methods available to Book objects?
.cell(), .row() and .value are actually methods too. What do they return?

You can learn more about xlutils in the documentation.

In [56]:

print dir(wb)

print wb.encoding
print wb.nsheets
print type(wb.sheets())
print type(wb.sheets()[0])
print type(sheet1.row(1))
print type(row2[0])

['__class__', '__delattr__', '__dict__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_sheets_count', '_all_sheets_map', '_externsheet_info', '_externsheet_type_b57', '_extnsht_count', '_extnsht_name_from_num', '_position', '_repr_these', '_resources_released', '_rich_text_runlist_map', '_sh_abs_posn', '_sharedstrings', '_sheet_list', '_sheet_names', '_sheet_num_from_name', '_sheet_visibility', '_sheethdr_count', '_supbook_addins_inx', '_supbook_count', '_supbook_locals_inx', '_supbook_types', '_xf_epilogue_done', '_xf_index_to_xl_type_map', 'actualfmtcount', 'addin_func_names', 'base', 'biff2_8_load', 'biff_version', 'builtinfmtcount', 'codepage', 'colour_indexes_used', 'colour_map', 'countries', 'datemode', 'derive_encoding', 'dump', 'encoding', 'encoding_override', 'fake_globals_get_sheet', 'filestr', 'font_list', 'format_list', 'format_map', 'formatting_info', 'get2bytes', 'get_record_parts', 'get_record_parts_conditional', 'get_sheet', 'get_sheets', 'getbof', 'handle_boundsheet', 'handle_builtinfmtcount', 'handle_codepage', 'handle_country', 'handle_datemode', 'handle_efont', 'handle_externname', 'handle_externsheet', 'handle_filepass', 'handle_font', 'handle_format', 'handle_name', 'handle_obj', 'handle_palette', 'handle_sheethdr', 'handle_sheetsoffset', 'handle_sst', 'handle_style', 'handle_supbook', 'handle_writeaccess', 'handle_xf', 'initialise_format_info', 'is_date_format_string', 'load_time_stage_1', 'load_time_stage_2', 'logfile', 'mem', 'name_and_scope_map', 'name_map', 'name_obj_list', 'names_epilogue', 'nsheets', 'on_demand', 'palette_epilogue', 'palette_record', 'parse_globals', 'pickleable', 'ragged_rows', 'raw_user_name', 'read', 'release_resources', 'sheet_by_index', 'sheet_by_name', 'sheet_loaded', 'sheet_names', 'sheets', 'stream_len', 'style_name_map', 'unload_sheet', 'use_mmap', 'user_name', 'verbosity', 'xf_epilogue', 'xf_list', 'xfcount']
utf_16_le
2
<type 'list'>
<class 'xlrd.sheet.Sheet'>
<type 'list'>
<class 'xlrd.sheet.Cell'>

Microsoft Word¶

Instead of a pile of Excel files, maybe you have a collection of Word files. Let's say you've got a bunch of student essays, or maybe first person reports, or government reports, etc.

If you are lucky, you will be working with .docx files. If you aren't working with .docx, you might want to try wv, but to be honest, it would be better to try to convert them to .docx using a macro or some other process (this could be a very good teach-yourself-python project!).

Why .docx? Office 2007 and later use formats called Office Open XML. .docx and .xlsx and all those other formats are actually zipped xml files. That's what the x at the end stands for.

That means you can use a third-party python library that can parse XML files (you'll learn more about parsing in the Web Scraping module). Here's some example code using python-docx

How to install python-docx

This module is not in pypi, which means that you have to install it manually. On the page linked above, you can either git clone the file or just download the zipped archive.

Unzip the archive somewhere and navigate to that directory in the shell

In the shell, type the following:

~$ python setup.py install

If you aren't using the Enthought Python distribution, you will probably need to install the following third-party libraries which are in pypi

~$ sudo pip install lxml

~$ sudo pip install PIL

~$ sudo easy_install lxml

~$ sudo easy_install PIL

Using python-docx¶

This is actually a pretty easy module to use. Check out the example code.

Example 5¶

In [11]:

from docx import opendocx, getdocumenttext

document = opendocx('syllabus.docx')
print type(document)
paragraphs = getdocumenttext(document)
print type(paragraphs)
print len(paragraphs)
print paragraphs[0]

<type 'lxml.etree._Element'>
<type 'list'>
74
IRiSS Computational Social Science Summer Workshop Schedule 2013

In [12]:

print type(paragraphs[0])

<type 'unicode'>

In [13]:

print type(paragraphs[0].encode('utf-8'))
print paragraphs[0]

<type 'str'>
IRiSS Computational Social Science Summer Workshop Schedule 2013

Tarfiles and Zipfiles¶

Example 6¶

In [14]:

import zipfile
import os

zfile = zipfile.ZipFile("jstor.zip")
dirname = 'jstor'

print os.listdir(os.getcwd())

#this would be better as try-except, but that's for future learning
if os.path.isdir(dirname):
    pass
else:
    os.makedirs(dirname)

print 'Files in directory '+ dirname + ': '+ ', '.join(os.listdir(os.path.join(os.getcwd(), dirname)))

for name in zfile.namelist():
    print 'Extracting file ' + name + '...'
    fd = open(os.path.join(dirname, name), "w")
    fd.write(zfile.read(name))
    fd.close()

print 'Files in directory '+ dirname + ': '+ ','.join(os.listdir(os.path.join(os.getcwd(), dirname)))

['.DS_Store', 'amazon', 'example.xls', 'example.xlsx', 'fileformats_lecture.ipynb', 'jstor.zip', 'syllabus.docx']
Files in directory jstor: 
Extracting file biologicalsciences-discipline-2010-unigrams.txt...
Extracting file literature-discipline-2010-unigrams.txt...
Extracting file sociology-discipline-2010-unigrams.txt...
Files in directory jstor: biologicalsciences-discipline-2010-unigrams.txt,literature-discipline-2010-unigrams.txt,sociology-discipline-2010-unigrams.txt