In the block below, change to the directory the zip file you just downloaded, which contained this ipython notebook.
import os
#os.chdir('/path/to/extracted/fileformat.zip')
'''
os.chdir expects a string object as an argument
the string can be either a path relative to your current working directory or an absolute path
I'm passing an absolute path below:
'''
os.chdir('/Users/rweiss/Dropbox/presentations/IRiSS2013/text1/fileformats/')
'''
os.listdir also expects a string object
os.getcwd() returns the current working path as a string object
so, you can pass os.getcwd() as an argument to os.listdir...when the code is interpreted, python will handle the returns for you
note that os.listdir() returns a list of strings, where each element in the list is a string representing the filename of the
files in the directory
'''
os.listdir(os.getcwd())
['.DS_Store', 'amazon', 'example.xls', 'example.xlsx', 'fileformats_lecture.ipynb', 'jstor.zip', 'syllabus.docx']
'''
csv is a module in the python standard library
you need to import csv in order to gain access to the methods and objects available to you in the csv module
'''
import csv
'''
this code will print the first line in the file 'sociology_2010.csv'
because we are using csv.reader and the file has field names, the first line are field names
'''
with open('amazon/sociology_2010.csv', 'rb') as csvfile: # this will automatically call csvfile.close() when the code below completes
amazon_reader = csv.reader(csvfile) #this instantiates a csv.reader object called "amazon_reader".
#csv.reader will preloaded with the values from the file argument "csvfile", which is the data from the file "sociology_2010.csv" above.
for row in amazon_reader: # amazon_reader is an iterable object, so you can write the expression "for VALUE in ITERABLE" and whatever code you write that
#affects VALUE will affect every iterable element in the iterable object (in this case, the csv.reader)
print row #prints whatever is the value of row
break # this just breaks after one iteration of the loop. the file is big, so i didn't want it to fill the screen
['doi', 'title', 'amazon_id', 'review_text']
import csv
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
amazon_reader = csv.DictReader(csvfile, delimiter=',', quotechar='\"') #a DictReader has more methods available to it than a regular csv.reader()
#though csv.readers also can have delimiters and quotechars specified...this stuff is listed in the python docs online.
for row in amazon_reader:
print row #because this is a DictReader, every row in the csv is represented as a dictionary, not as a list (as per the regular csv.reader)
break
{'review_text': "This book was given to me by my best friend after finishing college. I will always treasure this thoughtful and special gift.<p>_Girlfriends_ is a collection of stories that explore and celebrate female friendship through the eyes, ears, and hearts of everyday women. Some of the women were friends for a lifetime, others for a short time. However, all understood and/or demonstrated the meaning of "true friendship." For example, the stories included everything from the thankful musings of a once-ill woman about the extraordinaty kindness of her girlfriends to a giggly account of how two eerily-simiar best friends met as assigned roomates their first day of college. (The latter tale struck very close to home in a wonderfully spooky way.)<p>While many of the stories tugged at the heartstrings, I never felt manipulated by the authors. (Note: Part of the reason why I don't like the "Chicken Soup for the Soul" series is that I feel that the authors are just dying to make the reader clutch for the box of tissues.) Rather, I appreciated the "real" tone of the stories, as they read like good conversation shared over a nice pot of Hazlenut coffee.<p>Some readers have commented on the book's simple language and lack of depth. I don't think the goal here was to explore the psychology of friendship, rather I think it was intended to be a simple and beautiful celebration meant to be enjoyed by "Girlfriends" everywhere. Enjoy!", 'doi': '10.2307/25677732', 'amazon_id': '1885171080', 'title': 'Invisible Girlfriend'}
print amazon_reader.fieldnames #DictReaders have a member method "fieldnames" which returns a list of strings corresponding to the first row of the file
['doi', 'title', 'amazon_id', 'review_text']
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
amazon_reader = csv.DictReader(csvfile, delimiter=',')
for row in amazon_reader:
print row['title'] #because they are dicts, you can refer to the value by the key; in this case, the "title" field
break
#print [row['title'] for row in amazon_reader]
Invisible Girlfriend
#solution 1
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
amazon_reader = csv.DictReader(csvfile, delimiter=',')
amazon_reviews = [row['review_text'] for row in amazon_reader]
print len(amazon_reviews)
#solution 2
with open('amazon/sociology_2010.csv', 'rb') as csvfile:
amazon_reader = csv.DictReader(csvfile, delimiter=',')
amazon_review_dicts = [{row['doi']: row['review_text']} for row in amazon_reader]
#doc_tf_vectors = [tf(term, text) for term, text in amazon_reviews]
381 0.366204 <= #1 0.000000 <= TEXT 0.000000 <= DOCUMENT
.xlsx
files¶You should be able to install openpyxl
if it is not on your machine already using either pip
or easy_install
. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for "openpyxl" and install. Otherwise, enter one of the following in the shell (I prefer pip
).
sudo pip install openpyxl
sudo easy_install openpyxl
from openpyxl import load_workbook
import csv, sys
wb = load_workbook('example.xlsx')
print type(wb)
#print dir(wb)
# create object from xlsx workbook
wb.get_sheet_names() # print out all sheet names
print type(wb.get_sheet_names())
print type(wb.get_sheet_names()[0])
print type(wb.worksheets)
for sheet in wb.worksheets:
print type(sheet)
# print sheet.title # another way of printing all sheet names
<class 'openpyxl.workbook.Workbook'> <type 'list'> <type 'str'> <type 'list'> <class 'openpyxl.worksheet.Worksheet'> <class 'openpyxl.worksheet.Worksheet'>
sheet1 = wb.worksheets[0]
print sheet1# grab the first sheet (also look up .get_sheet_by_name())
print sheet1.cell('A1').value # print the text value of cell A1
<Worksheet "First"> This is the first cell, A1, sheet "First"
row2 = sheet1.rows[1] # create tuple of Cell objects
for cell in row2:
print cell.value # print all the text values of every cell in the row
THis is the first column, second row A2, sheet "First" THis is the second column, second row B2, sheet "First" THis is the third column, second row C2, sheet "First"
Exercises on your own time:
.get_sheet_names()
is a member method of Workbook
objects. What are some other methods for Workbook
objects? (hint, what does dir(wb)
return?)openpyxl
by reading the documentationprint dir(wb)
print wb.encoding
print type(wb.encoding)
print wb.properties
print dir(wb.properties)
print wb.properties.creator
['_Workbook__optimized_read', '_Workbook__optimized_write', '_Workbook__thread_local_data', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_active_sheet_index', '_local_data', '_named_ranges', '_set_optimized_read', 'add_named_range', 'add_sheet', 'create_named_range', 'create_sheet', 'encoding', 'excel_base_date', 'get_active_sheet', 'get_index', 'get_named_range', 'get_named_ranges', 'get_sheet_by_name', 'get_sheet_names', 'loaded_theme', 'properties', 'remove_named_range', 'remove_sheet', 'save', 'security', 'strings_table_builder', 'style', 'worksheets'] utf-8 <type 'str'> <openpyxl.workbook.DocumentProperties object at 0x10e4b9510> ['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'category', 'company', 'created', 'creator', 'description', 'excel_base_date', 'keywords', 'last_modified_by', 'modified', 'subject', 'title'] Unknown
.xls
files¶You should be able to install xlutils
if it is not on your machine already using either pip
or easy_install
. If you are using Canopy, you can install these through the GUI: Tools > Package Manager > Search for "xlutils" and install. Otherwise, enter one of the following in the shell (I prefer pip
).
sudo pip install xlutils
sudo easy_install xlutils
from xlrd import open_workbook
wb = open_workbook('example.xls') # create object from xls workbook
print wb.sheets()
for sheet in wb.sheets():
print sheet.name # printing all sheet names
[<xlrd.sheet.Sheet object at 0x10e4b92d0>, <xlrd.sheet.Sheet object at 0x101067850>] First Second
sheet1 = wb.sheets()[0] # grab the first sheet, also look up .sheet_by_index(NUM)
print sheet1.cell(1,1)# print the text value of cell A1
row2 = sheet1.row(1) # create tuple of Cell objects
for cell in row2:
print cell.value # print all the text values of every cell in the row
text:u'THis is the second column, second row B2, sheet "First"' THis is the first column, second row A2, sheet "First" THis is the second column, second row B2, sheet "First" THis is the third column, second row C2, sheet "First"
Exercises on your own time:
.sheets()
is actually a method that returns a list. What are other methods available to Book
objects?.cell()
, .row()
and .value
are actually methods too. What do they return?You can learn more about xlutils
in the documentation.
print dir(wb)
print wb.encoding
print wb.nsheets
print type(wb.sheets())
print type(wb.sheets()[0])
print type(sheet1.row(1))
print type(row2[0])
['__class__', '__delattr__', '__dict__', '__doc__', '__enter__', '__exit__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_all_sheets_count', '_all_sheets_map', '_externsheet_info', '_externsheet_type_b57', '_extnsht_count', '_extnsht_name_from_num', '_position', '_repr_these', '_resources_released', '_rich_text_runlist_map', '_sh_abs_posn', '_sharedstrings', '_sheet_list', '_sheet_names', '_sheet_num_from_name', '_sheet_visibility', '_sheethdr_count', '_supbook_addins_inx', '_supbook_count', '_supbook_locals_inx', '_supbook_types', '_xf_epilogue_done', '_xf_index_to_xl_type_map', 'actualfmtcount', 'addin_func_names', 'base', 'biff2_8_load', 'biff_version', 'builtinfmtcount', 'codepage', 'colour_indexes_used', 'colour_map', 'countries', 'datemode', 'derive_encoding', 'dump', 'encoding', 'encoding_override', 'fake_globals_get_sheet', 'filestr', 'font_list', 'format_list', 'format_map', 'formatting_info', 'get2bytes', 'get_record_parts', 'get_record_parts_conditional', 'get_sheet', 'get_sheets', 'getbof', 'handle_boundsheet', 'handle_builtinfmtcount', 'handle_codepage', 'handle_country', 'handle_datemode', 'handle_efont', 'handle_externname', 'handle_externsheet', 'handle_filepass', 'handle_font', 'handle_format', 'handle_name', 'handle_obj', 'handle_palette', 'handle_sheethdr', 'handle_sheetsoffset', 'handle_sst', 'handle_style', 'handle_supbook', 'handle_writeaccess', 'handle_xf', 'initialise_format_info', 'is_date_format_string', 'load_time_stage_1', 'load_time_stage_2', 'logfile', 'mem', 'name_and_scope_map', 'name_map', 'name_obj_list', 'names_epilogue', 'nsheets', 'on_demand', 'palette_epilogue', 'palette_record', 'parse_globals', 'pickleable', 'ragged_rows', 'raw_user_name', 'read', 'release_resources', 'sheet_by_index', 'sheet_by_name', 'sheet_loaded', 'sheet_names', 'sheets', 'stream_len', 'style_name_map', 'unload_sheet', 'use_mmap', 'user_name', 'verbosity', 'xf_epilogue', 'xf_list', 'xfcount'] utf_16_le 2 <type 'list'> <class 'xlrd.sheet.Sheet'> <type 'list'> <class 'xlrd.sheet.Cell'>
Instead of a pile of Excel files, maybe you have a collection of Word files. Let's say you've got a bunch of student essays, or maybe first person reports, or government reports, etc.
If you are lucky, you will be working with .docx
files. If you aren't working with .docx
, you might want to try wv
, but to be honest, it would be better to try to convert them to .docx
using a macro or some other process (this could be a very good teach-yourself-python project!).
Why .docx
? Office 2007 and later use formats called Office Open XML. .docx
and .xlsx
and all those other formats are actually zipped xml files. That's what the x at the end stands for.
That means you can use a third-party python library that can parse XML files (you'll learn more about parsing in the Web Scraping module). Here's some example code using python-docx
How to install
python-docx
This module is not in
pypi
, which means that you have to install it manually. On the page linked above, you can eithergit clone
the file or just download the zipped archive.
- Unzip the archive somewhere and navigate to that directory in the shell
- In the shell, type the following:
~$ python setup.py install
- If you aren't using the Enthought Python distribution, you will probably need to install the following third-party libraries which are in
pypi
~$ sudo pip install lxml
~$ sudo pip install PIL
or
~$ sudo easy_install lxml
~$ sudo easy_install PIL
This is actually a pretty easy module to use. Check out the example code.
from docx import opendocx, getdocumenttext
document = opendocx('syllabus.docx')
print type(document)
paragraphs = getdocumenttext(document)
print type(paragraphs)
print len(paragraphs)
print paragraphs[0]
<type 'lxml.etree._Element'> <type 'list'> 74 IRiSS Computational Social Science Summer Workshop Schedule 2013
print type(paragraphs[0])
<type 'unicode'>
print type(paragraphs[0].encode('utf-8'))
print paragraphs[0]
<type 'str'> IRiSS Computational Social Science Summer Workshop Schedule 2013
import zipfile
import os
zfile = zipfile.ZipFile("jstor.zip")
dirname = 'jstor'
print os.listdir(os.getcwd())
#this would be better as try-except, but that's for future learning
if os.path.isdir(dirname):
pass
else:
os.makedirs(dirname)
print 'Files in directory '+ dirname + ': '+ ', '.join(os.listdir(os.path.join(os.getcwd(), dirname)))
for name in zfile.namelist():
print 'Extracting file ' + name + '...'
fd = open(os.path.join(dirname, name), "w")
fd.write(zfile.read(name))
fd.close()
print 'Files in directory '+ dirname + ': '+ ','.join(os.listdir(os.path.join(os.getcwd(), dirname)))
['.DS_Store', 'amazon', 'example.xls', 'example.xlsx', 'fileformats_lecture.ipynb', 'jstor.zip', 'syllabus.docx'] Files in directory jstor: Extracting file biologicalsciences-discipline-2010-unigrams.txt... Extracting file literature-discipline-2010-unigrams.txt... Extracting file sociology-discipline-2010-unigrams.txt... Files in directory jstor: biologicalsciences-discipline-2010-unigrams.txt,literature-discipline-2010-unigrams.txt,sociology-discipline-2010-unigrams.txt