Getting and setting the working directory, equivalent to R's getwd() and setwd() commands
This is the general Python way. In IPython we can also run system commands directly by prefixing them with an exclamation mark (!)
import os
os.getcwd()
'/Users/erriza/dataanalysis/week2'
os.chdir('..')
os.getcwd()
'/Users/erriza/dataanalysis'
os.chdir('./week2/')
os.getcwd()
'/Users/erriza/dataanalysis/week2'
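Since os.chdir() changes global state for the whole process, one sketch worth knowing (not from the original notes; the helper name is my own) is a small context manager that restores the previous directory automatically:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)

# Demonstrate with a throwaway temporary directory.
before = os.getcwd()
target = tempfile.mkdtemp()
with working_directory(target):
    inside = os.getcwd()
# back in the original directory here
```

This avoids the getcwd()/chdir('..') bookkeeping shown above when you only need to work somewhere else briefly.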
Load CSV data. We'll mostly use pandas.
import pandas as pd
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD'
# we can directly use read_csv to download the file
# this is equivalent to R's combined download.file() and read.table() or read.csv() commands
cameraData = pd.read_csv(fileUrl)
# save data locally
cameraData.to_csv('../data/cameras.csv', index=False)
# for simplicity I'll use IPython tricks to list folder contents
!ls ../data
# get current date and time
# this is equivalent to R date() command
# note that I use IPython ! prefix to run my system's command
dateDownloaded = !date
print '\nDate downloaded: ' + str(dateDownloaded)
cameraData.head()
camera.xls           face.rda    loansData.csv   samsungData.csv
camera.xlsx          gaData.csv  movies.txt      samsungData.rda
cameras.csv          gaData.rda  ravensData.csv  ss06pid.csv
camerasModified.csv  galton.csv  ravensData.rda  warpbreaks.csv

Date downloaded: ['Mon Mar 18 21:29:11 CET 2013']
|   | address | direction | street | crossStreet | intersection | Location 1 |
|---|---|---|---|---|---|---|
| 0 | S CATON AVE & BENSON AVE | N/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
| 1 | S CATON AVE & BENSON AVE | S/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
| 2 | WILKENS AVE & PINE HEIGHTS AVE | E/B | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
| 3 | THE ALAMEDA & E 33RD ST | S/B | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
| 4 | E 33RD ST & THE ALAMEDA | E/B | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
Read Excel file
This is equivalent to R's read.xlsx() and read.xlsx2() commands.
We need the openpyxl 1.5.8 (don't use the latest version due to a bug) and xlrd packages. Install them with:
sudo pip install openpyxl==1.5.8
sudo pip install xlrd
Pandas' ExcelFile() can't download and read in one step (in contrast to read_csv()), so we need to fall back to the basic Python way. Also note that I'm using .xls; .xlsx doesn't work on my machine.
import urllib2
# download the file as camera.xls and save it in ./data subfolder
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xls?accessType=DOWNLOAD'
f = urllib2.urlopen(fileUrl)
data = f.read()
with open('../data/camera.xls', 'wb') as w:
    w.write(data)
# load the Excel file as a pandas DataFrame
cameraData = pd.ExcelFile('../data/camera.xls')
cameraData = cameraData.parse('Baltimore Fixed Speed Cameras', index_col=None, na_values=['NA'])
cameraData.head()
address | direction | street | crossStreet | intersection | Location 1 | |
---|---|---|---|---|---|---|
0 | S CATON AVE & BENSON AVE | N/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
1 | S CATON AVE & BENSON AVE | S/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
2 | WILKENS AVE & PINE HEIGHTS AVE | E/B | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
3 | THE ALAMEDA & E 33RD ST | S/B | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
4 | E 33RD ST & THE ALAMEDA | E/B | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
The course video describes R's readLines() for reading a text file, which is similar to standard Python file access, so I won't detail it here. Similarly, using R's readLines() to read data from a website is comparable to Python with the urllib2 package, as in the xls example above.
Read JSON file
This is equivalent to R's fromJSON() command.
import json
# first we get the json file from the website
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.json?accessType=DOWNLOAD'
req = urllib2.Request(fileUrl)
opener = urllib2.build_opener()
f = opener.open(req)
# then we read it into a data structure
jsonCamera = json.loads(f.read())
# the json is loaded as a dictionary
print jsonCamera['meta']['view']['id']
print jsonCamera['meta']['view']['name']
print jsonCamera['meta']['view']['attribution']
dz54-2aru
Baltimore Fixed Speed Cameras
Department of Transportation
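The actual camera records live under the top-level 'data' key of the response, with the metadata under 'meta'. A sketch of navigating that structure, using a simplified stand-in string for the real download (the literal below is hypothetical, not the actual API response):

```python
import json

# A simplified stand-in for the structure of the Socrata JSON response.
raw = '''{"meta": {"view": {"id": "dz54-2aru",
                            "name": "Baltimore Fixed Speed Cameras"}},
          "data": [["S CATON AVE & BENSON AVE", "N/B"],
                   ["WILKENS AVE & PINE HEIGHTS AVE", "E/B"]]}'''

parsed = json.loads(raw)
rows = parsed['data']  # each entry is one camera record
print(parsed['meta']['view']['id'])
print(len(rows))
```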
Writing data
# first read the csv file
cameraData = pd.read_csv('../data/cameras.csv')
# take a subset of the columns
tmpData = cameraData.ix[:,2:]
# then save it to a different csv file
# this is equivalent to R's write.table() command
tmpData.to_csv('../data/camerasModified.csv', sep=',', index=False)
cameraData2 = pd.read_csv('../data/camerasModified.csv')
cameraData2.head()
|   | street | crossStreet | intersection | Location 1 |
|---|---|---|---|---|
| 0 | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
| 1 | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
| 2 | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
| 3 | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
| 4 | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
The course video explains R commands to save and load the workspace. Python has no exact equivalent, though the pickle module covers similar ground for individual objects.
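A minimal sketch of the closest Python analogue: pickling selected objects to disk and loading them back (the file name and the objects are hypothetical):

```python
import pickle

# Save a few objects to disk, roughly analogous to R's save()/load().
workspace = {'x': [1, 2, 3], 'label': 'cameras'}
with open('workspace.pkl', 'wb') as f:
    pickle.dump(workspace, f)

# Later, restore them from the file.
with open('workspace.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored['x'])
```

Unlike R's save.image(), this only persists what you explicitly pass to pickle.dump(), not the whole session.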
The course video explains R's paste() and paste0() commands, which look like standard Python string manipulation:
print ['../data' + str(i) + '.csv' for i in range(1, 6)]
['../data1.csv', '../data2.csv', '../data3.csv', '../data4.csv', '../data5.csv']
Getting data off webpages
from lxml.html import parse
url = 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'
# this is equivalent to R's combined connection open/read/close and htmlTreeParse() commands
html3 = parse(url).getroot()
# get the title text using an XPath expression
# this is equivalent to R's xpathSApply() command
title = html3.xpath('//title')
print [x.text_content() for x in title]
# get the text of the col-citedby elements using an XPath expression
citedby = html3.xpath("//td[@id='col-citedby']")
print [x.text_content() for x in citedby]
['Jeff Leek - Google Scholar Citations']
['Cited by', '344', '183', '147', '143', '111', '96', '87', '80', '59', '18', '11', '10', '10', '8', '8', '8', '7', '6', '5', '3']