Getting and setting the working directory, equivalent to R's getwd() and setwd() commands
This is the general Python way. In IPython we can also run system commands directly by prefixing them with an exclamation mark (!)
import os
os.getcwd()
'/Users/erriza/dataanalysis/week2'
os.chdir('..')
os.getcwd()
'/Users/erriza/dataanalysis'
os.chdir('./week2/')
os.getcwd()
'/Users/erriza/dataanalysis/week2'
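Since os.chdir() changes global state for the whole process, one sketch worth knowing (not from the original notes; the helper name is my own) is a small context manager that restores the previous directory automatically:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the working directory, restoring it afterwards."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)

# Demonstrate with a throwaway temporary directory.
before = os.getcwd()
target = tempfile.mkdtemp()
with working_directory(target):
    inside = os.getcwd()
# back in the original directory here
```

This avoids the getcwd()/chdir('..') bookkeeping shown above when you only need to work somewhere else briefly.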
Load CSV data. We'll mostly use pandas.
import pandas as pd
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD'
# we can directly use read_csv to download the file
# this is equivalent to R's combined download.file() and read.table() or read.csv() commands
cameraData = pd.read_csv(fileUrl)
# save data locally
cameraData.to_csv('../data/cameras.csv', index=False)
# for simplicity I'll use IPython tricks to list folder contents
!ls ../data
# get current date and time
# this is equivalent to R date() command
# note that I use IPython ! prefix to run my system's command
dateDownloaded = !date
print '\nDate downloaded: ' + str(dateDownloaded)
cameraData.head()
camera.xls           face.rda    loansData.csv   samsungData.csv
camera.xlsx          gaData.csv  movies.txt      samsungData.rda
cameras.csv          gaData.rda  ravensData.csv  ss06pid.csv
camerasModified.csv  galton.csv  ravensData.rda  warpbreaks.csv

Date downloaded: ['Mon Mar 18 21:29:11 CET 2013']
|   | address | direction | street | crossStreet | intersection | Location 1 |
|---|---|---|---|---|---|---|
| 0 | S CATON AVE & BENSON AVE | N/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
| 1 | S CATON AVE & BENSON AVE | S/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
| 2 | WILKENS AVE & PINE HEIGHTS AVE | E/B | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
| 3 | THE ALAMEDA & E 33RD ST | S/B | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
| 4 | E 33RD ST & THE ALAMEDA | E/B | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
Read Excel file
This is equivalent to R's read.xlsx() and read.xlsx2() commands.
We need the openpyxl 1.5.8 (don't use the latest version due to a bug) and xlrd packages. Install them with:
sudo pip install openpyxl==1.5.8
sudo pip install xlrd
Pandas' ExcelFile() can't download and read in one step (in contrast to read_csv()), so we need to fall back to the basic Python way. Also note that I'm using .xls; .xlsx doesn't work on my machine.
import urllib2
# download the file as camera.xls and save it in ./data subfolder
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xls?accessType=DOWNLOAD'
f = urllib2.urlopen(fileUrl)
data = f.read()
with open('../data/camera.xls', 'wb') as w:
    w.write(data)
# load the Excel file as a pandas DataFrame
cameraData = pd.ExcelFile('../data/camera.xls')
cameraData = cameraData.parse('Baltimore Fixed Speed Cameras', index_col=None, na_values=['NA'])
cameraData.head()
address | direction | street | crossStreet | intersection | Location 1 | |
---|---|---|---|---|---|---|
0 | S CATON AVE & BENSON AVE | N/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
1 | S CATON AVE & BENSON AVE | S/B | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
2 | WILKENS AVE & PINE HEIGHTS AVE | E/B | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
3 | THE ALAMEDA & E 33RD ST | S/B | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
4 | E 33RD ST & THE ALAMEDA | E/B | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
The course video describes R's readLines() for reading a text file, which is similar to standard Python file access, so I won't detail it here. Similarly, using R's readLines() to read data from a website is comparable to Python with the urllib2 package, as in the xls example above.
Read JSON file
This is equivalent to R's fromJSON() command.
import json
# first we get the json file from the website
fileUrl = 'https://data.baltimorecity.gov/api/views/dz54-2aru/rows.json?accessType=DOWNLOAD'
req = urllib2.Request(fileUrl)
opener = urllib2.build_opener()
f = opener.open(req)
# then we read it into a data structure
jsonCamera = json.loads(f.read())
# the json is loaded as a dictionary
print jsonCamera['meta']['view']['id']
print jsonCamera['meta']['view']['name']
print jsonCamera['meta']['view']['attribution']
dz54-2aru
Baltimore Fixed Speed Cameras
Department of Transportation
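The actual camera records live under the top-level 'data' key of the response, with the metadata under 'meta'. A sketch of navigating that structure, using a simplified stand-in string for the real download (the literal below is hypothetical, not the actual API response):

```python
import json

# A simplified stand-in for the structure of the Socrata JSON response.
raw = '''{"meta": {"view": {"id": "dz54-2aru",
                            "name": "Baltimore Fixed Speed Cameras"}},
          "data": [["S CATON AVE & BENSON AVE", "N/B"],
                   ["WILKENS AVE & PINE HEIGHTS AVE", "E/B"]]}'''

parsed = json.loads(raw)
rows = parsed['data']  # each entry is one camera record
print(parsed['meta']['view']['id'])
print(len(rows))
```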
Writing data
# first read the csv file
cameraData = pd.read_csv('../data/cameras.csv')
# take a subset of the columns
tmpData = cameraData.ix[:,2:]
# then save it to a different csv file
# this is equivalent to R's write.table() command
tmpData.to_csv('../data/camerasModified.csv', sep=',', index=False)
cameraData2 = pd.read_csv('../data/camerasModified.csv')
cameraData2.head()
|   | street | crossStreet | intersection | Location 1 |
|---|---|---|---|---|
| 0 | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693779962, -76.6688185297) |
| 1 | Caton Ave | Benson Ave | Caton Ave & Benson Ave | (39.2693157898, -76.6689698176) |
| 2 | Wilkens Ave | Pine Heights | Wilkens Ave & Pine Heights | (39.2720252302, -76.676960806) |
| 3 | The Alameda | 33rd St | The Alameda & 33rd St | (39.3285013141, -76.5953545714) |
| 4 | E 33rd | The Alameda | E 33rd & The Alameda | (39.3283410623, -76.5953594625) |
The course video explains R commands to save and load the workspace. Python has no exact equivalent, though the pickle module covers similar ground for individual objects.
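A minimal sketch of the closest Python analogue: pickling selected objects to disk and loading them back (the file name and the objects are hypothetical):

```python
import pickle

# Save a few objects to disk, roughly analogous to R's save()/load().
workspace = {'x': [1, 2, 3], 'label': 'cameras'}
with open('workspace.pkl', 'wb') as f:
    pickle.dump(workspace, f)

# Later, restore them from the file.
with open('workspace.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored['x'])
```

Unlike R's save.image(), this only persists what you explicitly pass to pickle.dump(), not the whole session.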
The course video explains R's paste() and paste0() commands, which look like standard Python string manipulation:
print ['../data' + str(i) + '.csv' for i in range(1, 6)]
['../data1.csv', '../data2.csv', '../data3.csv', '../data4.csv', '../data5.csv']
Getting data off webpages
from lxml.html import parse
url = 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'
# this is equivalent to R's combined connection open/read/close and htmlTreeParse() commands
html3 = parse(url).getroot()
# get the title text using an XPath expression
# this is equivalent to R's xpathSApply() command
title = html3.xpath('//title')
print [x.text_content() for x in title]
# get the text of the col-citedby elements using an XPath expression
citedby = html3.xpath("//td[@id='col-citedby']")
print [x.text_content() for x in citedby]
['Jeff Leek - Google Scholar Citations']
['Cited by', '344', '183', '147', '143', '111', '96', '87', '80', '59', '18', '11', '10', '10', '8', '8', '8', '7', '6', '5', '3']