
Data and Databases: Homework Assignment #2

Please refer to Homework #1 for instructions on how to complete and turn in this homework assignment.

Problem set 1: String types

In the cell below, I've defined a string in a variable called start. Your task is to write an expression (or series of statements) that creates a new string, which contains the character with the next ASCII value for each character in the string. For example, if you start with "abc", you should end up with "bcd": the ASCII value of a is 97, so the next ASCII character is b (98).

As a reminder: the ord() function returns the numerical ASCII value of a given character. (E.g., ord('x') evaluates to 120.) The chr() function does the converse: given an integer value, it returns the ASCII character corresponding to that value. (E.g., chr(120) evaluates to 'x'.)
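
As a quick illustration of that round trip, here is a minimal sketch on a single character. The variable names are placeholders and not part of the assignment.

```python
# ord() maps a one-character string to its integer code;
# chr() maps an integer code back to a character.
code = ord('x')            # 120
next_char = chr(code + 1)  # the character whose code is 121

print(code)       # 120
print(next_char)  # y
```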

Hint: Try writing a list comprehension with start as its source list. (Or otherwise breaking start up into a list.)

Expected output: 'beefs'

In [1]:

start = "adder"
# your code here

This next problem is perhaps significantly less tricky! Add exactly one character to the code in the cell below so that the len() function evaluates to 5 (the expected value). (Hint: You need to somehow change the string's type.)

In [2]:

my_favorite_finnish_name = "Väinö"
len(my_favorite_finnish_name)

Out[2]:

In the next cell, there is a variable surprise that contains a series of bytes. (Just trust me on this---don't worry if you don't quite understand what all of the \x... stuff means.) These bytes contain a UTF-8 encoded string. Write an expression that converts the string of bytes into a Python Unicode string and print the result of this expression.

Expected output: Adorable cat emoji!
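
To show what decoding UTF-8 bytes looks like in general, here is a minimal sketch that uses a different, shorter byte string than surprise. The names raw and text are placeholders. The b prefix and the parenthesized print are there only so the snippet also runs unchanged under Python 3.

```python
# Five bytes: 'c', 'a', 'f', then the two-byte UTF-8 encoding of e-acute.
raw = b'caf\xc3\xa9'

# .decode() turns the bytes into a Unicode string using the named codec.
text = raw.decode('utf-8')

print(len(raw))   # 5 (bytes)
print(len(text))  # 4 (characters)
```

The two lengths differ because one character in this sample takes two bytes in UTF-8.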

In [3]:

surprise = '\xf0\x9f\x98\xbb'
print #your expression here!

Problem set 2: Character encoding

For this problem set, you'll be working with this text file. This is a very important text file that contains the names of various important Spanish people and the cities in which they live. (Or rather: let's imagine that this is the case. In reality, I generated the text file randomly from a list of Spanish names and places.) Run the following cell to retrieve the data in a string variable called sp_data:

In [4]:

import urllib
url = "http://static.decontextualize.com/spanish_data.txt"
sp_data = urllib.urlopen(url).read()

As you can see when you print the sp_data variable, there's a problem: the text file is in a strange encoding!

In [5]:

print sp_data

Montoro,Adela,Legan�s
Ramos,Faustino,Seville
Moya,Artemio,San Crist�bal de La Laguna
Rom�n,Olga,Burgos
G�mez,Fernando,Vitoria-Gasteiz
Vargas,Pedro,Fuenlabrada
Ortiz,Columbano,Terrassa
M�ndez,Gisela,M�laga
P�rez,Abel,Zaragoza
Garc�a,Agust�n,Logro�o
Dur�n,Adalberto,Albacete
Soler,Marcial,Santa Cruz de Tenerife
Velasco,Natalia,Alicante
Gallardo,Leandro,Getafe
Dom�nguez,Faustino,M�laga
Vargas,Conrado,A Coru�a
Lozano,Mar,Albacete
Dur�n,Octavio,Alicante
Campos,Antonia,Getafe
Campos,Baldomero,Granada
Prieto,Faustino,Badajoz
Pascual,Amadeo,L'Hospitalet de Llobregat
Guerrero,Rita,Santander
V�zquez,Acacio,Jerez de la Frontera
Jim�nez,Remedios,Castell�n de la Plana

In the following cell, write an expression that evaluates to a Unicode string containing the data in sp_data. Assign this expression to a variable called sp_data_unicode so that the print statement produces the expected result.

Hint: In order to do this, you'll need to determine the character encoding of the file. You can do this in several ways: try different codec names, or try looking at the file in a web browser and adjusting the web browser's encoding settings. (In Chrome, you can do this with View > Encoding.... Make sure to set it back to "Auto-Detect" when you're done.)

Expected output: The same data as above, but with the question-mark placeholders (�) replaced by the correct characters. You can spot check this by looking for the line García,Agustín,Logroño.
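
One way to "try different codec names" is to loop over a few candidates and look at what each one produces. This is only a sketch: the sample bytes and the candidate list are made up for illustration and are not taken from the homework file, so the codec that looks right here says nothing about the codec you need.

```python
# Hypothetical sample: one byte (0xF1) falls outside the ASCII range.
raw = b'Logro\xf1o'

candidates = ['utf-8', 'latin-1', 'cp437']
results = {}
for name in candidates:
    try:
        results[name] = raw.decode(name)
    except UnicodeDecodeError:
        # The bytes are not a valid sequence in this codec.
        results[name] = None

for name in candidates:
    # %r prints an unambiguous representation of each decoded value.
    print('%s -> %r' % (name, results[name]))
```

A codec that raises an error is ruled out immediately. Among the codecs that succeed, you still have to judge by eye which result reads as sensible text.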

In [6]:

sp_data_unicode = sp_data.decode('macroman')
sp_data_unicode = "" # replace "" here with your expression

Okay, you're doing great! For good measure, let's do a bit of data work. In the cell below, I've written a statement that splits sp_data_unicode into lines, so that lines is a list of strings, each with one line of data. Write a for loop that prints every city name in the data that has more than one word. (E.g., L'Hospitalet de Llobregat should be included in the list, but Albacete should not.)

Expected output:

San Cristóbal de La Laguna
Santa Cruz de Tenerife
A Coruña
L'Hospitalet de Llobregat
Jerez de la Frontera
Castellón de la Plana

In [7]:

lines = [x for x in sp_data_unicode.split("\n") if len(x) > 0]
for item in lines:
    pass # replace "pass" with one or more of your own statements

Problem set 3: Regular Expressions

In the following section, we're going to do a bit of digital humanities. (I guess this could also be journalism if you were... writing an investigative piece about... early 20th century American poetry?) We'll be working with the following text, Robert Frost's The Road Not Taken:

In [8]:

poem_lines = ['Two roads diverged in a yellow wood,',
 'And sorry I could not travel both',
 'And be one traveler, long I stood',
 'And looked down one as far as I could',
 'To where it bent in the undergrowth;',
 '',
 'Then took the other, as just as fair,',
 'And having perhaps the better claim,',
 'Because it was grassy and wanted wear;',
 'Though as for that the passing there',
 'Had worn them really about the same,',
 '',
 'And both that morning equally lay',
 'In leaves no step had trodden black.',
 'Oh, I kept the first for another day!',
 'Yet knowing how way leads on to way,',
 'I doubted if I should ever come back.',
 '',
 'I shall be telling this with a sigh',
 'Somewhere ages and ages hence:',
 'Two roads diverged in a wood, and I---',
 'I took the one less travelled by,',
 'And that has made all the difference.']

In the cell above, I defined a variable poem_lines that holds a list of the lines in the poem.

In the cell below, write a list comprehension (using re.search()) that evaluates to a list of lines containing at least one word that has exactly eight characters. (Hint: use the \b anchor.)

Expected result:

['Two roads diverged in a yellow wood,',
 'And be one traveler, long I stood',
 'Two roads diverged in a wood, and I---']
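
To illustrate the \b anchor mentioned in the hint, here is a small sketch on unrelated sample strings. The list samples and the word being searched for are made up for this example.

```python
import re

samples = ['the cat sat', 'concatenate strings', 'cat']

# \b matches at a word boundary, so r'\bcat\b' finds "cat" only as a
# whole word, not inside "concatenate".
whole_word = [s for s in samples if re.search(r'\bcat\b', s)]

print(whole_word)  # ['the cat sat', 'cat']
```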

In [9]:

import re
# your code here

Good! Now, in the following cell, write a list comprehension that evaluates to a list of lines in the poem that end with a four-letter word, whether or not there is punctuation following the word at the end of the line. (Hint: Try using the ? quantifier. Is there an existing character class, or a way to write a character class, that matches non-alphanumeric characters?)

Expected result:

['Two roads diverged in a yellow wood,',
 'And sorry I could not travel both',
 'Then took the other, as just as fair,',
 'Because it was grassy and wanted wear;',
 'Had worn them really about the same,',
 'I doubted if I should ever come back.',
 'I shall be telling this with a sigh']
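
As a sketch of how the ? quantifier combines with a character class and the $ anchor, here is an example on unrelated strings. It looks for strings that end in a digit, optionally followed by one punctuation mark; samples is a made-up list.

```python
import re

samples = ['room 7', 'room 7!', 'room seven']

# \d   one digit
# \W?  zero or one character that is not a letter, digit, or underscore
# $    end of the string
ends_with_digit = [s for s in samples if re.search(r'\d\W?$', s)]

print(ends_with_digit)  # ['room 7', 'room 7!']
```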

In [9]:

Okay, now a slightly trickier one. In the cell below, I've created a string all_lines which evaluates to the entire text of the poem in one string. Write an expression that evaluates to all of the words in the poem that are followed by a comma. (The strings in the resulting list should not include the comma.) Hint: Use grouping!

Expected result:

['wood',
 'traveler',
 'other',
 'fair',
 'claim',
 'same',
 'Oh',
 'way',
 'wood',
 'by']
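
To show what grouping does for re.findall(), here is a minimal sketch on a made-up string (text and keys are placeholder names). It pulls out the words that come right before a colon, without the colon itself.

```python
import re

text = 'name: Ada; role: engineer'

# With one group in the pattern, re.findall() returns only the text
# captured by that group, not the whole match.
keys = re.findall(r'(\w+):', text)

print(keys)  # ['name', 'role']
```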

In [10]:

all_lines = " ".join(poem_lines)
# your code here

Finally, something super tricky. Here's a list of strings that contains a restaurant menu. Your job is to wrangle this plain-text, slightly structured data into a list of dictionaries.

In [11]:

entrees = [
    "1. Yam, Rosemary and Chicken Bowl with Hot Sauce - $10.95",
    "2. Lavender and Pepperoni Sandwich - $8.49",
    "3. Water Chestnuts and Peas Power Lunch (with mayonnaise) - $12.95",
    "4v. Artichoke, Mustard Green and Arugula with Sesame Oil over noodles - $9.95",
    "5. Flank Steak with Lentils And Tabasco Pepper With Sweet Chilli Sauce - $19.95",
    "6v. Rutabaga And Cucumber Wrap - $8.49"
]

You'll need to pull out the name of the dish and the price of the dish. The v after the number indicates that the dish is vegetarian---you'll need to include that information in your dictionary as well.

Expected output:

[{'name': 'Yam, Rosemary and Chicken Bowl with Hot Sauce',
  'price': '$10.95',
  'vegetarian': False},
 {'name': 'Lavender and Pepperoni Sandwich',
  'price': '$8.49',
  'vegetarian': False},
 {'name': 'Water Chestnuts and Peas Power Lunch (with mayonnaise)',
  'price': '$12.95',
  'vegetarian': False},
 {'name': 'Artichoke, Mustard Green and Arugula with Sesame Oil over noodles',
  'price': '$9.95',
  'vegetarian': True},
 {'name': 'Flank Steak with Lentils And Tabasco Pepper With Sweet Chilli Sauce',
  'price': '$19.95',
  'vegetarian': False},
 {'name': 'Rutabaga And Cucumber Wrap', 'price': '$8.49', 'vegetarian': True}]

In [13]:

You're done! Great work.

Extra credit: Write an expression below to get the name of the most expensive item on the menu.

In [13]: