Notebook

Homework 4: SQL, Pandas, data munging, and loads of fun¶

These problem sets encompass two separate topics, covered in class over the past few weeks: (1) basic SQL queries in Python and (2) data munging with Pandas.

Problem set 1: SQL¶

For the first problem set, you'll be connecting to a remote PostgreSQL server and performing SQL queries inside of Python. You'll need to have either the psycopg2 library or the pg8000 library installed on your machine. In the cell below, populate the host, username and password fields with the values we shared with you in class. (If you missed this, ask a fellow student for the credentials.) Make sure to run the cell before you proceed.

In [7]:

host = ""
username = ""
password = ""

The code in the following cell should establish a connection to the remote SQL server, regardless of whether or not you have pg8000 or pgsql installed. The connection object will be assigned to a variable called conn. Run this cell before you proceed. (The output of the cell should either be pg8000.core.Connection or psycopg2.extensions.connection, depending on which library you have installed.)

In [2]:

try:
    import psycopg2
    conn = psycopg2.connect(host=host, user=username, password=password, database="kittens")
except ImportError:
    import pg8000
    conn = pg8000.connect(host=host, user=username, password=password, database="kittens")
type(conn)

Out[2]:

psycopg2.extensions.connection

Okay, now we can get to work! In the cell below, replace the expression None with the expression that evaluates to a new cursor object, created from the connection object you established in the cell above. Assign this cursor to a variable called cursor.

Expected output: either <pg8000.core.Cursor instance at ...> or <cursor object at ...; closed: 0>, depending on which library you're using. (The ... will be a hexadecimal number like 0x12345. The value will be unique every time you create a new cursor object.)

In [8]:

# your code here

Great work! You're now connected to a PostgreSQL server (running on a machine in the ~cloud~) that has a database named kittens. In this database, there is a single table called kittens. The schema of that table looks like this:

           Table "public.kittens"
 Column  |         Type          | Modifiers 
---------+-----------------------+-----------
 name    | character varying(80) | 
 age     | integer               | 
 weight  | double precision      | 
 borough | character varying(30) |

In the cell below, I've written the code necessary to make a query to the database. Fill in the value inside the quotation marks so that the output of the cell is a tuple with the values from all four columns in the first row in the database.

Expected output: ('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn') (this may be a list, not a tuple, if you're using pg8000 instead of psycopg2.)

In [4]:

cursor.execute("write your query here")
cursor.fetchone()

Out[4]:

('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn')

Great work! Now, copy the cursor.execute(...) line from the cell above and paste it into the cell below. Write the code necessary to print the following output:

Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn
Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens
Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn
William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan
Princess is 8 years old, weighs 10.2 pounds and lives in Bronx
Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn

(Hint: to retrieve all rows from a query, use a for loop to iterate over the cursor object. You might want to simply print each row first to see how the data is arranged.)

In [5]:

cursor.execute("write your query here")
# your code here!

Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn
Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens
Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn
William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan
Princess is 8 years old, weighs 10.2 pounds and lives in Bronx
Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn

Okay, the hard part is over. The following problem is a test of your SQL skills. In the cell below, modify the string inside of cursor.execute(...) so that the query returns the names of every cat from Brooklyn that weights more than 10 pounds. Write a for loop to print these results.

Expected output:

Monsieur Whiskeurs
Gazpacho

In [6]:

cursor.execute("your query here")
# your code here!

Monsieur Whiskeurs
Gazpacho

Problem set 2: Pandas¶

Remember, you need to import pandas before you can use it. Make the plots inline too!

In [ ]:

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#you need to press enter

In [ ]:

Indexing of dataframes¶

In cells below, import the EduX data set we used in calss from your notebooks directory as a dataframe.

Using pandas, slice and dice to get:

a dataframe with only the User Id and grade columns
a dataframe with all the columns but only rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200

In [ ]:

Boolean indexing¶

Suppose happydataframe is a dataframe with 200 rows and two columns "activity" and "endorphin_level".

In the cell below, explain briefly what is the difference between

happydataframe["activity"]="philately"

and

happydataframe["activity"]=="philalely"

Using the EduX dataset, compute how much video (nplay_video) on average the following watched:

women
men from Spain
woman older than 30 from India

Use boolean indexing.

Using the .groupby method create a data frame of how much video on average people from different countries of different genders watched. You can pass groupby a list of columns to group by.

something roughly like:

India

  F   10

  M   20

France

  F   300

  M   10

Precise formatting not at issue

In [ ]:

Simple statistical work¶

Download a small auto mileage data set from

http://www.columbia.edu/~mj340/auto-mpg.csv. (This is a sample data set included in the Bokeh package)

Using this auto mileage data set, find the average mpg (miles per gallon) per year for all cars and plot the values for each year. Include a title on your plot. Use panda's groupby function and its .plot() method.

In [ ]:

In the data base design, the manufacturer and car type are in a single string field (Ford Galaxie), rather than splitting the manufacturer (Ford) from the particular model (Galaxie). Using regular expressions or any other technique, separate the manufacturer from the mode and create two new columns for the manufacturer and the car models.

Now:

Find the average mpg for each manufacturer
Plot the changing mpg for four manufacturers

In [ ]: