These problem sets encompass two separate topics, covered in class over the past few weeks: (1) basic SQL queries in Python and (2) data munging with Pandas.
For the first problem set, you'll be connecting to a remote PostgreSQL server and performing SQL queries inside of Python. You'll need to have either the psycopg2
library or the pg8000
library installed on your machine. In the cell below, populate the host
, username
and password
fields with the values we shared with you in class. (If you missed this, ask a fellow student for the credentials.) Make sure to run the cell before you proceed.
host = ""
username = ""
password = ""
The code in the following cell should establish a connection to the remote SQL server, regardless of whether or not you have pg8000
or pgsql
installed. The connection object will be assigned to a variable called conn
. Run this cell before you proceed. (The output of the cell should either be pg8000.core.Connection
or psycopg2.extensions.connection
, depending on which library you have installed.)
try:
import psycopg2
conn = psycopg2.connect(host=host, user=username, password=password, database="kittens")
except ImportError:
import pg8000
conn = pg8000.connect(host=host, user=username, password=password, database="kittens")
type(conn)
psycopg2.extensions.connection
Okay, now we can get to work! In the cell below, replace the expression None
with the expression that evaluates to a new cursor object, created from the connection object you established in the cell above. Assign this cursor to a variable called cursor
.
Expected output: either <pg8000.core.Cursor instance at ...>
or <cursor object at ...; closed: 0>
, depending on which library you're using. (The ...
will be a hexadecimal number like 0x12345
. The value will be unique every time you create a new cursor object.)
# your code here
Great work! You're now connected to a PostgreSQL server (running on a machine in the ~cloud~) that has a database named kittens
. In this database, there is a single table called kittens
. The schema of that table looks like this:
Table "public.kittens"
Column | Type | Modifiers
---------+-----------------------+-----------
name | character varying(80) |
age | integer |
weight | double precision |
borough | character varying(30) |
In the cell below, I've written the code necessary to make a query to the database. Fill in the value inside the quotation marks so that the output of the cell is a tuple with the values from all four columns in the first row in the database.
Expected output: ('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn')
(this may be a list, not a tuple, if you're using pg8000
instead of psycopg2
.)
cursor.execute("write your query here")
cursor.fetchone()
('Monsieur Whiskeurs', 5, 12.2, 'Brooklyn')
Great work! Now, copy the cursor.execute(...)
line from the cell above and paste it into the cell below. Write the code necessary to print the following output:
Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn
Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens
Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn
William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan
Princess is 8 years old, weighs 10.2 pounds and lives in Bronx
Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn
(Hint: to retrieve all rows from a query, use a for
loop to iterate over the cursor object. You might want to simply print
each row first to see how the data is arranged.)
cursor.execute("write your query here")
# your code here!
Monsieur Whiskeurs is 5 years old, weighs 12.2 pounds and lives in Brooklyn Fluffles McGee is 2 years old, weighs 8.9 pounds and lives in Queens Gazpacho is 9 years old, weighs 11.2 pounds and lives in Brooklyn William T. Riker is 4 years old, weighs 13.0 pounds and lives in Manhattan Princess is 8 years old, weighs 10.2 pounds and lives in Bronx Meowchelle, Ma Belle is 13 years old, weighs 9.1 pounds and lives in Brooklyn
Okay, the hard part is over. The following problem is a test of your SQL skills. In the cell below, modify the string inside of cursor.execute(...)
so that the query returns the names of every cat from Brooklyn that weights more than 10 pounds. Write a for
loop to print these results.
Expected output:
Monsieur Whiskeurs
Gazpacho
cursor.execute("your query here")
# your code here!
Monsieur Whiskeurs Gazpacho
Remember, you need to import pandas before you can use it. Make the plots inline too!
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#you need to press enter
In cells below, import the EduX data set we used in calss from your notebooks directory as a dataframe.
Using pandas
, slice and dice to get:
Suppose happydataframe
is a dataframe with 200 rows and two columns "activity" and "endorphin_level".
In the cell below, explain briefly what is the difference between
happydataframe["activity"]="philately"
and
happydataframe["activity"]=="philalely"
Using the EduX dataset, compute how much video (nplay_video
) on average the following watched:
Use boolean indexing.
Using the .groupby
method create a data frame of how much video on average people from different countries of different genders watched. You can pass groupby
a list of columns to group by.
something roughly like:
India
F 10
M 20
France
F 300
M 10
Precise formatting not at issue
Download a small auto mileage data set from
http://www.columbia.edu/~mj340/auto-mpg.csv. (This is a sample data set included in the Bokeh
package)
Using this auto mileage data set, find the average mpg (miles per gallon) per year for all cars and plot the values for each year. Include a title on your plot. Use panda's groupby
function and its .plot() method.
In the data base design, the manufacturer and car type are in a single string field (Ford Galaxie), rather than splitting the manufacturer (Ford) from the particular model (Galaxie). Using regular expressions or any other technique, separate the manufacturer from the mode and create two new columns for the manufacturer and the car models.
Now: