Homework no. 3: Pandas, data munging, and loads of fun

Remember, you need to import pandas before you can use it:

In [ ]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# you need to run this cell (press Shift+Enter)
In [ ]:
 

Indexing of dataframes

In the cells below, import the edX data set from your notebooks directory as a dataframe. (It's at "./HMXPC13_DI_v2_5-14-14.csv", relative to your IPython notebooks directory.)

Using pandas, slice and dice to get:

  1. a dataframe with only the User Id and grade columns
  2. rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200
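
As a rough illustration of both kinds of selection, here is a minimal sketch on a toy dataframe rather than the real file. The column names userid_DI, grade and nevents are assumptions made for the example; check df.columns on the real data before relying on them.

```python
import pandas as pd

# Toy stand-in for the real data; the column names here are assumptions.
df = pd.DataFrame({
    "userid_DI": [f"user_{i}" for i in range(1500)],
    "grade": [round((i % 11) / 10, 1) for i in range(1500)],
    "nevents": range(1500),
})

# 1. Keep only two columns by passing a list of column names.
id_and_grade = df[["userid_DI", "grade"]]

# 2. Pick evenly spaced rows by position: 400, 500, ..., 1200.
every_hundredth = df.iloc[400:1201:100]
```

The slice stops at 1201 because the end of an iloc slice is exclusive.
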
In [ ]:
 
In [ ]:
 

Basic statistical work

Load the earthquakes csv from the Foundations class using pd.read_csv.

The data's at https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/week_3/earthquakes.csv

The csv includes the labels for the columns.

Using the magnitudes of the earthquakes--the 'mag' column--calculate:

  • the mean of all earthquake magnitudes

  • the five earthquakes with the greatest magnitudes

    • give the row number, time, magnitude and place for each

      hint: use the .size() method or the value_counts() method
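
One possible approach, sketched on a small made-up table instead of the real csv: the 'mag' column name comes from the assignment, while the 'time' and 'place' column names and all of the values are assumptions for illustration. The sketch uses .mean() and .nlargest(), which is one option among several (sorting with .sort_values() would also work).

```python
import pandas as pd

# Toy data; only the 'mag' column name is taken from the assignment.
quakes = pd.DataFrame({
    "time": ["2015-06-01T00:10", "2015-06-01T02:45",
             "2015-06-01T05:30", "2015-06-01T09:15"],
    "mag": [1.2, 4.5, 2.8, 3.1],
    "place": ["Place A", "Place B", "Place C", "Place D"],
})

# Mean of one column.
mean_mag = quakes["mag"].mean()

# The two rows with the largest 'mag'; the index doubles as the row number.
top_two = quakes.nlargest(2, "mag")[["time", "mag", "place"]]
```
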

Boolean indexing

Suppose happydataframe is a dataframe with 200 rows and two columns "activity" and "endorphin_level".

Briefly explain the difference between

happydataframe["activity"]="philately"

and

happydataframe["activity"]=="philately"
In [ ]:
 

Using the HarvardX dataset, compute the average amount of video watched (nplay_video) by each of the following groups:

  • men
  • women from Spain
  • men older than 30 from India

Use boolean indexing.
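
A minimal sketch of combining boolean masks, using a toy dataframe. Apart from nplay_video, the column names (gender, country, age) and the values are assumptions; the real data set may, for instance, store a year of birth rather than an age.

```python
import pandas as pd

# Toy data; only the nplay_video column name is taken from the assignment.
people = pd.DataFrame({
    "gender": ["m", "f", "m", "f", "m"],
    "country": ["India", "Spain", "India", "Spain", "France"],
    "age": [35, 28, 22, 41, 50],
    "nplay_video": [10.0, 40.0, 30.0, 20.0, 60.0],
})

# Each comparison yields a boolean Series with one value per row.
is_man = people["gender"] == "m"
from_india = people["country"] == "India"
over_30 = people["age"] > 30

# A single mask selects rows; .loc also picks the column to average.
men_mean = people.loc[is_man, "nplay_video"].mean()

# Masks combine element-wise with & (and), | (or) and ~ (not).
older_indian_men_mean = people.loc[
    is_man & from_india & over_30, "nplay_video"
].mean()
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because & binds more tightly than == and >.
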

In [ ]:
 

Using the .groupby method, create a dataframe showing how much video, on average, people of each gender in each country watched.

It should look roughly like this:

India

  F   10

  M   20

France

  F   300

  M   10

The precise formatting does not matter.
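
Grouping by two keys at once can be sketched as follows. The dataframe is a toy one, and the country and gender column names are assumptions chosen for the example.

```python
import pandas as pd

# Toy data; the grouping column names are assumptions.
views = pd.DataFrame({
    "country": ["India", "India", "India", "France", "France"],
    "gender": ["f", "m", "m", "f", "m"],
    "nplay_video": [10.0, 15.0, 25.0, 300.0, 10.0],
})

# Passing a list of keys gives one group per (country, gender) pair.
mean_by_group = views.groupby(["country", "gender"])["nplay_video"].mean()

# The result is a Series with a two-level index; reset_index() turns it
# into an ordinary dataframe with one row per pair.
as_frame = mean_by_group.reset_index()
```
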

In [ ]:
 

Questions in re: re.sub()

Turn now to the files in the directory ml-100k. In the lecture, we manually converted the field names of the u.user file when loading it into a pandas dataframe.

Using regular expressions, convert the string "user id | age | gender | occupation | zip code" into a list of column names (strings) called labels. Replace any spaces within the names with underscores (_), so "zip code" becomes "zip_code", etc.
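
The general pattern can be sketched with re.split and re.sub on a different, hypothetical header string. The string and the variable names below are made up for illustration, and this is only one of several workable regex approaches.

```python
import re

raw = "first name | last name | home town"  # hypothetical header string

# Split on the pipe together with any whitespace around it.
fields = re.split(r"\s*\|\s*", raw)

# Replace each run of whitespace left inside a name with one underscore.
example_labels = [re.sub(r"\s+", "_", field) for field in fields]
# example_labels is now ['first_name', 'last_name', 'home_town']
```
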

In [ ]:
 

In the README file, find the column names for u.item and u.data. Using regular expressions, parse each set of names into a list of column names (strings). Replace any spaces within the names with underscores (_).

In [ ]:
 

Movie data

Drawing upon the two lists of labels you've just created, use pd.read_csv to load the u.item and u.data files as dataframes.
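
For files that have no header row, pd.read_csv accepts the column names through its names parameter, together with sep for the delimiter. The sketch below reads two small in-memory strings instead of the real files; the delimiters, column names and contents are assumptions, so check the README for what u.item and u.data actually use.

```python
import io
import pandas as pd

# In-memory stand-ins for files on disk; a real call would pass a path.
pipe_text = "1|Toy Movie (1995)\n2|Other Movie (1996)\n"
tab_text = "42\t1\t5\n42\t2\t3\n"

# header=None says the first line is data, not column labels.
items = pd.read_csv(io.StringIO(pipe_text), sep="|", header=None,
                    names=["movie_id", "movie_title"])
ratings = pd.read_csv(io.StringIO(tab_text), sep="\t", header=None,
                      names=["user_id", "item_id", "rating"])
```

If pandas reports a decoding error on a real file, passing an explicit encoding such as encoding="latin-1" is a common remedy.
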

In [ ]:
 

Using the dataframe you've created from u.data, produce:

  1. a dataframe including all the item numbers and ratings given by user 42
  2. the mean of user 42's ratings
  3. a dataframe including all the item numbers that user 42 rated above their own mean rating
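
The three steps above can be sketched on a toy ratings table. The column names user_id, item_id and rating are assumptions, and a made-up user 7 stands in for user 42.

```python
import pandas as pd

# Toy ratings; column names and values are assumptions.
ratings = pd.DataFrame({
    "user_id": [7, 7, 7, 8, 8],
    "item_id": [101, 102, 103, 101, 104],
    "rating": [5, 2, 4, 3, 1],
})

# 1. Rows for one user, keeping only the item and rating columns.
user_rows = ratings.loc[ratings["user_id"] == 7, ["item_id", "rating"]]

# 2. That user's mean rating.
user_mean = user_rows["rating"].mean()

# 3. Items the user rated above their own mean, kept as a dataframe.
above_mean = user_rows.loc[user_rows["rating"] > user_mean, ["item_id"]]
```
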
In [ ]:
 

Finally, take the item numbers that user 42 rated above their own mean rating. Using the data from u.item, give the titles of the movies corresponding to those item numbers.
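
One way to look up rows by a list of ids is the Series method .isin(), sketched here on toy data. The column names movie_id and movie_title, the ids and the titles are all assumptions for illustration.

```python
import pandas as pd

# Toy item table; names and values are assumptions.
items = pd.DataFrame({
    "movie_id": [101, 102, 103, 104],
    "movie_title": ["Toy Movie A", "Toy Movie B", "Toy Movie C", "Toy Movie D"],
})
liked_ids = [101, 103]

# .isin() gives True for every row whose id appears in the list.
liked_titles = items.loc[items["movie_id"].isin(liked_ids), "movie_title"].tolist()
# liked_titles is now ['Toy Movie A', 'Toy Movie C']
```

A merge between the two dataframes would be an alternative when more than the titles is needed.
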

In [ ]: