Homework no. 3: Pandas, data munging, and loads of fun¶

Remember, you need to import pandas before you can use it:

In [ ]:

%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# run this cell (Shift+Enter) so the imports take effect

In [ ]:

Indexing of dataframes¶

In the cells below, import the EdX data set from your notebooks directory as a dataframe. (Relative to your IPython notebooks directory, it's at "./HMXPC13_DI_v2_5-14-14.csv".)

Using pandas, slice and dice to get:

a dataframe with only the User Id and grade columns
rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200
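
As a reminder of the two kinds of selection involved, here is a minimal sketch on a made-up table. The dataframe toy and its column names are hypothetical and have nothing to do with the EdX file; only the indexing pattern carries over.

```python
import pandas as pd

# Hypothetical toy data standing in for a much larger table
toy = pd.DataFrame({
    "student": [f"s{i}" for i in range(10)],
    "score": [i * 10 for i in range(10)],
    "city": ["Oslo"] * 10,
})

# A list of column names selects just those columns
two_columns = toy[["student", "score"]]

# .iloc selects rows by position; start:stop:step works like list slicing
every_third = toy.iloc[0:10:3]

# An explicit list of positions works as well
picked = toy.iloc[[2, 5, 8]]

print(two_columns.head())
print(every_third)
print(picked)
```

Note that the stop value of a slice is exclusive, just as with Python lists.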

In [ ]:

Basic statistical work¶

Load the earthquakes csv from the Foundations class using pd.read_csv.

The data's at https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/week_3/earthquakes.csv

The csv includes the labels for the columns.

Using the magnitudes of the earthquakes (the 'mag' column), calculate:

the mean of all earthquake magnitudes
the five earthquakes with the greatest magnitudes
- give the row number, time, magnitude and place for each
hint: use the .sort_values() method or the .nlargest() method
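
For illustration only, here is a small sketch of taking a column mean and pulling out the rows with the largest values. The dataframe quakes_toy and its values are invented; the real csv has many more rows and columns.

```python
import pandas as pd

# Hypothetical miniature table, not the real earthquake data
quakes_toy = pd.DataFrame({
    "time": ["t0", "t1", "t2", "t3"],
    "mag": [1.0, 4.0, 2.5, 3.5],
    "place": ["A", "B", "C", "D"],
})

# Mean of a single column
mean_mag = quakes_toy["mag"].mean()

# The two rows with the largest 'mag'; the original row labels are kept
top_two = quakes_toy.nlargest(2, "mag")

# The same result by sorting in descending order and taking the head
same_top_two = quakes_toy.sort_values("mag", ascending=False).head(2)

print(mean_mag)
print(top_two[["time", "mag", "place"]])
```

Because the index is preserved, the row numbers of the selected rows can be read straight off the result.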

Boolean indexing¶

Suppose happydataframe is a dataframe with 200 rows and two columns "activity" and "endorphin_level".

Briefly explain the difference between

happydataframe["activity"]="philately"

and

happydataframe["activity"]=="philately"
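
If you want to experiment before writing your explanation, a sketch along these lines may help. The dataframe pets is a hypothetical stand-in, not happydataframe; try both operators and inspect what comes back and what happens to the original.

```python
import pandas as pd

# Hypothetical toy dataframe for experimenting with = versus ==
pets = pd.DataFrame({"animal": ["cat", "dog", "cat"], "age": [3, 5, 1]})

# == on a column: evaluate it and look at the type and contents of the result
is_cat = pets["animal"] == "cat"
print(type(is_cat))
print(is_cat)

# = on a column: done on a copy so the original stays available for comparison
pets_copy = pets.copy()
pets_copy["animal"] = "ferret"
print(pets_copy)
print(pets)
```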

In [ ]:

Using the HarvardX dataset (the EdX file loaded earlier), compute how much video (nplay_video), on average, each of the following groups watched:

men
women from Spain
men older than 30 from India

Use boolean indexing.
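
A rough sketch of combining several conditions in one boolean mask follows. The table people and its columns are made up for the example; the column names in the HarvardX file are different and you will need to look them up.

```python
import pandas as pd

# Hypothetical toy data; column names are NOT those of the HarvardX file
people = pd.DataFrame({
    "gender": ["m", "f", "f", "m"],
    "country": ["Peru", "Peru", "Chile", "Chile"],
    "age": [25, 40, 31, 52],
    "hours": [2.0, 4.0, 6.0, 8.0],
})

# Each comparison gives a boolean Series; combine them with & (and) or | (or).
# The parentheses are required because & binds tighter than == and >.
mask = (people["gender"] == "f") & (people["age"] > 30)
avg_hours = people[mask]["hours"].mean()

# Three conditions at once, using .loc to pick rows and a column together
mask3 = (
    (people["gender"] == "m")
    & (people["country"] == "Chile")
    & (people["age"] > 30)
)
avg_hours_3 = people.loc[mask3, "hours"].mean()

print(avg_hours, avg_hours_3)
```

By default .mean() skips missing values (NaN), which matters for columns that are not filled in for every row.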

In [ ]:

Using the .groupby method, create a dataframe showing how much video, on average, people of each gender in each country watched.

Something roughly like:

India

  F   10

  M   20

France

  F   300

  M   10

The precise formatting doesn't matter.
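
One possible shape for a two-level grouping, sketched on invented data. The names viewers, country, gender and hours are assumptions for the example only.

```python
import pandas as pd

# Hypothetical toy data
viewers = pd.DataFrame({
    "country": ["Peru", "Peru", "Chile", "Chile", "Peru"],
    "gender": ["f", "m", "f", "m", "f"],
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Grouping by a list of columns gives one group per (country, gender) pair
by_group = viewers.groupby(["country", "gender"])["hours"].mean()

# The result is a Series with a two-level index; to_frame() makes it a dataframe
as_frame = by_group.to_frame()

# unstack() moves the inner level (gender) into columns, if a wide table is preferred
wide = by_group.unstack()

print(as_frame)
print(wide)
```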

In [ ]:

Questions in re: re.sub()¶

Turn now to the files in the ml-100k directory. In the lecture, we typed out the field names for the u.user file by hand when converting it into a pandas dataframe.

Using regular expressions, convert the string "user id | age | gender | occupation | zip code" into a list of strings named labels, one string per column name. Replace any spaces within the names with underscores (_), so "zip code" becomes "zip_code", and so on.
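
As a sketch of the general approach, here is one way to split on a delimiter and then substitute within each piece, shown on a different, made-up header string. The variable names are hypothetical, and other regular expressions would work just as well.

```python
import re

# Hypothetical header string, deliberately not the one from the assignment
header = "first name | last name | home town"

# Split on a pipe together with any whitespace around it
parts = re.split(r"\s*\|\s*", header)

# Replace each run of whitespace inside a name with a single underscore
names = [re.sub(r"\s+", "_", part.strip()) for part in parts]

print(parts)
print(names)
```

The pipe has to be escaped as \| because an unescaped | means "or" in a regular expression.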

In [ ]:

In the README file, find the column names for u.item and u.data. Using regular expressions, parse each set of names into a list of strings, one per column. Replace any spaces within the names with underscores (_).

In [ ]:

Movie data¶

Drawing upon the two lists of labels you've just created, use pd.read_csv to load the u.item and u.data files as dataframes.
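
A minimal sketch of reading delimited text that has no header row, supplying the column names yourself. The data here is invented and read from in-memory strings so the example stands alone. Which delimiter each ml-100k file uses, and whether an encoding argument is needed, is something to confirm from the README and the files themselves.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited text with no header line
raw = "1|Ann|30\n2|Bob|41\n"
cols = ["id", "name", "age"]

# header=None says the first line is data; names supplies the column labels
df = pd.read_csv(io.StringIO(raw), sep="|", names=cols, header=None)

# The same pattern for tab-delimited text
tab_raw = "1\t10\n2\t20\n"
df_tab = pd.read_csv(
    io.StringIO(tab_raw), sep="\t", names=["id", "value"], header=None
)

print(df)
print(df_tab)
```

With a real file, pass the file path in place of the io.StringIO object.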

In [ ]:

Using the dataframe you've created from u.data, produce:

a dataframe including all the item numbers and ratings given by user 42
the mean of user 42's ratings
a dataframe including all the item numbers that user 42 rated above their mean rating
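
The three steps above can be sketched on a toy ratings table like this. The dataframe ratings, its column names and the user number are all hypothetical; substitute the labels you parsed from the README.

```python
import pandas as pd

# Hypothetical toy ratings; column names are assumptions for the example
ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "item": [10, 11, 12, 10, 13],
    "stars": [2, 4, 5, 3, 1],
})

# Rows for one user, keeping only the item and rating columns
user1 = ratings[ratings["user"] == 1][["item", "stars"]]

# That user's mean rating
user1_mean = user1["stars"].mean()

# Items that user rated strictly above their own mean
above = user1[user1["stars"] > user1_mean]

print(user1_mean)
print(above)
```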

In [ ]:

Finally, take the item numbers that user 42 rated above their mean. Using the data in u.item, give the titles of the movies corresponding to those item numbers.
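
Looking up rows in one table from a list of keys taken from another can be done in a couple of ways. Below is a sketch using .isin() and, as an alternative, .merge(). The catalog table and the liked list are invented for the example.

```python
import pandas as pd

# Hypothetical lookup table and list of keys
catalog = pd.DataFrame({
    "item": [10, 11, 12, 13],
    "title": ["A", "B", "C", "D"],
})
liked = [11, 12]

# .isin() builds a boolean mask marking rows whose key appears in the list
titles = catalog[catalog["item"].isin(liked)]["title"].tolist()

# Alternatively, merge a dataframe of keys with the lookup table
liked_df = pd.DataFrame({"item": liked})
merged = liked_df.merge(catalog, on="item", how="left")

print(titles)
print(merged)
```

The merge version keeps the order of the keys you pass in, while the .isin() version keeps the order of the lookup table.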

In [ ]: