# Homework no. 3: Pandas, data munging, and loads of fun

Remember, you need to import pandas before you can use it:

In [ ]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# press Shift+Enter to run this cell

In [ ]:



### Indexing of dataframes

In the cells below, load the EdX data set from your notebooks directory as a dataframe. (Its path relative to your IPython notebooks is "./HMXPC13_DI_v2_5-14-14.csv".)

Using pandas, slice and dice to get:

1. a dataframe with only the User Id and grade columns
2. rows 400, 500, 600, 700, 800, 900, 1000, 1100, 1200
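
As a reminder of the two kinds of selection involved, here is a minimal sketch on a made-up dataframe. The frame `df` and its column names (`user`, `score`, `country`) are illustrative assumptions, not the real EdX columns, so check `df.columns` on the actual data before adapting it.

```python
import pandas as pd

# Toy stand-in for the real data; the column names are invented for illustration.
df = pd.DataFrame({
    "user": [f"u{i}" for i in range(20)],
    "score": [i / 20 for i in range(20)],
    "country": ["X"] * 20,
})

# Select a subset of columns by passing a list of column names.
two_cols = df[["user", "score"]]

# Select evenly spaced rows by integer position using a slice with a step...
every_fifth = df.iloc[5:20:5]      # positions 5, 10, 15

# ...or by passing an explicit list of positions.
same_rows = df.iloc[[5, 10, 15]]
```

Both row selections return the same three rows. The slice form is shorter when the rows are regularly spaced.
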
In [ ]:


In [ ]:



### Basic statistical work

Load the earthquakes csv from the Foundations class using pd.read_csv.

The csv includes the labels for the columns.

Using the magnitudes of the earthquakes--the 'mag' column--calculate:

• the mean of all earthquake magnitudes

• the five earthquakes with the greatest magnitudes, giving the row number, time, magnitude, and place for each

hint: use the .size() method or the value_counts() method
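
One possible way to get a mean and the largest values of a column is sketched below on invented data. The frame `quakes` and its values are made up. Only the column name `mag` comes from the assignment, while `time` and `place` are assumed names. `nlargest` is just one option; sorting with `sort_values` and taking the first rows works as well.

```python
import pandas as pd

# Invented sample data; the real csv has many more rows and columns.
quakes = pd.DataFrame({
    "time": ["t0", "t1", "t2", "t3", "t4", "t5"],
    "mag": [2.5, 6.1, 4.0, 5.2, 3.3, 4.8],
    "place": ["A", "B", "C", "D", "E", "F"],
})

# Mean of a single column.
mean_mag = quakes["mag"].mean()

# The three rows with the largest 'mag', keeping only the columns of interest.
# The index of the result holds the original row numbers.
top3 = quakes.nlargest(3, "mag")[["time", "mag", "place"]]
```
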

### Boolean indexing

Suppose happydataframe is a dataframe with 200 rows and two columns "activity" and "endorphin_level".

Briefly explain the difference between

happydataframe["activity"]="philately"



and

happydataframe["activity"]=="philately"
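
To experiment with the two operators before answering, a small sketch on a made-up frame may help. The frame `toy` and its columns are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({"hobby": ["chess", "knitting", "chess"], "score": [1, 2, 3]})

# `==` compares element-wise and returns a boolean Series; `toy` is not modified.
mask = toy["hobby"] == "chess"

# `=` assigns: every row of the column is overwritten with the scalar.
changed = toy.copy()
changed["hobby"] = "chess"
```
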
In [ ]:



Using the HarvardX dataset, compute the average number of video plays (nplay_video) for each of the following groups:

• men
• women from Spain
• men older than 30 from India

Use boolean indexing.
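
The general pattern for combining several conditions is sketched below on invented data. The frame `people` and the columns `gender`, `country`, `age`, and `plays` are assumptions. If the real dataset records a year of birth rather than an age, the age has to be derived first.

```python
import pandas as pd

# Invented sample data for illustration.
people = pd.DataFrame({
    "gender": ["m", "f", "m", "f", "m"],
    "country": ["Spain", "Spain", "India", "India", "India"],
    "age": [25, 40, 35, 22, 50],
    "plays": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# A single condition gives a boolean mask that selects rows.
is_m = people["gender"] == "m"
avg_m = people.loc[is_m, "plays"].mean()

# Combine masks with & (and) or | (or); wrap each comparison in parentheses,
# because & binds more tightly than == and >.
mask = (people["gender"] == "m") & (people["country"] == "India") & (people["age"] > 30)
avg_older_m_india = people.loc[mask, "plays"].mean()
```

By default, `mean()` skips missing values (NaN), which matters if many rows in the real data have no recorded video plays.
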

In [ ]:



Using the .groupby method, create a dataframe showing how much video, on average, people of each gender in each country watched.

The result should look roughly like this:

    India
        F    10
        M    20

    France
        F    300
        M    10

The precise formatting does not matter.
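
For reference, grouping by two keys at once can be sketched as follows on made-up data. The frame `watch` and its column names are assumptions for illustration.

```python
import pandas as pd

# Invented sample data.
watch = pd.DataFrame({
    "country": ["India", "India", "India", "France", "France"],
    "gender": ["f", "m", "m", "f", "m"],
    "plays": [10.0, 15.0, 25.0, 300.0, 10.0],
})

# Passing a list of keys yields one group per (country, gender) pair.
# The result is a Series with a two-level index.
avg_by_group = watch.groupby(["country", "gender"])["plays"].mean()

# unstack() pivots the inner index level into columns, giving a dataframe
# with one row per country and one column per gender.
as_table = avg_by_group.unstack()
```
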

In [ ]:



### Questions in re: re.sub()

Turn now to the files in the directory ml-100k. In the lecture, we manually typed out the field names for the u.user file when converting it into a pandas dataframe.

Using regular expressions, convert the string "user id | age | gender | occupation | zip code" into a list of strings named labels that holds the column names. Replace any spaces within the names with underscores (_), so that "zip code" becomes "zip_code", and so on.
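
The two regular-expression steps involved, splitting on a delimiter and substituting within each piece, can be sketched on a different, made-up header string. The string `header` and the variable names are assumptions for illustration.

```python
import re

# A made-up header in the same "name | name | name" style.
header = "first name | last name | home town"

# Split on the pipe character, swallowing any whitespace around it.
parts = re.split(r"\s*\|\s*", header.strip())

# Replace each run of whitespace inside a name with a single underscore.
names = [re.sub(r"\s+", "_", part) for part in parts]
```

The pipe must be escaped as `\|` because an unescaped `|` means alternation in a regular expression.
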

In [ ]:



In the README file, find the names for the columns for u.item and u.data. Using regular expressions, parse each set of names into a list of strings of the names of the columns. Replace any spaces within the names with underscores (_).

In [ ]:



### Movie data

Drawing upon the two lists of labels you've just created, use pd.read_csv to load the u.item and u.data files as dataframes.
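
A minimal sketch of reading a headerless, delimited file with your own column labels follows. It uses an in-memory string in place of a real file, and the data and the names in `cols` are invented. Check the README for the actual separator of each file. If reading a file raises a UnicodeDecodeError, passing an `encoding` argument such as "latin-1" is a common workaround.

```python
import io

import pandas as pd

# Stand-in for a pipe-delimited file with no header row.
raw = "1|Alpha|1990\n2|Beta|1995\n"
cols = ["item_id", "title", "year"]   # hypothetical labels

# header=None tells pandas the first line is data, not column names;
# names= supplies the labels; sep= sets the field delimiter.
items = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=cols)

# The same call handles tab-separated data with sep="\t".
raw_tab = "7\t1\t5\n7\t2\t3\n"
ratings = pd.read_csv(io.StringIO(raw_tab), sep="\t", header=None,
                      names=["user_id", "item_id", "rating"])
```
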

In [ ]:



Using the dataframe you've created from u.data, produce:

1. a dataframe including all the item numbers and ratings given by user 42
2. the mean of user 42's ratings
3. a dataframe including all the item numbers that user 42 rated higher than their own mean rating
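
The three steps above can be sketched on invented ratings for a different, made-up user. The frame `ratings` and the column names `user_id`, `item_id`, and `rating` are assumptions; use whatever labels you parsed from the README.

```python
import pandas as pd

# Invented sample ratings.
ratings = pd.DataFrame({
    "user_id": [7, 7, 7, 8, 8],
    "item_id": [101, 102, 103, 101, 104],
    "rating": [2, 4, 5, 3, 1],
})

# Rows for one user, keeping only the item and rating columns.
user7 = ratings.loc[ratings["user_id"] == 7, ["item_id", "rating"]]

# That user's mean rating.
user7_mean = user7["rating"].mean()

# Items the user rated strictly above their own mean.
above_mean = user7[user7["rating"] > user7_mean]
```
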
In [ ]:



Finally, take the item numbers that user 42 rated higher than their own mean rating. Using the data from u.item, give the titles of the movies corresponding to those item numbers.
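
One way to look up rows in a second dataframe by a set of ids is `isin`, sketched here on made-up data. The frame `items`, the list `liked_ids`, and the column names are assumptions. Joining the two dataframes with `merge` on the item column is an alternative.

```python
import pandas as pd

# Invented item table and a hypothetical list of ids to look up.
items = pd.DataFrame({
    "item_id": [101, 102, 103, 104],
    "title": ["A", "B", "C", "D"],
})
liked_ids = [102, 103]

# isin() builds a boolean mask marking rows whose id is in the list.
liked_titles = items.loc[items["item_id"].isin(liked_ids), "title"].tolist()
```
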

In [ ]: