Remember, you need to import pandas before you can use it:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#you need to press enter
In the cells below, import the edX data set from your notebooks directory as a dataframe. (It's at "./HMXPC13_DI_v2_5-14-14.csv", relative to your IPython notebooks directory.)
Using pandas, slice and dice to get:
Load the earthquakes CSV from the Foundations class using pd.read_csv.
The data's at https://raw.githubusercontent.com/ledeprogram/courses/master/foundations/week_3/earthquakes.csv
The csv includes the labels for the columns.
Using the magnitudes of the earthquakes--the 'mag' column--calculate:
the mean of all earthquake magnitudes
the five earthquakes with the greatest magnitudes
hint: use the .size() method or the value_counts() method
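A minimal sketch of these calculations follows. So that it runs without a network connection, it uses a small made-up dataframe in place of the real file; the 'region' column is hypothetical and only there to illustrate the counting methods from the hint. With the real data you would build the dataframe with pd.read_csv and the URL given above.

```python
import pandas as pd

# Made-up stand-in for the earthquakes data. Only 'mag' is named in the
# exercise; 'region' is a hypothetical column used to demonstrate counting.
earthquakes = pd.DataFrame({
    "mag": [4.5, 6.1, 2.3, 5.0, 7.2, 3.3, 5.8],
    "region": ["A", "B", "A", "C", "B", "A", "C"],
})

# Mean of all magnitudes.
mean_mag = earthquakes["mag"].mean()

# The five rows with the greatest magnitudes, largest first.
top_five = earthquakes.nlargest(5, "mag")

# Two ways to count rows per category.
counts_vc = earthquakes["region"].value_counts()
counts_size = earthquakes.groupby("region").size()

print(mean_mag)
print(top_five)
print(counts_vc)
```

value_counts() sorts by frequency, while groupby(...).size() sorts by the group key; both give the same counts.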
Suppose happydataframe is a dataframe with 200 rows and two columns, "activity" and "endorphin_level".
Briefly explain the difference between
happydataframe["activity"]="philately"
and
happydataframe["activity"]=="philately"
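One way to see the difference is to run both statements on a small dataframe. The sketch below uses a made-up three-row stand-in for happydataframe (the one in the exercise has 200 rows); the values are invented for illustration.

```python
import pandas as pd

# Made-up three-row stand-in for happydataframe.
happydataframe = pd.DataFrame({
    "activity": ["running", "philately", "napping"],
    "endorphin_level": [9.1, 4.2, 6.5],
})

# == compares element-wise and returns a boolean Series.
# The dataframe itself is left unchanged.
mask = happydataframe["activity"] == "philately"
philatelists = happydataframe[mask]

# = assigns: every row of the column is overwritten with the same string.
# A copy is used here so the original stays available for comparison.
overwritten = happydataframe.copy()
overwritten["activity"] = "philately"

print(mask.tolist())
print(overwritten["activity"].tolist())
```

The comparison yields a mask that can be used for boolean indexing; the assignment destroys whatever the column held before.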
Using the HarvardX dataset, compute how much video (nplay_video) each of the following groups watched on average:
Use boolean indexing.
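As a rough illustration of boolean indexing, the sketch below filters a made-up dataframe before averaging. Only nplay_video is named in the exercise; the 'gender' and 'certified' columns and all values are assumptions chosen for the example, so check the real column names in the dataset.

```python
import pandas as pd

# Made-up stand-in for the HarvardX data; column names other than
# nplay_video are assumptions.
edx = pd.DataFrame({
    "gender": ["f", "m", "f", "m", "f"],
    "certified": [1, 0, 0, 1, 1],
    "nplay_video": [10.0, 4.0, float("nan"), 20.0, 30.0],
})

# One condition: rows where gender is "f".
women = edx[edx["gender"] == "f"]
mean_women = women["nplay_video"].mean()  # NaN values are skipped by default

# Two conditions: combine with & and wrap each in parentheses.
certified_men = edx[(edx["gender"] == "m") & (edx["certified"] == 1)]
mean_certified_men = certified_men["nplay_video"].mean()

print(mean_women, mean_certified_men)
```

Note that Python's `and` does not work on Series; use `&` (and `|` for "or") with parentheses around each comparison.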
Using the .groupby method, create a data frame of how much video, on average, people of different genders from different countries watched.
It should look roughly like:
India
    F    10
    M    20
France
    F    300
    M    10
Precise formatting is not at issue.
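A minimal sketch of a two-level groupby that produces output of this shape, using made-up data. The column names 'country' and 'gender' are assumptions for the example; the real dataset may name them differently.

```python
import pandas as pd

# Made-up data; 'country' and 'gender' are assumed column names.
edx = pd.DataFrame({
    "country": ["India", "India", "India", "France", "France"],
    "gender": ["F", "M", "M", "F", "M"],
    "nplay_video": [10, 15, 25, 300, 10],
})

# Group by both keys, then average nplay_video within each group.
by_country_gender = edx.groupby(["country", "gender"])["nplay_video"].mean()

# The result is a Series with a two-level index; to_frame() makes it a dataframe.
result = by_country_gender.to_frame()
print(result)
```

Groups are sorted by key by default, so France is listed before India here.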
Turn now to the files in the directory ml-100k. In the lecture, we manually converted the field names of the u.user file when loading it into a pandas dataframe.
Using regular expressions, convert the string "user id | age | gender | occupation | zip code" into a list named labels of strings of the names of the columns. Replace any spaces within the names with underscores (_), so "zip code" will become "zip_code", and so on.
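One possible approach, sketched with the standard re module: split on the pipe together with its surrounding whitespace, then replace the whitespace left inside each name. The intermediate names header and fields are just illustrative.

```python
import re

header = "user id | age | gender | occupation | zip code"

# Split on each pipe along with any whitespace around it.
fields = re.split(r"\s*\|\s*", header.strip())

# Replace any run of whitespace remaining inside a name with one underscore.
labels = [re.sub(r"\s+", "_", field) for field in fields]

print(labels)
```

Splitting first and substituting second avoids ending up with underscores next to the separators.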
In the README file, find the names for the columns for u.item and u.data. Using regular expressions, parse each set of names into a list of strings of the names of the columns. Replace any spaces within the names with underscores (_).
Drawing upon the two lists of labels you've just created, use pd.read_csv to load the u.item and u.data files as dataframes.
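A hedged sketch of passing your own column labels to pd.read_csv. To stay self-contained it reads made-up text from memory rather than the real files, and the label lists, the separators (tab and pipe), and the sample rows are all assumptions; check them against the README and the files themselves.

```python
import io
import pandas as pd

# Assumed label lists; in the exercise these come from parsing the README.
data_labels = ["user_id", "item_id", "rating", "timestamp"]
item_labels = ["movie_id", "movie_title", "release_date"]  # shortened, hypothetical

# Made-up sample text standing in for the two files.
data_text = "1\t10\t4\t880000000\n2\t11\t3\t880000500\n"
item_text = "10|Example Movie (1995)|01-Jan-1995\n11|Another Film (1996)|15-Mar-1996\n"

# header=None says the file has no header row; names supplies the labels.
ratings = pd.read_csv(io.StringIO(data_text), sep="\t", header=None, names=data_labels)
items = pd.read_csv(io.StringIO(item_text), sep="|", header=None, names=item_labels)

print(ratings)
print(items)
```

With the real files you would pass the file path in place of io.StringIO(...). If read_csv raises a UnicodeDecodeError, passing an explicit encoding argument may help.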
Using the dataframe you've created from u.data, produce:
Finally, take the item numbers to which user 42 gave a rating greater than that user's mean rating. Using the data from u.item, give the titles of the movies corresponding to those item numbers.
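This last step can be sketched as a filter followed by a lookup. The example uses two small made-up dataframes in place of the ones loaded from u.data and u.item; the column names and titles are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the u.data and u.item dataframes.
ratings = pd.DataFrame({
    "user_id": [42, 42, 42, 42, 7],
    "item_id": [1, 2, 3, 4, 1],
    "rating": [5, 2, 4, 1, 3],
})
items = pd.DataFrame({
    "movie_id": [1, 2, 3, 4],
    "movie_title": ["Alpha", "Beta", "Gamma", "Delta"],
})

# Keep only user 42's ratings, then those above that user's own mean.
user42 = ratings[ratings["user_id"] == 42]
above_mean = user42[user42["rating"] > user42["rating"].mean()]

# Look up the titles whose ids appear among the selected item numbers.
liked_ids = above_mean["item_id"]
titles = items[items["movie_id"].isin(liked_ids)]["movie_title"].tolist()

print(titles)
```

A merge of the two dataframes on the item id would work as well; isin is used here because only the titles are needed.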