Data and Databases

Homework 4

Film Snobs

In class we computed the following:


Can you figure out the mean ratings of movies:

  1. if you eliminate those without_discernment
  2. if you eliminate the pretentious movie snobs
  3. if you eliminate both
In [ ]:

What are the top ten ranked movies by title:

  1. if you include only those without discernment
  2. if you include only the pretentious movie snobs
  3. if you exclude both
In [ ]:

Fun with vectors

In class, we used cosine similarity to determine users most similar to user 1. Modify our little program to find the films most similar to a single film of your choice.
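As a starting point, here is a minimal sketch of the film-to-film version, assuming the ratings are stored as a dictionary mapping each film title to a `{user_id: rating}` dictionary. The toy data and the names `ratings_by_film`, `cosine_similarity`, and `most_similar_films` are hypothetical and are not the class program; missing ratings are treated as zero.

```python
import math

# Hypothetical toy data: film title -> {user_id: rating}.
ratings_by_film = {
    "Film A": {1: 5, 2: 4, 3: 1},
    "Film B": {1: 4, 2: 5, 3: 2},
    "Film C": {1: 1, 2: 2, 3: 5},
}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)  # only users who rated both films contribute
    dot = sum(u[k] * v[k] for k in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar_films(target, ratings, n=10):
    """Return up to n (title, similarity) pairs, most similar first."""
    scores = [
        (title, cosine_similarity(ratings[target], vector))
        for title, vector in ratings.items()
        if title != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

print(most_similar_films("Film A", ratings_by_film))
```

If your class data is a user-by-film dataframe, the same idea applies to its columns (films) rather than its rows (users).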

In [1]:

JUST FOR FUN: Only if you want, since several of you suggested it:

Can you write a program that creates a new dataframe that compares every user to every other user (or every film to every other film)?

Basically you need to add an additional for loop to what we did in class.

In [ ]:
#you really can skip this one!

Capital! Capitol Words API

Obtain an API key from the Capitol Words project. This key serves more generally as an API key for the Sunlight Foundation.

In [1]:
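api_key = ""  # PUT YOURS HERE
phrase = ""   # PUT YOURS HERE; MINE WAS "national+security+agency" (use "+" between words)
page = 0      # which page of results to request (assumed to start at 0; check the documentation)

url = "" + phrase + "&page=" + str(page) + "&apikey=" + api_key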

Search for the results from some phrase that interests you.

If you get a million results be more specific; if you get fewer than 50, be less specific.

Use urllib and json.loads to import the result into Python.
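One way this step might look, as a sketch that assumes Python 3 (where the function lives in `urllib.request`; in Python 2 it is `urllib.urlopen`). The helper names `parse_json_bytes` and `fetch_json` are hypothetical, and the `results` and `speaker_last` fields in the sample payload are assumptions about the response shape, so check them against what the API actually returns.

```python
import json
import urllib.request

def parse_json_bytes(raw_bytes, encoding="utf-8"):
    """Decode raw response bytes and parse them as JSON."""
    return json.loads(raw_bytes.decode(encoding))

def fetch_json(url, timeout=30):
    """Download a URL and return the parsed JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_json_bytes(response.read())

# Example with a literal payload, so no network access is needed.
# The field names here are assumed, not taken from the API documentation.
sample = parse_json_bytes(b'{"results": [{"speaker_last": "Smith"}]}')
print(sample["results"][0]["speaker_last"])
```

With a real query you would call `fetch_json(url)` using the `url` built in the cell above.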

If there are more than 50 results, you would need to run the query again to get the full results, increasing the page variable by one for each set of 50. (The documentation says 100, but that appears to be wrong.) Let's skip that for now!

In [ ]:

Who is the most frequent speaker in your data set? Show how you computed it.
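A sketch of one possible approach using `collections.Counter`, assuming the parsed response contains a list of dictionaries and that each one stores the speaker under a key such as `speaker_last`. Both the key name and the sample data are assumptions; substitute whatever field your results actually use.

```python
from collections import Counter

def most_frequent_speaker(results, key="speaker_last"):
    """Return (speaker, count) for the most common speaker, or None."""
    # Skip entries where the speaker field is missing or empty.
    counts = Counter(r[key] for r in results if r.get(key))
    return counts.most_common(1)[0] if counts else None

# Hypothetical sample in the assumed shape.
sample_results = [
    {"speaker_last": "Smith"},
    {"speaker_last": "Jones"},
    {"speaker_last": "Smith"},
    {"speaker_last": None},
]
print(most_frequent_speaker(sample_results))
```

Last names alone can collide between two different members, so combining several identifying fields into the counted value may be safer if your results provide them.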

In [ ]:

Finally, look at the documentation for the API. Look at the boldfaced section phrases.json.

Using the example there, perform your own queries to request:

  1. the top words in August 2011 by count.
  2. the top words in August 2011 by tfidf

Explain briefly why count and tfidf are likely different, and give an example. (I've added a description of tfidf to our notes for Tuesday, and Wikipedia's pretty good on the topic.)
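To make the difference concrete, here is a toy illustration that uses a common textbook variant, tf-idf = count × log(N / document frequency). This is an assumption: the API's exact formula may differ, and the three-document corpus and the function names are invented for the example.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already split into words.
docs = {
    "doc1": "the senate debated the budget".split(),
    "doc2": "the house passed the bill".split(),
    "doc3": "the debt ceiling the debt limit the vote".split(),
}

def top_by_count(words):
    """Most frequent word in a single document."""
    return Counter(words).most_common(1)[0][0]

def tfidf_scores(name, corpus):
    """Score each word in one document by count * log(N / doc frequency)."""
    n_docs = len(corpus)
    counts = Counter(corpus[name])
    scores = {}
    for word, count in counts.items():
        # Number of documents in which the word appears at least once.
        doc_freq = sum(1 for words in corpus.values() if word in words)
        scores[word] = count * math.log(n_docs / doc_freq)
    return scores

scores = tfidf_scores("doc3", docs)
print(top_by_count(docs["doc3"]))    # top word by raw count
print(max(scores, key=scores.get))   # top word by tf-idf
```

In this corpus "the" wins on raw count, but it appears in every document, so its tf-idf score is zero and "debt", which is frequent only in `doc3`, ranks first.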

In [ ]: