In class we computed the following:
without_discernment_boolean = ratings.mean(axis=1) > 4.5
pretentious_movie_snob_boolean = ratings.mean(axis=1) < 2
Can you figure out the mean ratings of the movies?
What are the top ten ranked movies, listed by title?
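As a minimal sketch of both questions, assuming a ratings DataFrame shaped like the one from class (rows are users, columns are movie titles; the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings DataFrame standing in for the class data;
# NaN marks a movie a user did not rate.
ratings = pd.DataFrame(
    {"Toy Story": [5, 4, np.nan],
     "Heat":      [3, np.nan, 2],
     "Casino":    [4, 5, 5]},
    index=["user1", "user2", "user3"],
)

# Mean rating per movie: average down the rows (axis=0), skipping NaNs.
movie_means = ratings.mean(axis=0)

# Top ten movies by mean rating, reported by title.
top_ten = movie_means.sort_values(ascending=False).head(10)
print(top_ten)
```

Note the contrast with the class snippet: `axis=1` averaged across each user's row, while `axis=0` averages down each movie's column.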
In class, we used cosine similarity to determine users most similar to user 1. Modify our little program to find the films most similar to a single film of your choice.
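One way to adapt the user-similarity approach to films is to compare columns instead of rows. A sketch on made-up data, with missing ratings treated as 0 (one common convention; the class code may have handled NaNs differently):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings DataFrame (users x movie titles); NaN = no rating.
ratings = pd.DataFrame(
    {"Toy Story": [5, 4, np.nan, 1],
     "Heat":      [3, np.nan, 2, 1],
     "Casino":    [4, 5, np.nan, np.nan]},
    index=["user1", "user2", "user3", "user4"],
)

def cosine_sim(a, b):
    """Cosine similarity of two rating vectors, treating NaN as 0."""
    a, b = a.fillna(0).values, b.fillna(0).values
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

target = "Toy Story"  # pick any film you like here
sims = {other: cosine_sim(ratings[target], ratings[other])
        for other in ratings.columns if other != target}

# Films most similar to the chosen film, highest similarity first.
for title, s in sorted(sims.items(), key=lambda kv: kv[1], reverse=True):
    print(title, round(s, 3))
```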
JUST FOR FUN: Only if you want, since several of you suggested it:
can you write a program that creates a new dataframe that compares every user to every other user (or every film to every other film)?
Basically, you need to add an additional for loop to what we did in class.
#you really can skip this one!
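For those attempting it anyway, the "additional for loop" idea can be sketched like this for the film-vs-film case (made-up data; the user-vs-user version just loops over `ratings.index` and rows instead):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings DataFrame (users x movie titles); NaN = no rating.
ratings = pd.DataFrame(
    {"Toy Story": [5, 4, np.nan],
     "Heat":      [3, np.nan, 2],
     "Casino":    [4, 5, 5]},
    index=["user1", "user2", "user3"],
)

def cosine_sim(a, b):
    """Cosine similarity of two rating vectors, treating NaN as 0."""
    a, b = a.fillna(0).values, b.fillna(0).values
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

titles = ratings.columns
sim = pd.DataFrame(index=titles, columns=titles, dtype=float)
for t1 in titles:        # the loop from class: one film against the rest
    for t2 in titles:    # the additional loop: repeat for every film
        sim.loc[t1, t2] = cosine_sim(ratings[t1], ratings[t2])
print(sim)
```

The result is a square DataFrame: each film's similarity to itself is 1 on the diagonal, and the table is symmetric.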
Obtain an API key from the Capitol Words project: http://capitolwords.org/api/1/. This serves more generally as an API key for the Sunlight Foundation: http://sunlightfoundation.com/api/
page = 1
api_key = ""  ## PUT YOURS HERE
phrase = ""   ## PUT YOURS HERE; MINE WAS "national+security+agency" -- use "+" between words
url = "http://capitolwords.org/api/1/text.json?phrase=" + phrase + "&page=" + str(page) + "&apikey=" + api_key
Search for the results from some phrase that interests you.
If you get a million results, be more specific; if you get fewer than 50, be less specific.
Use urllib and json.loads to import the result into Python.
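A minimal sketch of the fetch-and-parse step, with the network call left commented out so you can run the parsing part without a key (the `num_found`/`results`/`speaker_last` field names are assumptions about the API's response schema, and the placeholders are just that):

```python
import json
from urllib.request import urlopen  # Python 2 used urllib/urllib2 instead

page = 1
api_key = "YOUR_KEY_HERE"            # placeholder: substitute your key
phrase = "national+security+agency"  # substitute your own phrase
url = ("http://capitolwords.org/api/1/text.json?phrase=" + phrase +
       "&page=" + str(page) + "&apikey=" + api_key)

# A live run would fetch and decode the response like this:
# data = json.loads(urlopen(url).read())

# To make the parsing step concrete, here is a tiny made-up response:
sample = ('{"num_found": 2, "results": '
          '[{"speaker_last": "Smith"}, {"speaker_last": "Jones"}]}')
data = json.loads(sample)
print(data["num_found"])
```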
If there are more than 50 results, you'd need to run the query again, with the page
variable increased by one for each set of 50, to get the full results. (The documentation says 100 per page, but that appears to be wrong.) Let's skip that for now!
Who is the most frequent speaker in your data set? Show how you computed it.
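One possible way to tally speakers, assuming you have a list of result dicts like those returned by text.json (the `speaker_last` field name and the data below are illustrative assumptions):

```python
from collections import Counter

# Hypothetical parsed results from text.json; "speaker_last" is an
# assumption about the response schema.
results = [
    {"speaker_last": "Smith"},
    {"speaker_last": "Jones"},
    {"speaker_last": "Smith"},
    {"speaker_last": "Lee"},
]

# Count how many results each speaker accounts for.
counts = Counter(r["speaker_last"] for r in results)
speaker, n = counts.most_common(1)[0]
print(speaker, n)
```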
Finally, look at the documentation for the API at http://capitolwords.org/api/1/, in particular the boldfaced section phrases.json.
Using the example there, perform your own queries to request:
Explain briefly why count and tfidf are likely to differ, and give an example. (I've added a description of tfidf to our notes for Tuesday, and Wikipedia's pretty good on the topic.)
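To see the intuition before writing your own explanation, here is a toy hand-rolled tf-idf on a made-up three-document corpus (one common tf-idf variant; the API may weight things differently):

```python
import math

# Toy corpus: "the" has the highest raw count, but it appears in
# every document, so its idf is log(N/N) = 0.
docs = [
    ["the", "budget", "bill"],
    ["the", "security", "bill"],
    ["the", "the", "agency"],
]
N = len(docs)

def tfidf(word, doc):
    tf = doc.count(word) / len(doc)    # term frequency within one document
    df = sum(word in d for d in docs)  # number of documents containing word
    return tf * math.log(N / df)       # tf x inverse document frequency

# Despite its high count, "the" scores 0 everywhere, while the rarer
# "agency" gets a positive weight in the one document it appears in.
print(tfidf("the", docs[2]), tfidf("agency", docs[2]))
```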