Number munging: vectors, Pandas, probabilities

In [ ]:
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Make the graphs a bit prettier. The old pandas option 'display.mpl_style'
# has been removed from pandas, so we set a matplotlib style directly.
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
In [ ]:
#get our data--temporary home
!wget http://www.columbia.edu/~mj340/ml-100k.tar.gz
In [ ]:
!wget http://www.columbia.edu/~mj340/HMXPC13_DI_v2_5-14-14.csv.gz
In [ ]:
!gunzip HMXPC13_DI_v2_5-14-14.csv.gz
In [ ]:
!tar -zxvf ml-100k.tar.gz
In [ ]:
# Check the contents of the directory
!ls

Our ritual: Exploratory data analysis

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

. . . proceeding via a ‘dustbowl’ empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.

- Einhorn, “Alchemy in the Behavioral Sciences,” 1972

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

From data to databases to data mining

  • move from accessing and manipulating data to performing ever more complicated queries on our data

Pandas: first-line Python tool for EDA

  • rich data structures
  • powerful ways to slice, dice, reformat, fix, and eliminate data
    • a taste of what it can do
  • rich queries like databases
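
As a small taste of the points above, here is a minimal sketch on a made-up table. The column names and values are invented for illustration and are not part of the course data. It shows a DataFrame as a labeled data structure, a boolean-mask slice, and a database-style group-and-aggregate query.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "city": ["Akron", "Akron", "Boston", "Boston"],
    "year": [2012, 2013, 2012, 2013],
    "sales": [10.0, 12.0, 20.0, 26.0],
})

# Slice and dice: keep only the 2013 rows, and only two of the columns
recent = toy.loc[toy["year"] == 2013, ["city", "sales"]]

# Query: roughly SQL's SELECT city, AVG(sales) FROM toy GROUP BY city
mean_sales = toy.groupby("city")["sales"].mean()

print(recent)
print(mean_sales)
```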

Pandas: charismatic megafauna

In [ ]:
 
In [ ]:
CPI={"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957} #http://www.bls.gov/cpi/home.htm

The CPI provides "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." A higher number means it costs more to buy the same goods. The index is normalized so that the 1982-84 average equals 100.

We can thus use it to measure the effects of inflation on the value of houses in a toy example.
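
One way the toy example might go is sketched below. The CPI values are the ones given above. The house prices are hypothetical, invented purely for illustration. The idea is that a price from year y is converted to 2013 dollars by multiplying it by CPI(2013) / CPI(y).

```python
import pandas as pd

# CPI values as given above (index base: 1982-84 = 100)
CPI = {"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957}
cpi = pd.Series(CPI)

# Hypothetical nominal house prices (same sticker price every year),
# invented for illustration only
nominal = pd.Series(
    {"2010": 200000.0, "2011": 200000.0, "2012": 200000.0, "2013": 200000.0}
)

# Convert each year's price to 2013 dollars:
# real = nominal * CPI(target year) / CPI(price's own year).
# pandas aligns the two Series on their shared year labels.
real_2013 = nominal * cpi["2013"] / cpi

# Cumulative inflation between 2010 and 2013, as a fraction
inflation_2010_2013 = cpi["2013"] / cpi["2010"] - 1

print(real_2013)
print(inflation_2010_2013)
```

Because the sticker price is flat while the CPI rises, the earlier prices are worth more in 2013 dollars. In other words, a house whose nominal price never changes is losing real value.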

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

Part II: Movie ratings-recommender engines

Election Mining

Campaigns are moving away from the meaningless labels of pollsters and newsweeklies — “Nascar dads” and “waitress moms” — and moving toward treating each voter as a separate person. In 2012 you didn’t just have to be an African-American from Akron or a suburban married female age 45 to 54. More and more, the information age allows people to be complicated, contradictory and unique. New technologies and an abundance of data may rattle the senses, but they are also bringing a fresh appreciation of the value of the individual to American politics.

- Ethan Roeder, “I Am Not Big Brother” http://www.nytimes.com/2012/12/06/opinion/i-am-not-big-brother.html?_r=0.
In [ ]:
 
In [ ]:
 
In [ ]:
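# Column names for the pipe-delimited, header-less u.item file
film_columns = ["movie id", "movie_title", "release_date", "video_release_date", "IMDb_URL",
                "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
                "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
                "Romance", "Sci-Fi", "Thriller", "War", "Western"]
# u.item contains accented characters that are not valid UTF-8, so read it as latin-1
films = pd.read_csv('./ml-100k/u.item', sep="|", names=film_columns, encoding="latin-1")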
In [ ]:
users=pd.read_csv('./ml-100k/u.user', sep="|", names=["user_id", "age", "gender","occupation","zip_code"], index_col="user_id")
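
To see what the `sep`, `names`, and `index_col` arguments do without needing the downloaded files, here is a minimal sketch that reads three invented rows in the same pipe-delimited, header-less layout. The rows, the `toy_users` name, and the `dtype` argument are assumptions added for illustration. They are not part of the original cell.

```python
import io
import pandas as pd

# Three invented rows laid out like u.user: id|age|gender|occupation|zip
sample = io.StringIO(
    "1|30|F|engineer|10027\n"
    "2|41|M|teacher|02139\n"
    "3|30|M|engineer|60614\n"
)

toy_users = pd.read_csv(
    sample,
    sep="|",                      # fields are separated by pipes, not commas
    names=["user_id", "age", "gender", "occupation", "zip_code"],  # no header row
    index_col="user_id",          # use the id as the row label
    dtype={"zip_code": str},      # keep leading zeros in zip codes
)

# A database-style query: average age per occupation
mean_age_by_occupation = toy_users.groupby("occupation")["age"].mean()

print(toy_users)
print(mean_age_by_occupation)
```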
In [ ]: