Number munging: vectors, Pandas, probabilities

This is the more-or-less raw version of what we did today. I've added a few comments here and there to head off confusion.

In [1]:
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier (older pandas only; this option was removed in later releases)
plt.rcParams['figure.figsize'] = (15, 5)
In [ ]:
#get our data--temporary home--these will disappear soon
!wget http://www.columbia.edu/~mj340/ml-100k.tar.gz
In [ ]:
!wget http://www.columbia.edu/~mj340/HMXPC13_DI_v2_5-14-14.csv.gz
In [ ]:
!gunzip HMXPC13_DI_v2_5-14-14.csv.gz
In [ ]:
!tar -zxvf ml-100k.tar.gz

Our ritual: Exploratory data analysis

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

. . . proceeding via a ‘dustbowl’ empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.

- Einhorn, “Alchemy in the Behavioral Sciences,” 1972

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

From data to databases to data mining

  • move from accessing and manipulating data to performing ever more complicated queries on our data

Pandas: the first-line Python tool for EDA

  • rich data structures
  • powerful ways to slice, dice, reformat, fix, and eliminate data
    • today: a taste of what it can do
  • rich queries like databases

Pandas: charismatic megafauna

In [3]:
import pandas as pd
In [5]:
CPI={"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957} #http://www.bls.gov/cpi/home.htm
In [6]:
CPI["2011"]
Out[6]:
224.939

The CPI provides "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." A higher number means it costs more to buy the same goods. The index is scaled so that the 1982-84 average equals 100.

We can thus use it to measure the effects of inflation on the value of houses in a toy example.

In [7]:
CPI_series=pd.Series(CPI)
In [8]:
CPI_series
Out[8]:
2010    218.056
2011    224.939
2012    229.594
2013    232.957
dtype: float64
In [9]:
House_sale_mean={"2010":100000, "2011":100000, "2012":100000, "2013":100000}
In [10]:
house_sale_series=pd.Series(House_sale_mean)
In [11]:
house_sale_series
Out[11]:
2010    100000
2011    100000
2012    100000
2013    100000
dtype: int64
In [13]:
(house_sale_series/CPI_series)
Out[13]:
2010    458.597791
2011    444.564971
2012    435.551452
2013    429.263770
dtype: float64
In [15]:
(house_sale_series/CPI_series)*100
Out[15]:
2010    45859.779139
2011    44456.497095
2012    43555.145169
2013    42926.376971
dtype: float64
In [16]:
inflation_adjusted=(house_sale_series/CPI_series)*100
In [18]:
inflation_adjusted.plot(title="Sorry Kids! Blame X, where X is the guy in office")
Out[18]:
<matplotlib.axes.AxesSubplot at 0x7fd6358056d0>

Dataframes

In [19]:
df=pd.read_csv('./HMXPC13_DI_v2_5-14-14.csv', sep=',')

Note that df is just the name we gave our data structure. It is not the function that turns something into a dataframe; that function is pd.DataFrame. I failed to explain this accurately in class. So df["blah"] indexes into the particular dataframe df; it does not call a function named df on "blah".
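
To make the distinction concrete, here is a small sketch on made-up data. The variable name `toy` and its two rows are assumptions for illustration; any name would do, which is exactly the point.

```python
import pandas as pd

# pd.DataFrame is the constructor; `toy` is only a variable name bound to the result.
toy = pd.DataFrame({
    "course_id": ["HarvardX/CS50x/2012", "MITx/6.00x/2013_Spring"],
    "nchapters": [1.0, 4.0],
})

# Square brackets index into this particular object; nothing is being called.
chapters = toy["nchapters"]

print(type(toy).__name__)       # DataFrame
print(type(chapters).__name__)  # Series
```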

In [20]:
df
Out[20]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
0 HarvardX/CB22x/2013_Spring MHxPC130442623 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 2013-11-17 NaN 9 NaN NaN 0 NaN 1
1 HarvardX/CS50x/2012 MHxPC130442623 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 9 NaN 1 0 NaN 1
2 HarvardX/CB22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 2013-11-17 NaN 16 NaN NaN 0 NaN 1
3 HarvardX/CS50x/2012 MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-09-17 NaN NaN 16 NaN NaN 0 NaN 1
4 HarvardX/ER22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 NaN NaN 16 NaN NaN 0 NaN 1
5 HarvardX/PH207x/2012_Fall MHxPC130275857 1 1 1 0 United States NaN NaN NaN 0 2012-09-17 2013-05-23 502 16 50 12 0 NaN NaN
6 HarvardX/PH278x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 NaN NaN 16 NaN NaN 0 NaN 1
7 HarvardX/CB22x/2013_Spring MHxPC130539455 1 1 0 0 France NaN NaN NaN 0 2013-01-01 2013-05-14 42 6 NaN 3 0 NaN NaN
8 HarvardX/CB22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-18 2013-03-17 70 3 NaN 3 0 NaN NaN
9 HarvardX/CS50x/2012 MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2012-10-20 NaN NaN 12 NaN 3 0 NaN 1
10 HarvardX/ER22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-23 2013-06-14 17 2 NaN 2 0 NaN NaN
11 HarvardX/ER22x/2013_Spring MHxPC130198098 1 1 0 0 United States NaN NaN NaN 0 2013-06-17 2013-06-17 32 1 NaN 3 0 NaN NaN
12 HarvardX/CB22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0.07 2013-01-24 2013-08-03 175 9 NaN 7 0 NaN NaN
13 HarvardX/CS50x/2012 MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-06-27 NaN NaN 2 NaN 2 0 NaN 1
14 HarvardX/ER22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-12-19 2013-08-17 78 5 NaN 4 0 NaN NaN
15 HarvardX/PH207x/2012_Fall MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-07-26 2013-01-16 75 14 5 2 0 NaN NaN
16 HarvardX/PH278x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-07-30 2013-08-27 11 2 2 1 0 NaN NaN
17 HarvardX/CS50x/2012 MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 11 NaN 1 0 NaN 1
18 HarvardX/PH207x/2012_Fall MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-25 2012-12-04 56 11 1 2 1 NaN NaN
19 HarvardX/CS50x/2012 MHxPC130063375 1 1 0 0 Unknown/Other NaN NaN NaN 0 2012-10-19 NaN NaN NaN NaN 1 0 NaN 1
20 HarvardX/CS50x/2012 MHxPC130094371 1 1 0 0 United States NaN NaN NaN 0 2013-03-03 2013-03-03 7 1 NaN 2 0 NaN NaN
21 HarvardX/CS50x/2012 MHxPC130229084 1 1 0 0 Mexico NaN NaN NaN 0 2012-10-15 NaN NaN NaN NaN 1 0 NaN 1
22 HarvardX/CS50x/2012 MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-10-24 NaN NaN 2 NaN 1 0 NaN 1
23 HarvardX/ER22x/2013_Spring MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-12-20 2013-05-18 15 2 NaN 2 0 NaN NaN
24 HarvardX/CS50x/2012 MHxPC130417650 1 1 0 0 Australia NaN NaN NaN 0 2012-10-29 2013-03-04 1 1 NaN 2 0 NaN NaN
25 HarvardX/CS50x/2012 MHxPC130506580 1 0 0 0 United States NaN NaN NaN 0 2012-09-04 NaN NaN NaN NaN NaN 0 NaN NaN
26 HarvardX/CS50x/2012 MHxPC130298257 1 0 0 0 United States NaN NaN NaN 0 2012-09-05 NaN NaN NaN NaN 3 0 NaN 1
27 HarvardX/CS50x/2012 MHxPC130500569 1 1 0 0 United States NaN NaN NaN 0 2012-10-22 2013-03-30 6 1 NaN 5 0 NaN NaN
28 HarvardX/CS50x/2012 MHxPC130466479 1 1 0 0 Unknown/Other NaN NaN NaN 0 2013-01-07 NaN NaN NaN NaN 1 0 NaN 1
29 HarvardX/CB22x/2013_Spring MHxPC130340959 1 1 0 0 United States NaN NaN NaN 0.05 2013-02-11 2013-04-06 285 8 NaN 4 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
641108 MITx/6.002x/2013_Spring MHxPC130140735 1 1 0 0 United States Bachelor's 1991 m NaN 2013-09-07 2013-09-07 59 1 5 3 0 NaN NaN
641109 MITx/6.00x/2013_Spring MHxPC130493130 1 0 0 0 United Kingdom Master's 1977 m NaN 2013-09-07 NaN NaN NaN NaN 2 0 NaN 1
641110 MITx/6.00x/2013_Spring MHxPC130400592 1 1 0 0 Other Europe Secondary 1992 m NaN 2013-09-07 2013-09-07 395 1 51 4 0 NaN NaN
641111 MITx/6.00x/2013_Spring MHxPC130109892 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 49 1 14 2 0 NaN NaN
641112 MITx/14.73x/2013_Spring MHxPC130183007 1 0 0 0 India Master's 1985 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641113 MITx/8.MReV/2013_Summer MHxPC130261281 1 1 0 0 India Secondary 1994 m 0 2013-09-07 2013-09-07 8 1 NaN 1 0 NaN NaN
641114 MITx/6.00x/2013_Spring MHxPC130481990 1 1 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 22 1 5 1 0 NaN NaN
641115 MITx/6.00x/2013_Spring MHxPC130528581 1 0 0 0 United States Bachelor's 1990 f NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641116 MITx/14.73x/2013_Spring MHxPC130555418 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641117 MITx/6.002x/2013_Spring MHxPC130408810 1 0 0 0 India Secondary 1993 m NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641118 MITx/6.00x/2013_Spring MHxPC130040184 1 0 0 0 United States Secondary 1991 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641119 MITx/6.002x/2013_Spring MHxPC130566049 1 0 0 0 Other Europe Master's 1982 m NaN 2013-09-07 2013-09-07 2 1 NaN 2 0 NaN NaN
641120 MITx/8.MReV/2013_Summer MHxPC130374105 1 1 0 0 India Bachelor's 1992 m 0 2013-09-07 2013-09-07 49 1 NaN 1 0 NaN NaN
641121 MITx/6.00x/2013_Spring MHxPC130282999 1 0 0 0 Other Europe Master's 1979 m NaN 2013-09-07 NaN NaN NaN NaN 7 0 NaN 1
641122 MITx/8.MReV/2013_Summer MHxPC130556398 1 0 0 0 India Bachelor's 1985 m 0 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641123 MITx/6.00x/2013_Spring MHxPC130573334 1 0 0 0 Spain Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641124 MITx/6.00x/2013_Spring MHxPC130505931 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 59 1 NaN 2 0 NaN NaN
641125 MITx/6.002x/2013_Spring MHxPC130280976 1 0 0 0 United States Bachelor's NaN m NaN 2013-09-07 2013-09-07 2 1 NaN NaN 0 NaN NaN
641126 MITx/6.00x/2013_Spring MHxPC130137331 1 1 0 0 United States Secondary 1992 m NaN 2013-09-07 2013-09-07 251 1 77 4 0 NaN NaN
641127 MITx/6.002x/2013_Spring MHxPC130271624 1 0 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641128 MITx/14.73x/2013_Spring MHxPC130256541 1 1 0 0 United States Master's 1982 m NaN 2013-09-07 2013-09-07 51 1 1 1 0 NaN NaN
641129 MITx/6.00x/2013_Spring MHxPC130021638 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641130 MITx/14.73x/2013_Spring MHxPC130591057 1 0 0 0 Canada Bachelor's NaN f NaN 2013-09-07 2013-09-07 6 1 NaN NaN 0 NaN NaN
641131 MITx/8.02x/2013_Spring MHxPC130226305 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 2013-09-07 11 1 NaN 2 0 NaN NaN
641132 MITx/6.002x/2013_Spring MHxPC130030805 1 1 0 0 Pakistan Master's 1989 m NaN 2013-09-07 2013-09-07 29 1 NaN 1 0 NaN NaN
641133 MITx/6.00x/2013_Spring MHxPC130184108 1 1 0 0 Canada Bachelor's 1991 m NaN 2013-09-07 2013-09-07 97 1 4 2 0 NaN NaN
641134 MITx/6.00x/2013_Spring MHxPC130359782 1 0 0 0 Other Europe Bachelor's 1991 f NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641135 MITx/6.002x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641136 MITx/6.00x/2013_Spring MHxPC130098513 1 1 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 74 1 14 1 0 NaN NaN
641137 MITx/8.02x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 NaN NaN 1 NaN NaN 0 NaN 1

641138 rows × 20 columns

In [21]:
df["course_id"] #this evaluates to a Series
Out[21]:
0      HarvardX/CB22x/2013_Spring
1             HarvardX/CS50x/2012
2      HarvardX/CB22x/2013_Spring
3             HarvardX/CS50x/2012
4      HarvardX/ER22x/2013_Spring
5       HarvardX/PH207x/2012_Fall
6     HarvardX/PH278x/2013_Spring
7      HarvardX/CB22x/2013_Spring
8      HarvardX/CB22x/2013_Spring
9             HarvardX/CS50x/2012
10     HarvardX/ER22x/2013_Spring
11     HarvardX/ER22x/2013_Spring
12     HarvardX/CB22x/2013_Spring
13            HarvardX/CS50x/2012
14     HarvardX/ER22x/2013_Spring
...
641123     MITx/6.00x/2013_Spring
641124     MITx/6.00x/2013_Spring
641125    MITx/6.002x/2013_Spring
641126     MITx/6.00x/2013_Spring
641127    MITx/6.002x/2013_Spring
641128    MITx/14.73x/2013_Spring
641129     MITx/6.00x/2013_Spring
641130    MITx/14.73x/2013_Spring
641131     MITx/8.02x/2013_Spring
641132    MITx/6.002x/2013_Spring
641133     MITx/6.00x/2013_Spring
641134     MITx/6.00x/2013_Spring
641135    MITx/6.002x/2013_Spring
641136     MITx/6.00x/2013_Spring
641137     MITx/8.02x/2013_Spring
Name: course_id, Length: 641138, dtype: object
In [25]:
df.ix[100,16] #this picks out a particular value at [row, column]
Out[25]:
nan
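
A note for anyone rerunning this on a newer pandas: `.ix` was deprecated in pandas 0.20 and removed in 1.0. The sketch below shows the two explicit replacements on a toy frame; the frame and its values are made up, and `by_position` and `by_label` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the course table; the values are invented.
toy = pd.DataFrame(
    {"course_id": ["a", "b", "c"], "nchapters": [np.nan, 1.0, 12.0]},
    index=[0, 1, 2],
)

by_position = toy.iloc[2, 1]          # row position 2, column position 1
by_label = toy.loc[2, "nchapters"]    # row label 2, column label "nchapters"

print(by_position, by_label)
```

With the default integer index the row label and the row position coincide, which is why `.ix[100,16]` behaved like a purely positional lookup in the cell above.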
In [29]:
df["course_id"][2240:2250]
Out[29]:
2240           HarvardX/CS50x/2012
2241    HarvardX/ER22x/2013_Spring
2242     HarvardX/PH207x/2012_Fall
2243     HarvardX/PH207x/2012_Fall
2244           HarvardX/CS50x/2012
2245    HarvardX/CB22x/2013_Spring
2246    HarvardX/ER22x/2013_Spring
2247           HarvardX/CS50x/2012
2248           HarvardX/CS50x/2012
2249    HarvardX/CB22x/2013_Spring
Name: course_id, dtype: object
In [30]:
df[2240:2250]
Out[30]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
2240 HarvardX/CS50x/2012 MHxPC130136599 1 0 0 0 China NaN NaN NaN 0.0 2012-07-26 NaN NaN NaN NaN NaN 0 NaN NaN
2241 HarvardX/ER22x/2013_Spring MHxPC130024795 1 1 0 0 Other South Asia NaN NaN NaN NaN 2013-02-06 2013-03-20 25 1 NaN 2 0 NaN NaN
2242 HarvardX/PH207x/2012_Fall MHxPC130024795 1 1 1 1 Other South Asia NaN NaN NaN 0.91 2012-10-07 2013-04-23 8066 32 917 15 0 NaN NaN
2243 HarvardX/PH207x/2012_Fall MHxPC130524812 1 1 0 0 Other Africa NaN NaN NaN 0 2012-08-17 2012-10-21 403 3 90 2 0 NaN NaN
2244 HarvardX/CS50x/2012 MHxPC130493694 1 0 0 0 United States NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
2245 HarvardX/CB22x/2013_Spring MHxPC130527617 1 1 0 0 Germany NaN NaN NaN 0 2013-01-22 2013-03-14 6 2 NaN 1 0 NaN NaN
2246 HarvardX/ER22x/2013_Spring MHxPC130527617 1 0 0 0 Germany NaN NaN NaN 0 2013-01-22 NaN NaN 2 NaN NaN 0 NaN 1
2247 HarvardX/CS50x/2012 MHxPC130156189 1 0 0 0 India NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
2248 HarvardX/CS50x/2012 MHxPC130593566 1 1 0 0 India NaN NaN NaN 0 2012-07-25 2013-05-22 1 1 NaN 3 0 NaN NaN
2249 HarvardX/CB22x/2013_Spring MHxPC130404169 1 0 0 0 Canada NaN NaN NaN 0 2013-02-14 2013-03-14 4 2 NaN NaN 0 NaN NaN
In [31]:
df[666] #error on purpose!
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-31-342ddc6d1cab> in <module>()
----> 1 df[666]

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1682             return self._getitem_multilevel(key)
   1683         else:
-> 1684             return self._getitem_column(key)
   1685 
   1686     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1689         # get column
   1690         if self.columns.is_unique:
-> 1691             return self._get_item_cache(key)
   1692 
   1693         # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1050         res = cache.get(item)
   1051         if res is None:
-> 1052             values = self._data.get(item)
   1053             res = self._box_item_values(item, values)
   1054             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item)
   2535 
   2536             if not isnull(item):
-> 2537                 loc = self.items.get_loc(item)
   2538             else:
   2539                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key)
   1154         loc : int if unique index, possibly slice or mask if not
   1155         """
-> 1156         return self._engine.get_loc(_values_from_object(key))
   1157 
   1158     def get_value(self, series, key):

/usr/local/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3353)()

/usr/local/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3233)()

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11148)()

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11101)()

KeyError: 666
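
Why the error? Square brackets with a single value look for a column with that label, and there is no column called 666. Square brackets with a slice, as in df[2240:2250], select rows instead. A minimal sketch on a made-up frame (the names `toy`, `column_lookup_failed`, `rows`, and `row` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"course_id": ["a", "b", "c"], "grade": [0.0, 0.91, 0.05]})

# A bare scalar in [] is treated as a column label, so this fails.
column_lookup_failed = False
try:
    toy[1]
except KeyError as err:
    column_lookup_failed = True
    print("KeyError:", err)

rows = toy[1:3]    # a slice in [] selects rows
row = toy.loc[1]   # .loc selects the row with label 1

print(rows)
print(row)
```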
In [37]:
df.ix[666] #evaluates to series
Out[37]:
course_id            HarvardX/CS50x/2012
userid_DI                 MHxPC130297337
registered                             1
viewed                                 0
explored                               0
certified                              0
final_cc_cname_DI         United Kingdom
LoE_DI                               NaN
YoB                                  NaN
gender                               NaN
grade                                  0
start_time_DI                 2012-08-17
last_event_DI                        NaN
nevents                              NaN
ndays_act                            NaN
nplay_video                          NaN
nchapters                            NaN
nforum_posts                           0
roles                                NaN
incomplete_flag                      NaN
Name: 666, dtype: object
In [38]:
df.ix[[666]] #evaluates to data frame
Out[38]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
666 HarvardX/CS50x/2012 MHxPC130297337 1 0 0 0 United Kingdom NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
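
The single versus double brackets matter: one label gives back a Series, a list of labels gives back a DataFrame, even when the list has only one entry. A sketch on a toy frame, written with `.loc` since `.ix` is gone from newer pandas; the frame and the names `as_series` and `as_frame` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"course_id": ["a", "b", "c"], "grade": [0.0, 0.91, 0.05]})

as_series = toy.loc[1]     # single label: a Series indexed by column name
as_frame = toy.loc[[1]]    # list of labels: a one-row DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```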
In [41]:
df[["gender", "grade"]][1781:1787].mean()
Out[41]:
gender   NaN
grade      0
dtype: float64
In [47]:
country_series=df["final_cc_cname_DI"]
In [56]:
country_series=="France"
Out[56]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
641123    False
641124    False
641125    False
641126    False
641127    False
641128    False
641129    False
641130    False
641131    False
641132    False
641133    False
641134    False
641135    False
641136    False
641137    False
Name: final_cc_cname_DI, Length: 641138, dtype: bool
In [57]:
country_france_boolean_series=(country_series=='France')
In [58]:
country_france_boolean_series
Out[58]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
641123    False
641124    False
641125    False
641126    False
641127    False
641128    False
641129    False
641130    False
641131    False
641132    False
641133    False
641134    False
641135    False
641136    False
641137    False
Name: final_cc_cname_DI, Length: 641138, dtype: bool
In [61]:
df['nchapters']
Out[61]:
0    NaN
1      1
2    NaN
3    NaN
4    NaN
5     12
6    NaN
7      3
8      3
9      3
10     2
11     3
12     7
13     2
14     4
...
641123   NaN
641124     2
641125   NaN
641126     4
641127   NaN
641128     1
641129   NaN
641130   NaN
641131     2
641132     1
641133     2
641134   NaN
641135   NaN
641136     1
641137   NaN
Name: nchapters, Length: 641138, dtype: float64
In [62]:
df['nchapters']>0
Out[62]:
0     False
1      True
2     False
3     False
4     False
5      True
6     False
7      True
8      True
9      True
10     True
11     True
12     True
13     True
14     True
...
641123    False
641124     True
641125    False
641126     True
641127    False
641128     True
641129    False
641130    False
641131     True
641132     True
641133     True
641134    False
641135    False
641136     True
641137    False
Name: nchapters, Length: 641138, dtype: bool
In [63]:
nchapters_watched_series=(df['nchapters']>0)
In [64]:
df[nchapters_watched_series]
Out[64]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
1 HarvardX/CS50x/2012 MHxPC130442623 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 9 NaN 1 0 NaN 1
5 HarvardX/PH207x/2012_Fall MHxPC130275857 1 1 1 0 United States NaN NaN NaN 0 2012-09-17 2013-05-23 502 16 50 12 0 NaN NaN
7 HarvardX/CB22x/2013_Spring MHxPC130539455 1 1 0 0 France NaN NaN NaN 0 2013-01-01 2013-05-14 42 6 NaN 3 0 NaN NaN
8 HarvardX/CB22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-18 2013-03-17 70 3 NaN 3 0 NaN NaN
9 HarvardX/CS50x/2012 MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2012-10-20 NaN NaN 12 NaN 3 0 NaN 1
10 HarvardX/ER22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-23 2013-06-14 17 2 NaN 2 0 NaN NaN
11 HarvardX/ER22x/2013_Spring MHxPC130198098 1 1 0 0 United States NaN NaN NaN 0 2013-06-17 2013-06-17 32 1 NaN 3 0 NaN NaN
12 HarvardX/CB22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0.07 2013-01-24 2013-08-03 175 9 NaN 7 0 NaN NaN
13 HarvardX/CS50x/2012 MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-06-27 NaN NaN 2 NaN 2 0 NaN 1
14 HarvardX/ER22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-12-19 2013-08-17 78 5 NaN 4 0 NaN NaN
15 HarvardX/PH207x/2012_Fall MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-07-26 2013-01-16 75 14 5 2 0 NaN NaN
16 HarvardX/PH278x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-07-30 2013-08-27 11 2 2 1 0 NaN NaN
17 HarvardX/CS50x/2012 MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 11 NaN 1 0 NaN 1
18 HarvardX/PH207x/2012_Fall MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-25 2012-12-04 56 11 1 2 1 NaN NaN
19 HarvardX/CS50x/2012 MHxPC130063375 1 1 0 0 Unknown/Other NaN NaN NaN 0 2012-10-19 NaN NaN NaN NaN 1 0 NaN 1
20 HarvardX/CS50x/2012 MHxPC130094371 1 1 0 0 United States NaN NaN NaN 0 2013-03-03 2013-03-03 7 1 NaN 2 0 NaN NaN
21 HarvardX/CS50x/2012 MHxPC130229084 1 1 0 0 Mexico NaN NaN NaN 0 2012-10-15 NaN NaN NaN NaN 1 0 NaN 1
22 HarvardX/CS50x/2012 MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-10-24 NaN NaN 2 NaN 1 0 NaN 1
23 HarvardX/ER22x/2013_Spring MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-12-20 2013-05-18 15 2 NaN 2 0 NaN NaN
24 HarvardX/CS50x/2012 MHxPC130417650 1 1 0 0 Australia NaN NaN NaN 0 2012-10-29 2013-03-04 1 1 NaN 2 0 NaN NaN
26 HarvardX/CS50x/2012 MHxPC130298257 1 0 0 0 United States NaN NaN NaN 0 2012-09-05 NaN NaN NaN NaN 3 0 NaN 1
27 HarvardX/CS50x/2012 MHxPC130500569 1 1 0 0 United States NaN NaN NaN 0 2012-10-22 2013-03-30 6 1 NaN 5 0 NaN NaN
28 HarvardX/CS50x/2012 MHxPC130466479 1 1 0 0 Unknown/Other NaN NaN NaN 0 2013-01-07 NaN NaN NaN NaN 1 0 NaN 1
29 HarvardX/CB22x/2013_Spring MHxPC130340959 1 1 0 0 United States NaN NaN NaN 0.05 2013-02-11 2013-04-06 285 8 NaN 4 0 NaN NaN
33 HarvardX/CS50x/2012 MHxPC130356280 1 1 0 0 India NaN NaN NaN 0 2012-09-27 2013-03-31 3 1 NaN 1 0 NaN NaN
34 HarvardX/CS50x/2012 MHxPC130328890 1 1 0 0 Australia NaN NaN NaN 0 2012-12-10 2013-02-27 2 1 NaN 5 0 NaN NaN
36 HarvardX/CB22x/2013_Spring MHxPC130435030 1 1 0 0 Canada NaN NaN NaN 0 2013-02-20 2013-06-29 80 5 NaN 2 0 NaN NaN
37 HarvardX/CS50x/2012 MHxPC130435030 1 1 0 0 Canada NaN NaN NaN 0 2012-10-13 NaN NaN 1 NaN 1 0 NaN 1
41 HarvardX/CB22x/2013_Spring MHxPC130542822 1 1 0 0 United States NaN NaN NaN NaN 2012-12-20 2013-03-14 21 3 NaN 1 0 NaN NaN
43 HarvardX/ER22x/2013_Spring MHxPC130069044 1 0 0 0 United States NaN NaN NaN 0 2013-04-22 2013-04-23 1 1 NaN 2 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
641083 MITx/6.00x/2013_Spring MHxPC130238153 1 1 0 0 United States Secondary 1988 m NaN 2013-09-07 2013-09-07 116 1 20 2 0 NaN NaN
641084 MITx/6.002x/2013_Spring MHxPC130544641 1 1 0 0 India Bachelor's 1992 m NaN 2013-09-07 2013-09-07 245 1 56 1 0 NaN NaN
641085 MITx/8.02x/2013_Spring MHxPC130117789 1 1 0 0 India Less than Secondary 1996 m NaN 2013-09-07 2013-09-07 169 1 45 3 0 NaN NaN
641086 MITx/14.73x/2013_Spring MHxPC130122763 1 1 0 0 India Master's 1987 m NaN 2013-09-07 2013-09-07 3 1 NaN 1 0 NaN NaN
641088 MITx/6.00x/2013_Spring MHxPC130214187 1 0 0 0 India Bachelor's 1984 m NaN 2013-09-07 2013-09-07 1 1 NaN 3 0 NaN NaN
641089 MITx/6.002x/2013_Spring MHxPC130145710 1 1 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 82 1 16 5 0 NaN NaN
641099 MITx/6.00x/2013_Spring MHxPC130455967 1 1 0 0 United States Bachelor's 1988 m NaN 2013-09-07 2013-09-07 242 1 22 5 0 NaN NaN
641100 MITx/6.002x/2013_Spring MHxPC130040345 1 1 0 0 India Secondary 1993 f NaN 2013-09-07 2013-09-07 94 1 7 2 0 NaN NaN
641101 MITx/6.00x/2013_Spring MHxPC130298117 1 0 0 0 Russian Federation Less than Secondary 1997 m NaN 2013-09-07 NaN NaN NaN NaN 1 0 NaN 1
641102 MITx/6.00x/2013_Spring MHxPC130024301 1 1 0 0 Morocco Secondary 1994 m NaN 2013-09-07 2013-09-07 25 1 NaN 2 0 NaN NaN
641103 MITx/6.00x/2013_Spring MHxPC130097716 1 1 0 0 India Bachelor's 1981 m NaN 2013-09-07 2013-09-07 11 1 2 2 0 NaN NaN
641107 MITx/8.02x/2013_Spring MHxPC130347356 1 1 0 0 India Secondary 1994 m NaN 2013-09-07 2013-09-07 153 1 31 2 0 NaN NaN
641108 MITx/6.002x/2013_Spring MHxPC130140735 1 1 0 0 United States Bachelor's 1991 m NaN 2013-09-07 2013-09-07 59 1 5 3 0 NaN NaN
641109 MITx/6.00x/2013_Spring MHxPC130493130 1 0 0 0 United Kingdom Master's 1977 m NaN 2013-09-07 NaN NaN NaN NaN 2 0 NaN 1
641110 MITx/6.00x/2013_Spring MHxPC130400592 1 1 0 0 Other Europe Secondary 1992 m NaN 2013-09-07 2013-09-07 395 1 51 4 0 NaN NaN
641111 MITx/6.00x/2013_Spring MHxPC130109892 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 49 1 14 2 0 NaN NaN
641113 MITx/8.MReV/2013_Summer MHxPC130261281 1 1 0 0 India Secondary 1994 m 0 2013-09-07 2013-09-07 8 1 NaN 1 0 NaN NaN
641114 MITx/6.00x/2013_Spring MHxPC130481990 1 1 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 22 1 5 1 0 NaN NaN
641115 MITx/6.00x/2013_Spring MHxPC130528581 1 0 0 0 United States Bachelor's 1990 f NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641117 MITx/6.002x/2013_Spring MHxPC130408810 1 0 0 0 India Secondary 1993 m NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641119 MITx/6.002x/2013_Spring MHxPC130566049 1 0 0 0 Other Europe Master's 1982 m NaN 2013-09-07 2013-09-07 2 1 NaN 2 0 NaN NaN
641120 MITx/8.MReV/2013_Summer MHxPC130374105 1 1 0 0 India Bachelor's 1992 m 0 2013-09-07 2013-09-07 49 1 NaN 1 0 NaN NaN
641121 MITx/6.00x/2013_Spring MHxPC130282999 1 0 0 0 Other Europe Master's 1979 m NaN 2013-09-07 NaN NaN NaN NaN 7 0 NaN 1
641124 MITx/6.00x/2013_Spring MHxPC130505931 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 59 1 NaN 2 0 NaN NaN
641126 MITx/6.00x/2013_Spring MHxPC130137331 1 1 0 0 United States Secondary 1992 m NaN 2013-09-07 2013-09-07 251 1 77 4 0 NaN NaN
641128 MITx/14.73x/2013_Spring MHxPC130256541 1 1 0 0 United States Master's 1982 m NaN 2013-09-07 2013-09-07 51 1 1 1 0 NaN NaN
641131 MITx/8.02x/2013_Spring MHxPC130226305 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 2013-09-07 11 1 NaN 2 0 NaN NaN
641132 MITx/6.002x/2013_Spring MHxPC130030805 1 1 0 0 Pakistan Master's 1989 m NaN 2013-09-07 2013-09-07 29 1 NaN 1 0 NaN NaN
641133 MITx/6.00x/2013_Spring MHxPC130184108 1 1 0 0 Canada Bachelor's 1991 m NaN 2013-09-07 2013-09-07 97 1 4 2 0 NaN NaN
641136 MITx/6.00x/2013_Spring MHxPC130098513 1 1 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 74 1 14 1 0 NaN NaN

382385 rows × 20 columns

Stages of boolean indexing

  • select column whose values you're interested in, e.g. df['nchapters']
  • evaluate the result according to some test, e.g. df['nchapters']>0, which produces a boolean series
  • use the boolean series to select the rows from the dataframe
    • df[df['nchapters']>0] results in a dataframe with only the rows that meet the test
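
The three stages can be sketched on a toy frame as follows. The data are invented, and the names `toy`, `chapters`, `mask`, `opened`, and `same` are assumptions for illustration. Note that a NaN fails the test, just as row 0 did in the output above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "final_cc_cname_DI": ["United States", "France", "India", "France"],
    "nchapters": [np.nan, 3.0, 0.0, 12.0],
})

chapters = toy["nchapters"]   # 1. select the column
mask = chapters > 0           # 2. apply the test; NaN compares as False
opened = toy[mask]            # 3. keep only the rows where the mask is True

# The same thing written in one line.
same = toy[toy["nchapters"] > 0]

print(mask.tolist())
print(opened)
```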
In [54]:
country_not_france=(country_series!='France')
In [55]:
country_not_france
Out[55]:
0      True
1      True
2      True
3      True
4      True
5      True
6      True
7     False
8      True
9      True
10     True
11     True
12     True
13     True
14     True
...
641123    True
641124    True
641125    True
641126    True
641127    True
641128    True
641129    True
641130    True
641131    True
641132    True
641133    True
641134    True
641135    True
641136    True
641137    True
Name: final_cc_cname_DI, Length: 641138, dtype: bool
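
The `!=` mask is simply the complement of the `==` mask. A small sketch, on a made-up series, of two things that were not shown in class but follow directly: `~` flips a boolean series, and summing one counts the True values. The names `country`, `is_france`, and `not_france` are chosen for illustration.

```python
import pandas as pd

country = pd.Series(
    ["United States", "France", "India", "France"], name="final_cc_cname_DI"
)

is_france = country == "France"
not_france = country != "France"

# ~ negates a boolean Series element by element.
flipped = ~is_france

# True counts as 1 when summed, so .sum() counts matching rows.
print(is_france.sum(), not_france.sum())
```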
In [45]:
df[df['final_cc_cname_DI']=='France']
Out[45]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
7 HarvardX/CB22x/2013_Spring MHxPC130539455 1 1 0 0 France NaN NaN NaN 0 2013-01-01 2013-05-14 42 6 NaN 3 0 NaN NaN
256 HarvardX/CS50x/2012 MHxPC130595891 1 1 0 0 France NaN NaN NaN 0 2012-11-03 NaN NaN NaN NaN 2 0 NaN 1
423 HarvardX/CS50x/2012 MHxPC130247412 1 0 0 0 France NaN NaN NaN 0 2012-08-03 NaN NaN NaN NaN NaN 0 NaN NaN
449 HarvardX/CB22x/2013_Spring MHxPC130170185 1 1 0 0 France NaN NaN NaN 0.06 2013-01-23 2013-08-11 231 8 NaN 5 0 NaN NaN
730 HarvardX/CS50x/2012 MHxPC130254688 1 1 0 0 France NaN NaN NaN 0 2012-08-11 NaN NaN NaN NaN 2 0 NaN 1
807 HarvardX/ER22x/2013_Spring MHxPC130156847 1 1 0 0 France NaN NaN NaN 0 2013-01-14 2013-06-25 41 1 NaN 3 0 NaN NaN
928 HarvardX/CS50x/2012 MHxPC130058577 1 1 0 0 France NaN NaN NaN 0 2013-01-07 NaN NaN NaN NaN 1 0 NaN 1
1078 HarvardX/CS50x/2012 MHxPC130323627 1 0 0 0 France NaN NaN NaN 0 2012-08-23 NaN NaN NaN NaN NaN 0 NaN NaN
1079 HarvardX/PH207x/2012_Fall MHxPC130323627 1 0 0 0 France NaN NaN NaN 0 2012-08-23 NaN NaN NaN NaN NaN 0 NaN NaN
1171 HarvardX/CS50x/2012 MHxPC130270571 1 1 0 0 France NaN NaN NaN 0 2012-08-17 2013-06-07 10 2 NaN 4 0 NaN NaN
1236 HarvardX/CB22x/2013_Spring MHxPC130078849 1 0 0 0 France NaN NaN NaN 0 2013-02-15 2013-02-15 1 1 NaN NaN 0 NaN NaN
1237 HarvardX/CS50x/2012 MHxPC130078849 1 0 0 0 France NaN NaN NaN 0 2013-02-15 2013-02-15 1 1 NaN NaN 0 NaN NaN
1238 HarvardX/ER22x/2013_Spring MHxPC130078849 1 0 0 0 France NaN NaN NaN 0 2013-02-15 2013-02-15 1 1 NaN NaN 0 NaN NaN
1239 HarvardX/PH278x/2013_Spring MHxPC130078849 1 0 0 0 France NaN NaN NaN 0 2013-02-15 2013-02-15 1 1 NaN NaN 0 NaN NaN
1394 HarvardX/CS50x/2012 MHxPC130217709 1 1 1 0 France NaN NaN NaN 0 2013-01-21 2013-05-27 455 48 NaN 12 0 NaN NaN
1535 HarvardX/CS50x/2012 MHxPC130075520 1 1 0 0 France NaN NaN NaN 0 2012-09-12 NaN NaN 1 NaN 3 0 NaN 1
1536 HarvardX/ER22x/2013_Spring MHxPC130075520 1 0 0 0 France NaN NaN NaN 0 2013-03-03 NaN NaN 1 NaN NaN 0 NaN 1
1577 HarvardX/CS50x/2012 MHxPC130221204 1 0 0 0 France NaN NaN NaN 0 2012-08-23 NaN NaN NaN NaN NaN 0 NaN NaN
1607 HarvardX/ER22x/2013_Spring MHxPC130191457 1 0 0 0 France NaN NaN NaN 0 2013-01-14 NaN NaN 2 NaN NaN 0 NaN 1
1665 HarvardX/CS50x/2012 MHxPC130438857 1 1 0 0 France NaN NaN NaN 0 2012-10-08 NaN NaN NaN NaN 2 0 NaN 1
1825 HarvardX/ER22x/2013_Spring MHxPC130020425 1 1 0 0 France NaN NaN NaN 0 2012-12-21 2013-03-13 17 1 NaN 1 0 NaN NaN
1826 HarvardX/PH278x/2013_Spring MHxPC130020425 1 0 0 0 France NaN NaN NaN 0 2012-12-21 NaN NaN 1 NaN NaN 0 NaN 1
1947 HarvardX/CB22x/2013_Spring MHxPC130365176 1 0 0 0 France NaN NaN NaN 0 2012-12-21 2013-11-17 NaN 1 NaN NaN 0 NaN 1
1948 HarvardX/CS50x/2012 MHxPC130365176 1 1 0 0 France NaN NaN NaN 0 2012-12-21 2013-05-17 6 2 NaN 2 0 NaN NaN
1949 HarvardX/ER22x/2013_Spring MHxPC130365176 1 0 0 0 France NaN NaN NaN 0 2012-12-21 NaN NaN 1 NaN NaN 0 NaN 1
1950 HarvardX/PH278x/2013_Spring MHxPC130365176 1 0 0 0 France NaN NaN NaN 0 2012-12-21 2013-03-06 1 1 NaN NaN 0 NaN NaN
2524 HarvardX/CS50x/2012 MHxPC130304462 1 1 0 0 France NaN NaN NaN 0 2012-11-10 NaN NaN NaN NaN 1 0 NaN 1
2745 HarvardX/CS50x/2012 MHxPC130509069 1 1 1 0 France NaN NaN NaN 0 2012-08-19 2013-03-24 32 2 NaN 12 0 NaN NaN
2810 HarvardX/CS50x/2012 MHxPC130475365 1 0 0 0 France NaN NaN NaN 0.0 2012-09-05 NaN NaN NaN NaN NaN 0 NaN NaN
2926 HarvardX/PH207x/2012_Fall MHxPC130597633 1 1 0 0 France NaN NaN NaN 0.11 2012-10-17 2012-11-05 1465 7 202 4 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
637976 MITx/14.73x/2013_Spring MHxPC130061822 1 1 0 0 France Master's 1988 f NaN 2013-08-25 2013-08-25 43 1 6 2 0 NaN NaN
638023 MITx/14.73x/2013_Spring MHxPC130125559 1 0 0 0 France Master's 1990 f NaN 2013-08-25 NaN NaN NaN NaN NaN 0 NaN NaN
638069 MITx/8.02x/2013_Spring MHxPC130285866 1 0 0 0 France NaN NaN m NaN 2013-08-25 2013-08-25 1 1 NaN NaN 0 NaN NaN
638158 MITx/14.73x/2013_Spring MHxPC130260318 1 0 0 0 France Master's 1981 f NaN 2013-08-26 NaN NaN NaN NaN NaN 0 NaN NaN
638426 MITx/6.00x/2013_Spring MHxPC130491596 1 1 1 0 France Master's 1987 m NaN 2013-08-27 2013-09-07 2670 11 264 12 0 NaN NaN
638469 MITx/6.00x/2013_Spring MHxPC130564866 1 1 0 0 France Bachelor's 1984 m NaN 2013-08-27 2013-08-28 9 2 1 1 0 NaN NaN
638570 MITx/6.00x/2013_Spring MHxPC130169377 1 1 0 0 France NaN NaN NaN NaN 2013-08-28 2013-08-30 443 3 58 3 0 NaN NaN
638668 MITx/14.73x/2013_Spring MHxPC130027500 1 1 0 0 France Master's 1975 f NaN 2013-08-28 2013-09-04 641 4 61 3 0 NaN NaN
638681 MITx/14.73x/2013_Spring MHxPC130529177 1 0 0 0 France Master's 1988 m NaN 2013-08-28 NaN NaN NaN NaN 3 0 NaN 1
638713 MITx/6.002x/2013_Spring MHxPC130484595 1 1 0 0 France NaN NaN NaN NaN 2013-08-28 2013-08-28 15 1 7 2 0 NaN NaN
638814 MITx/6.00x/2013_Spring MHxPC130144367 1 0 0 0 France Master's 1990 m NaN 2013-08-28 NaN NaN NaN NaN NaN 0 NaN NaN
638825 MITx/14.73x/2013_Spring MHxPC130430070 1 0 0 0 France NaN NaN NaN NaN 2013-08-28 2013-08-28 1 1 NaN NaN 0 NaN NaN
638835 MITx/6.00x/2013_Spring MHxPC130138357 1 1 0 0 France Master's 1987 m NaN 2013-09-01 2013-09-01 681 1 261 5 0 NaN NaN
638959 MITx/14.73x/2013_Spring MHxPC130128755 1 0 0 0 France Master's 1988 f NaN 2013-08-29 2013-08-29 3 1 NaN NaN 0 NaN NaN
639270 MITx/6.00x/2013_Spring MHxPC130520698 1 1 0 0 France Master's 1991 m NaN 2013-09-03 2013-09-06 26 4 5 1 0 NaN NaN
639317 MITx/6.00x/2013_Spring MHxPC130301343 1 1 0 0 France Master's 1984 f NaN 2013-08-30 2013-08-30 58 1 9 5 0 NaN NaN
639722 MITx/6.002x/2013_Spring MHxPC130413420 1 1 0 0 France Bachelor's 1990 m NaN 2013-09-01 2013-09-02 25 2 2 2 0 NaN NaN
639723 MITx/6.00x/2013_Spring MHxPC130413420 1 1 0 0 France Bachelor's 1990 m NaN 2013-09-01 2013-09-01 23 1 2 2 0 NaN NaN
639807 MITx/6.00x/2013_Spring MHxPC130045331 1 1 0 0 France Bachelor's 1990 m NaN 2013-09-01 2013-09-01 40 1 13 2 0 NaN NaN
640128 MITx/8.MReV/2013_Summer MHxPC130373510 1 1 0 0 France NaN NaN NaN 0 2013-09-03 2013-09-03 4 1 NaN 1 0 NaN NaN
640208 MITx/6.00x/2013_Spring MHxPC130000556 1 1 0 0 France Master's 1989 m NaN 2013-09-03 2013-09-03 19 1 NaN 3 0 NaN NaN
640249 MITx/6.00x/2013_Spring MHxPC130323078 1 0 0 0 France Bachelor's 1989 m NaN 2013-09-03 NaN NaN NaN NaN 3 0 NaN 1
640463 MITx/14.73x/2013_Spring MHxPC130393176 1 1 0 0 France Master's 1983 m NaN 2013-09-04 2013-09-04 7 1 1 1 0 NaN NaN
640487 MITx/14.73x/2013_Spring MHxPC130280987 1 0 0 0 France Master's 1986 f NaN 2013-09-04 2013-09-04 1 1 NaN NaN 0 NaN NaN
640506 MITx/14.73x/2013_Spring MHxPC130447919 1 0 0 0 France Secondary 1990 m NaN 2013-09-04 2013-09-04 1 1 NaN NaN 0 NaN NaN
640579 MITx/6.00x/2013_Spring MHxPC130195185 1 1 0 0 France Master's 1972 m NaN 2013-09-04 2013-09-05 1120 2 111 3 0 NaN NaN
640646 MITx/6.00x/2013_Spring MHxPC130546344 1 1 1 0 France Master's 1980 m NaN 2013-09-05 2013-09-07 930 3 57 13 0 NaN NaN
640655 MITx/6.00x/2013_Spring MHxPC130068634 1 0 0 0 France Master's 1990 m NaN 2013-09-05 2013-09-05 1 1 NaN NaN 0 NaN NaN
640898 MITx/14.73x/2013_Spring MHxPC130556151 1 1 0 0 France Bachelor's 1990 f NaN 2013-09-07 2013-09-07 40 1 5 1 0 NaN NaN
640935 MITx/6.00x/2013_Spring MHxPC130405848 1 1 0 0 France Bachelor's 1988 m NaN 2013-09-06 2013-09-06 69 1 7 4 0 NaN NaN

4700 rows × 20 columns


Part II: Movie ratings and recommender engines

Election Mining

Campaigns are moving away from the meaningless labels of pollsters and newsweeklies — “Nascar dads” and “waitress moms” — and moving toward treating each voter as a separate person. In 2012 you didn’t just have to be an African-American from Akron or a suburban married female age 45 to 54. More and more, the information age allows people to be complicated, contradictory and unique. New technologies and an abundance of data may rattle the senses, but they are also bringing a fresh appreciation of the value of the individual to American politics.

- Ethan Roeder, “I Am Not Big Brother,” http://www.nytimes.com/2012/12/06/opinion/i-am-not-big-brother.html?_r=0
In [ ]:
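# u.item is pipe-delimited with no header row, so we supply the column names ourselves.
# The file is Latin-1 encoded; under Python 3 the default UTF-8 decoding fails on some titles.
films = pd.read_csv(
    './ml-100k/u.item',
    sep="|",
    encoding="latin-1",
    names=["movie id", "movie_title", "release_date", "video_release_date", "IMDb_URL",
           "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime",
           "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
           "Romance", "Sci-Fi", "Thriller", "War", "Western"],
)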
In [ ]:
# u.user is also pipe-delimited with no header row; index the rows by user_id
users = pd.read_csv('./ml-100k/u.user', sep="|", names=["user_id", "age", "gender", "occupation", "zip_code"], index_col="user_id")
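To see what the `sep`, `names`, and `index_col` arguments do without needing the downloaded files, here is a minimal sketch that parses a few rows from an in-memory string. The three sample rows are invented for illustration and are not taken from u.user. The `dtype` argument is an extra assumption that is not in the call above; it is one way to keep zip codes as text so that a leading zero survives.

```python
import io

import pandas as pd

# Hypothetical rows in the same pipe-delimited layout as u.user
# (invented for illustration, not real records from the dataset).
sample = io.StringIO(
    "1|30|F|engineer|10001\n"
    "2|45|M|artist|02139\n"
    "3|22|F|student|94110\n"
)

sample_users = pd.read_csv(
    sample,
    sep="|",                       # fields are separated by pipes, not commas
    names=["user_id", "age", "gender", "occupation", "zip_code"],  # no header row in the data
    index_col="user_id",           # use user_id as the row index
    dtype={"zip_code": str},       # assumption: treat zip codes as text, not numbers
)

print(sample_users)
print(sample_users.dtypes)
```

Printing `dtypes` right after loading is a quick way to check that each column was parsed the way we expect before we start computing on it.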
In [ ]: