Number munging: vectors, Pandas, probabilities

In [1]:
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
In [2]:
# This presumes you have the two data sets locally

You can find the dataset and documentation at http://harvardx.harvard.edu/dataset


Our ritual: Exploratory data analysis

Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

. . . proceeding via a ‘dustbowl’ empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.

- Einhorn, “Alchemy in the Behavioral Sciences,” 1972

. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.

- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"

From data to databases to data mining

  • move from accessing and manipulating data to performing ever more complicated queries on our data

Pandas: the first-line Python tool for EDA

  • rich data structures
  • powerful ways to slice, dice, reformat, fix, and eliminate data
    • a taste of what it can do
  • rich queries, like databases

Pandas: charismatic megafauna

In [3]:
import pandas as pd

Series

A toy example: average house prices by year, adjusted by the CPI.

If we want to understand the fluctuations in house prices over time, we can use a new datatype called a Series.

In [4]:
CPI={"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957} #http://www.bls.gov/cpi/home.htm

The Series constructor converts this dict into a pandas Series:

In [5]:
CPI_series=pd.Series(CPI)
In [6]:
CPI_series
Out[6]:
2010    218.056
2011    224.939
2012    229.594
2013    232.957
dtype: float64
In [7]:
House_sale_mean={"2010": 100000, "2011": 100000, "2012": 100000, "2013": 100000}
In [8]:
House_sale_series=pd.Series(House_sale_mean)
In [9]:
House_sale_series
Out[9]:
2010    100000
2011    100000
2012    100000
2013    100000
dtype: int64

The CPI provides "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." A higher number means it costs more to buy the same goods. It was set to 100 in 1982-4.

We can thus use it to measure the effects of inflation on the value of houses in our toy example.

In [10]:
(House_sale_series/CPI_series)
Out[10]:
2010    458.597791
2011    444.564971
2012    435.551452
2013    429.263770
dtype: float64

What's going on here: pandas has taken the two series and divided the values that share the same index label.
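A toy sketch of that alignment (the numbers and labels here are invented): pandas matches index labels, not positions, and any label without a partner in the other series comes back as NaN.

```python
import pandas as pd

# Two series whose indices only partly overlap
a = pd.Series({"2010": 10.0, "2011": 20.0})
b = pd.Series({"2011": 2.0, "2012": 4.0})

# Division aligns on index labels: only "2011" appears in both
ratio = a / b
```

Only `ratio["2011"]` gets a real value (20.0 / 2.0); "2010" and "2012" have no partner, so they come out NaN.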

In [11]:
#If we multiply each one by 100, we'll get the value of our houses in 1982-4 dollars.
(House_sale_series/CPI_series)*100
Out[11]:
2010    45859.779139
2011    44456.497095
2012    43555.145169
2013    42926.376971
dtype: float64
In [12]:
inflation_adjusted=(House_sale_series/CPI_series)*100
In [13]:
#can perform calculations on individual 
inflation_adjusted['2013']/inflation_adjusted['2010']
Out[13]:
0.93603540567572574
In [14]:
#plotting is easy as pie
inflation_adjusted.plot()
Out[14]:
<matplotlib.axes.AxesSubplot at 0x7ff736b76a50>
In [15]:
(House_sale_series/CPI_series).plot(title="Sorry, kids. Blame X, where X is current politician we don't like.")
Out[15]:
<matplotlib.axes.AxesSubplot at 0x7ff736b76f50>

What's super deceptive about this plot?

Dataframes

The pandas library provides a powerful overlay that lets us work with matrices while always keeping their row and column names: a spreadsheet on speed. Its "DataFrame" datatype keeps track of values and their labels for us, and it lets us perform many operations on slices of the dataframe without writing for loops and the like. This is both more convenient and faster.
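A minimal sketch of that idea before we load the real data (the numbers here are the toy house-price figures from above): a DataFrame carries labeled rows and columns, and whole-column arithmetic needs no loop.

```python
import pandas as pd

# Toy data: rows labeled by year, columns by name
prices = pd.DataFrame(
    {"house": [100000, 100000], "cpi": [218.056, 224.939]},
    index=["2010", "2011"],
)

# One vectorized expression instead of a for loop over rows
prices["adjusted"] = prices["house"] / prices["cpi"] * 100
```

The new "adjusted" column is computed for every row at once, and the year labels come along for free.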

In [16]:
#Let's start with yet another way to read csv files, this time from `pandas`
df=pd.read_csv('./HMXPC13_DI_v2_5-14-14.csv', sep=",")
In [17]:
#take a look--it's a biggie
df
Out[17]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
0 HarvardX/CB22x/2013_Spring MHxPC130442623 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 2013-11-17 NaN 9 NaN NaN 0 NaN 1
1 HarvardX/CS50x/2012 MHxPC130442623 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 9 NaN 1 0 NaN 1
2 HarvardX/CB22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 2013-11-17 NaN 16 NaN NaN 0 NaN 1
3 HarvardX/CS50x/2012 MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-09-17 NaN NaN 16 NaN NaN 0 NaN 1
4 HarvardX/ER22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 NaN NaN 16 NaN NaN 0 NaN 1
5 HarvardX/PH207x/2012_Fall MHxPC130275857 1 1 1 0 United States NaN NaN NaN 0 2012-09-17 2013-05-23 502 16 50 12 0 NaN NaN
6 HarvardX/PH278x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 NaN NaN 16 NaN NaN 0 NaN 1
7 HarvardX/CB22x/2013_Spring MHxPC130539455 1 1 0 0 France NaN NaN NaN 0 2013-01-01 2013-05-14 42 6 NaN 3 0 NaN NaN
8 HarvardX/CB22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-18 2013-03-17 70 3 NaN 3 0 NaN NaN
9 HarvardX/CS50x/2012 MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2012-10-20 NaN NaN 12 NaN 3 0 NaN 1
10 HarvardX/ER22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-23 2013-06-14 17 2 NaN 2 0 NaN NaN
11 HarvardX/ER22x/2013_Spring MHxPC130198098 1 1 0 0 United States NaN NaN NaN 0 2013-06-17 2013-06-17 32 1 NaN 3 0 NaN NaN
12 HarvardX/CB22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0.07 2013-01-24 2013-08-03 175 9 NaN 7 0 NaN NaN
13 HarvardX/CS50x/2012 MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-06-27 NaN NaN 2 NaN 2 0 NaN 1
14 HarvardX/ER22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-12-19 2013-08-17 78 5 NaN 4 0 NaN NaN
15 HarvardX/PH207x/2012_Fall MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-07-26 2013-01-16 75 14 5 2 0 NaN NaN
16 HarvardX/PH278x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-07-30 2013-08-27 11 2 2 1 0 NaN NaN
17 HarvardX/CS50x/2012 MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 11 NaN 1 0 NaN 1
18 HarvardX/PH207x/2012_Fall MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-25 2012-12-04 56 11 1 2 1 NaN NaN
19 HarvardX/CS50x/2012 MHxPC130063375 1 1 0 0 Unknown/Other NaN NaN NaN 0 2012-10-19 NaN NaN NaN NaN 1 0 NaN 1
20 HarvardX/CS50x/2012 MHxPC130094371 1 1 0 0 United States NaN NaN NaN 0 2013-03-03 2013-03-03 7 1 NaN 2 0 NaN NaN
21 HarvardX/CS50x/2012 MHxPC130229084 1 1 0 0 Mexico NaN NaN NaN 0 2012-10-15 NaN NaN NaN NaN 1 0 NaN 1
22 HarvardX/CS50x/2012 MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-10-24 NaN NaN 2 NaN 1 0 NaN 1
23 HarvardX/ER22x/2013_Spring MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-12-20 2013-05-18 15 2 NaN 2 0 NaN NaN
24 HarvardX/CS50x/2012 MHxPC130417650 1 1 0 0 Australia NaN NaN NaN 0 2012-10-29 2013-03-04 1 1 NaN 2 0 NaN NaN
25 HarvardX/CS50x/2012 MHxPC130506580 1 0 0 0 United States NaN NaN NaN 0 2012-09-04 NaN NaN NaN NaN NaN 0 NaN NaN
26 HarvardX/CS50x/2012 MHxPC130298257 1 0 0 0 United States NaN NaN NaN 0 2012-09-05 NaN NaN NaN NaN 3 0 NaN 1
27 HarvardX/CS50x/2012 MHxPC130500569 1 1 0 0 United States NaN NaN NaN 0 2012-10-22 2013-03-30 6 1 NaN 5 0 NaN NaN
28 HarvardX/CS50x/2012 MHxPC130466479 1 1 0 0 Unknown/Other NaN NaN NaN 0 2013-01-07 NaN NaN NaN NaN 1 0 NaN 1
29 HarvardX/CB22x/2013_Spring MHxPC130340959 1 1 0 0 United States NaN NaN NaN 0.05 2013-02-11 2013-04-06 285 8 NaN 4 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
641108 MITx/6.002x/2013_Spring MHxPC130140735 1 1 0 0 United States Bachelor's 1991 m NaN 2013-09-07 2013-09-07 59 1 5 3 0 NaN NaN
641109 MITx/6.00x/2013_Spring MHxPC130493130 1 0 0 0 United Kingdom Master's 1977 m NaN 2013-09-07 NaN NaN NaN NaN 2 0 NaN 1
641110 MITx/6.00x/2013_Spring MHxPC130400592 1 1 0 0 Other Europe Secondary 1992 m NaN 2013-09-07 2013-09-07 395 1 51 4 0 NaN NaN
641111 MITx/6.00x/2013_Spring MHxPC130109892 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 49 1 14 2 0 NaN NaN
641112 MITx/14.73x/2013_Spring MHxPC130183007 1 0 0 0 India Master's 1985 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641113 MITx/8.MReV/2013_Summer MHxPC130261281 1 1 0 0 India Secondary 1994 m 0 2013-09-07 2013-09-07 8 1 NaN 1 0 NaN NaN
641114 MITx/6.00x/2013_Spring MHxPC130481990 1 1 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 22 1 5 1 0 NaN NaN
641115 MITx/6.00x/2013_Spring MHxPC130528581 1 0 0 0 United States Bachelor's 1990 f NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641116 MITx/14.73x/2013_Spring MHxPC130555418 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641117 MITx/6.002x/2013_Spring MHxPC130408810 1 0 0 0 India Secondary 1993 m NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641118 MITx/6.00x/2013_Spring MHxPC130040184 1 0 0 0 United States Secondary 1991 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641119 MITx/6.002x/2013_Spring MHxPC130566049 1 0 0 0 Other Europe Master's 1982 m NaN 2013-09-07 2013-09-07 2 1 NaN 2 0 NaN NaN
641120 MITx/8.MReV/2013_Summer MHxPC130374105 1 1 0 0 India Bachelor's 1992 m 0 2013-09-07 2013-09-07 49 1 NaN 1 0 NaN NaN
641121 MITx/6.00x/2013_Spring MHxPC130282999 1 0 0 0 Other Europe Master's 1979 m NaN 2013-09-07 NaN NaN NaN NaN 7 0 NaN 1
641122 MITx/8.MReV/2013_Summer MHxPC130556398 1 0 0 0 India Bachelor's 1985 m 0 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641123 MITx/6.00x/2013_Spring MHxPC130573334 1 0 0 0 Spain Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641124 MITx/6.00x/2013_Spring MHxPC130505931 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 59 1 NaN 2 0 NaN NaN
641125 MITx/6.002x/2013_Spring MHxPC130280976 1 0 0 0 United States Bachelor's NaN m NaN 2013-09-07 2013-09-07 2 1 NaN NaN 0 NaN NaN
641126 MITx/6.00x/2013_Spring MHxPC130137331 1 1 0 0 United States Secondary 1992 m NaN 2013-09-07 2013-09-07 251 1 77 4 0 NaN NaN
641127 MITx/6.002x/2013_Spring MHxPC130271624 1 0 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641128 MITx/14.73x/2013_Spring MHxPC130256541 1 1 0 0 United States Master's 1982 m NaN 2013-09-07 2013-09-07 51 1 1 1 0 NaN NaN
641129 MITx/6.00x/2013_Spring MHxPC130021638 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641130 MITx/14.73x/2013_Spring MHxPC130591057 1 0 0 0 Canada Bachelor's NaN f NaN 2013-09-07 2013-09-07 6 1 NaN NaN 0 NaN NaN
641131 MITx/8.02x/2013_Spring MHxPC130226305 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 2013-09-07 11 1 NaN 2 0 NaN NaN
641132 MITx/6.002x/2013_Spring MHxPC130030805 1 1 0 0 Pakistan Master's 1989 m NaN 2013-09-07 2013-09-07 29 1 NaN 1 0 NaN NaN
641133 MITx/6.00x/2013_Spring MHxPC130184108 1 1 0 0 Canada Bachelor's 1991 m NaN 2013-09-07 2013-09-07 97 1 4 2 0 NaN NaN
641134 MITx/6.00x/2013_Spring MHxPC130359782 1 0 0 0 Other Europe Bachelor's 1991 f NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641135 MITx/6.002x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641136 MITx/6.00x/2013_Spring MHxPC130098513 1 1 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 74 1 14 1 0 NaN NaN
641137 MITx/8.02x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 NaN NaN 1 NaN NaN 0 NaN 1

641138 rows × 20 columns

Note at the bottom that the display tells us how many rows and columns we're dealing with.

As a general rule, pandas dataframe objects default to slicing by column, using a syntax you'll know from dicts, as in df["course_id"].

In [18]:
df["course_id"]
Out[18]:
0      HarvardX/CB22x/2013_Spring
1             HarvardX/CS50x/2012
2      HarvardX/CB22x/2013_Spring
3             HarvardX/CS50x/2012
4      HarvardX/ER22x/2013_Spring
5       HarvardX/PH207x/2012_Fall
6     HarvardX/PH278x/2013_Spring
7      HarvardX/CB22x/2013_Spring
8      HarvardX/CB22x/2013_Spring
9             HarvardX/CS50x/2012
10     HarvardX/ER22x/2013_Spring
11     HarvardX/ER22x/2013_Spring
12     HarvardX/CB22x/2013_Spring
13            HarvardX/CS50x/2012
14     HarvardX/ER22x/2013_Spring
...
641123     MITx/6.00x/2013_Spring
641124     MITx/6.00x/2013_Spring
641125    MITx/6.002x/2013_Spring
641126     MITx/6.00x/2013_Spring
641127    MITx/6.002x/2013_Spring
641128    MITx/14.73x/2013_Spring
641129     MITx/6.00x/2013_Spring
641130    MITx/14.73x/2013_Spring
641131     MITx/8.02x/2013_Spring
641132    MITx/6.002x/2013_Spring
641133     MITx/6.00x/2013_Spring
641134     MITx/6.00x/2013_Spring
641135    MITx/6.002x/2013_Spring
641136     MITx/6.00x/2013_Spring
641137     MITx/8.02x/2013_Spring
Name: course_id, Length: 641138, dtype: object

We can select some subset of rows as you might expect:

In [19]:
df["course_id"][3340:3350]
Out[19]:
3340            HarvardX/CS50x/2012
3341     HarvardX/ER22x/2013_Spring
3342    HarvardX/PH278x/2013_Spring
3343            HarvardX/CS50x/2012
3344            HarvardX/CS50x/2012
3345     HarvardX/ER22x/2013_Spring
3346            HarvardX/CS50x/2012
3347     HarvardX/CB22x/2013_Spring
3348            HarvardX/CS50x/2012
3349            HarvardX/CS50x/2012
Name: course_id, dtype: object
In [20]:
df[3340:3350]
Out[20]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
3340 HarvardX/CS50x/2012 MHxPC130386705 1 1 0 0 Russian Federation NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN 2 0 NaN 1
3341 HarvardX/ER22x/2013_Spring MHxPC130432757 1 1 0 0 United States NaN NaN NaN NaN 2013-09-05 2013-09-05 16 1 NaN 2 0 NaN NaN
3342 HarvardX/PH278x/2013_Spring MHxPC130432757 1 0 0 0 United States NaN NaN NaN NaN 2012-12-25 NaN NaN 1 NaN NaN 0 NaN 1
3343 HarvardX/CS50x/2012 MHxPC130382204 1 1 0 0 Ukraine NaN NaN NaN 0 2012-11-30 NaN NaN NaN NaN 5 0 NaN 1
3344 HarvardX/CS50x/2012 MHxPC130142047 1 1 0 0 Spain NaN NaN NaN 0.0 2013-07-12 2013-07-12 8 1 NaN 1 0 NaN NaN
3345 HarvardX/ER22x/2013_Spring MHxPC130191600 1 0 0 0 India NaN NaN NaN 0 2012-12-23 NaN NaN NaN NaN NaN 0 NaN NaN
3346 HarvardX/CS50x/2012 MHxPC130079233 1 0 0 0 United States NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
3347 HarvardX/CB22x/2013_Spring MHxPC130277592 1 1 0 0 United States NaN NaN NaN 0.04 2013-01-23 2013-04-04 333 8 NaN 4 0 NaN NaN
3348 HarvardX/CS50x/2012 MHxPC130429812 1 1 0 0 United Kingdom NaN NaN NaN 0 2012-08-18 NaN NaN NaN NaN 2 0 NaN 1
3349 HarvardX/CS50x/2012 MHxPC130503405 1 0 0 0 Russian Federation NaN NaN NaN 0.0 2012-09-03 NaN NaN NaN NaN NaN 0 NaN NaN
In [21]:
df[666]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-342ddc6d1cab> in <module>()
----> 1 df[666]

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1682             return self._getitem_multilevel(key)
   1683         else:
-> 1684             return self._getitem_column(key)
   1685 
   1686     def _getitem_column(self, key):

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1689         # get column
   1690         if self.columns.is_unique:
-> 1691             return self._get_item_cache(key)
   1692 
   1693         # duplicate columns & possible reduce dimensionaility

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1050         res = cache.get(item)
   1051         if res is None:
-> 1052             values = self._data.get(item)
   1053             res = self._box_item_values(item, values)
   1054             cache[item] = res

/usr/local/lib/python2.7/dist-packages/pandas/core/internals.pyc in get(self, item)
   2535 
   2536             if not isnull(item):
-> 2537                 loc = self.items.get_loc(item)
   2538             else:
   2539                 indexer = np.arange(len(self.items))[isnull(self.items)]

/usr/local/lib/python2.7/dist-packages/pandas/core/index.pyc in get_loc(self, key)
   1154         loc : int if unique index, possibly slice or mask if not
   1155         """
-> 1156         return self._engine.get_loc(_values_from_object(key))
   1157 
   1158     def get_value(self, series, key):

/usr/local/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3353)()

/usr/local/lib/python2.7/dist-packages/pandas/index.so in pandas.index.IndexEngine.get_loc (pandas/index.c:3233)()

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11148)()

/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:11101)()

KeyError: 666

Huh? I have row 666! pandas doesn't let you select a single row directly. Instead, use the .ix method.

In [22]:
df.ix[666]
Out[22]:
course_id            HarvardX/CS50x/2012
userid_DI                 MHxPC130297337
registered                             1
viewed                                 0
explored                               0
certified                              0
final_cc_cname_DI         United Kingdom
LoE_DI                               NaN
YoB                                  NaN
gender                               NaN
grade                                  0
start_time_DI                 2012-08-17
last_event_DI                        NaN
nevents                              NaN
ndays_act                            NaN
nplay_video                          NaN
nchapters                            NaN
nforum_posts                           0
roles                                NaN
incomplete_flag                      NaN
Name: 666, dtype: object

Why? A good question. Now try passing a list:

In [23]:
df.ix[[666]]
Out[23]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
666 HarvardX/CS50x/2012 MHxPC130297337 1 0 0 0 United Kingdom NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
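(An aside for anyone running a newer pandas than this notebook's vintage: .ix was later deprecated and removed. The same lookups can be sketched with .loc for labels and .iloc for positions; the toy frame here is invented.)

```python
import pandas as pd

toy = pd.DataFrame({"x": [10, 20, 30]}, index=[664, 665, 666])

row = toy.loc[666]      # select by index label -> a Series
frame = toy.loc[[666]]  # pass a list -> a one-row DataFrame
third = toy.iloc[2]     # select by integer position
```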

What types of data did pandas import from the csv?

In [24]:
df.dtypes
Out[24]:
course_id             object
userid_DI             object
registered             int64
viewed                 int64
explored               int64
certified              int64
final_cc_cname_DI     object
LoE_DI                object
YoB                  float64
gender                object
grade                 object
start_time_DI         object
last_event_DI         object
nevents              float64
ndays_act            float64
nplay_video          float64
nchapters            float64
nforum_posts           int64
roles                float64
incomplete_flag      float64
dtype: object

When importing a CSV, pandas parses each column and attempts to discern what sort of data is within. It's good, but not infallible.

  • Pandas is particularly good with dates: you simply tell it which columns to parse as dates. Let's refine our reading of the CSV to parse the dates.
In [25]:
df=pd.read_csv('HMXPC13_DI_v2_5-14-14.csv', sep="," , parse_dates=['start_time_DI', 'last_event_DI'])

We pass a list of the columns to consider as dates.

In [26]:
df.dtypes
Out[26]:
course_id                    object
userid_DI                    object
registered                    int64
viewed                        int64
explored                      int64
certified                     int64
final_cc_cname_DI            object
LoE_DI                       object
YoB                         float64
gender                       object
grade                        object
start_time_DI        datetime64[ns]
last_event_DI        datetime64[ns]
nevents                     float64
ndays_act                   float64
nplay_video                 float64
nchapters                   float64
nforum_posts                  int64
roles                       float64
incomplete_flag             float64
dtype: object

We don't need all those fields. We can cull them by name using the .drop method.

In [27]:
df=df.drop(['userid_DI', 'roles', 'incomplete_flag', 'nforum_posts'], axis=1) # axis=1 means the column names, not the rows

We can pick out columns using their names--just like with dicts:

In [28]:
df['final_cc_cname_DI'][100:110]
Out[28]:
100         United States
101         United States
102         United States
103         United States
104    Russian Federation
105    Russian Federation
106         United States
107         United States
108         United States
109         United States
Name: final_cc_cname_DI, dtype: object

We can easily produce new dataframes that retain the relevant indices.

In [29]:
df[['gender', 'ndays_act','nplay_video']][1781:1787] 
#note the double [[ ]]--you're providing a list [x, y, ...] of the columns you want
Out[29]:
gender ndays_act nplay_video
1781 NaN NaN NaN
1782 NaN 1 NaN
1783 NaN 2 NaN
1784 NaN 3 NaN
1785 NaN 1 NaN
1786 NaN 1 NaN

"NaN" is not a English grandmother. It is an unreported or empty value.

Often we might treat it as a zero, but we need to be careful.

Would it matter in the case of nplay_video?

How about gender?
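Whether NaN should count as zero changes the answers we get. A toy sketch (the numbers are invented): treating missing nplay_video values as 0 drags the mean down, while dropping them leaves them out of the denominator entirely.

```python
import pandas as pd
import numpy as np

plays = pd.Series([10, np.nan, 20, np.nan])

mean_dropped = plays.mean()           # NaNs ignored: (10 + 20) / 2
mean_zeroed = plays.fillna(0).mean()  # NaNs as 0: (10 + 0 + 20 + 0) / 4
```

For nplay_video, zero might be a defensible reading (no recorded plays); for gender, filling in any value at all would be fabrication.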

Vectors, matrices, and functions, oh my!

You've been writing functions and methods for the past couple of weeks. You'll recall functions like $f(x)$ and $g(x)$ from your previous math training.

Functions can be defined on vector and matrix quantities.

For our example, just consider the vector from the intersection of the x and y axes to a point (a,b), which we might write $\bar{v}=[a,b]$.

Some functions take a vector and return a single number, called a scalar. For example, if we have the point (2,2) on the x and y axes, the length of the vector [2,2] from the origin to that point can be found with a standard distance function

$d([x,y])=|\sqrt{x^2+y^2}|$

$d([2,2])=|\sqrt{2^2+2^2}|=|\sqrt{8}|$

Or, more simply, we might compute the mean of the elements of a vector.

Some functions take a vector and return the same kind of vector.

$f([x,y])=[2x,3y]$

$f([1,2])=[2,6]$

For our purposes

  1. functions drawing upon all the parts of a vector or matrix (e.g., a mean)
  2. functions applied to each element individually
  3. functions applied to a particular or systematic slice of the data
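Those three cases, sketched with NumPy (the numbers are arbitrary):

```python
import numpy as np

v = np.array([2.0, 2.0])

# 1. A function of the whole vector -> a scalar: the length, sqrt(2^2 + 2^2)
length = np.sqrt((v ** 2).sum())

# 2. A function applied to each element: f([x, y]) = [2x, 3y]
stretched = v * np.array([2, 3])

# 3. A function applied to a slice: only the first element
first_scaled = v[:1] * 10
```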

It's pretty easy for us to think of vectors in 2 or 3 dimensions. In much data analysis, we'll use many more dimensions. Consider just one of our rows in our dataframe today:


Pandas makes it supremely easy to plot time series

  • we had it parse the dates in two columns above
In [30]:
df['start_time_DI'] # This produces a series of start times
Out[30]:
0    2012-12-19
1    2012-10-15
2    2013-02-08
3    2012-09-17
4    2012-12-19
5    2012-09-17
6    2013-02-08
7    2013-01-01
8    2013-02-18
9    2012-10-20
10   2013-02-23
11   2013-06-17
12   2013-01-24
13   2013-06-27
14   2012-12-19
...
641123   2013-09-07
641124   2013-09-07
641125   2013-09-07
641126   2013-09-07
641127   2013-09-07
641128   2013-09-07
641129   2013-09-07
641130   2013-09-07
641131   2013-09-07
641132   2013-09-07
641133   2013-09-07
641134   2013-09-07
641135   2013-09-07
641136   2013-09-07
641137   2013-09-07
Name: start_time_DI, Length: 641138, dtype: datetime64[ns]

What if we wanted to figure out how many people started each day? We could run a for loop and compute it ourselves.

But pandas has this covered.

We use the .value_counts() method. It counts the number of times each value occurs in a series.
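On a toy series (the values are invented), .value_counts() returns a new series whose index is the distinct values and whose values are the counts, sorted most-frequent-first by default:

```python
import pandas as pd

days = pd.Series(["mon", "tue", "mon", "wed", "mon", "tue"])
counts = days.value_counts()  # mon: 3, tue: 2, wed: 1
```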

In [31]:
startdates=df['start_time_DI'].value_counts()
In [32]:
startdates
Out[32]:
2012-08-17    10165
2013-01-23     8368
2012-10-15     6766
2012-08-16     6369
2012-12-20     5858
2013-02-14     5810
2012-12-21     5809
2012-08-18     5531
2012-08-13     5247
2013-03-03     5053
2012-10-16     4639
2012-07-24     4635
2013-02-15     4436
2013-01-22     4263
2012-08-20     4107
...
2013-07-15    396
2013-07-18    390
2013-07-10    386
2013-07-20    378
2013-07-09    374
2013-07-08    365
2013-07-04    357
2013-07-12    334
2013-07-05    307
2013-07-14    279
2013-07-13    275
2013-07-06    274
2013-07-07    273
2012-07-23      5
2013-09-08      1
Length: 413
In [33]:
startdates.plot()
Out[33]:
<matplotlib.axes.AxesSubplot at 0x7ff72dce9350>

How much video did each person watch?

In [34]:
df['nplay_video'].dropna().plot()
Out[34]:
<matplotlib.axes.AxesSubplot at 0x7ff72dbb63d0>

Who are the frickin' overachievers? We'll investigate in just a sec.

How about: how many people enrolled per country?

work with your BFF for 2 minutes to figure out
1. how to pick out the field including country
2. how to count them
In [35]:
country=df['final_cc_cname_DI'].value_counts()
In [36]:
country
Out[36]:
United States                             184240
India                                      88696
Unknown/Other                              82029
Other Europe                               40377
Other Africa                               23897
United Kingdom                             22131
Brazil                                     17856
Other Middle East/Central Asia             17325
Other South Asia                           12992
Canada                                     12738
Pakistan                                   10824
Russian Federation                         10432
Spain                                      10003
Other South America                         9916
Egypt                                       9286
Germany                                     8074
Nigeria                                     7483
Other East Asia                             6446
Australia                                   6419
Mexico                                      5638
Philippines                                 5374
Poland                                      5226
China                                       5170
Greece                                      5162
Colombia                                    4803
France                                      4700
Other North & Central Amer., Caribbean      4434
Ukraine                                     4100
Morocco                                     3966
Indonesia                                   3410
Bangladesh                                  3182
Japan                                       2270
Portugal                                    2193
Other Oceania                                346
dtype: int64
In [37]:
country[:15].plot(kind='bar')
Out[37]:
<matplotlib.axes.AxesSubplot at 0x7ff728c11a90>

Selecting data: boolean indexing

Often we want to pick out all the rows that have certain attributes from the entire dataframe: those belonging to French people, or all the self-identified women, or the overachievers.

To do this create a vector where each element indicates whether each row satisfies the condition. It looks like a long list of trues and falses, so it's called a boolean vector.

In [38]:
#pick out all the people who watched over 10000[!] minutes of video
df['nplay_video']>10000
Out[38]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
641123    False
641124    False
641125    False
641126    False
641127    False
641128    False
641129    False
641130    False
641131    False
641132    False
641133    False
641134    False
641135    False
641136    False
641137    False
Name: nplay_video, Length: 641138, dtype: bool
In [39]:
#use that long vector of trues and falses to pick out just those rows
df[df['nplay_video']>10000]
Out[39]:
course_id registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters
83088 HarvardX/PH207x/2012_Fall 1 1 1 1 Other Africa NaN 1986 m 0.93 2012-09-07 2013-01-18 15182 34 10668 15
130094 HarvardX/PH207x/2012_Fall 1 1 1 1 Egypt Bachelor's 1988 f 0.92 2012-10-13 2013-05-19 33820 98 12983 16
136405 HarvardX/PH207x/2012_Fall 1 1 1 1 Portugal Master's 1986 m 0.94 2012-10-15 2013-01-20 37440 71 17314 16
170632 HarvardX/PH207x/2012_Fall 1 1 1 0 India Master's 1970 m 0 2012-11-11 2013-02-15 43880 61 34596 16
184827 HarvardX/PH207x/2012_Fall 1 1 0 0 Other Africa NaN 1986 m 0 2012-12-05 2012-12-26 15079 8 14523 2
245432 HarvardX/PH278x/2013_Spring 1 1 1 0 Other Africa Bachelor's 1982 m NaN 2013-02-16 2013-06-17 11916 12 11491 6
339149 MITx/3.091x/2012_Fall 1 1 1 1 United States NaN NaN NaN 0.9 2012-07-24 2013-07-23 61376 131 19179 16
339150 MITx/8.02x/2013_Spring 1 1 1 0 United States NaN NaN NaN 0.12 2013-01-17 2013-08-11 45660 139 15487 18
353229 MITx/6.00x/2012_Fall 1 1 0 0 Other South America NaN NaN NaN 0 2012-08-31 2012-12-19 197757 19 98517 5
440339 MITx/6.002x/2012_Fall 1 1 1 0 Other Africa Bachelor's 1983 m 0.03 2012-09-12 2012-12-24 28383 43 24564 16
473105 MITx/6.002x/2012_Fall 1 1 1 1 Spain Bachelor's 1993 m 0.71 2012-10-08 2013-01-03 16649 48 11458 15

User 353229 is deeply suspicious!

98,517 video plays but only 5 chapters and 19 days of activity!

Clear as mud? Let's try another example. Can we pick out all the French?

In [40]:
df['final_cc_cname_DI']=="France"
Out[40]:
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
...
641123    False
641124    False
641125    False
641126    False
641127    False
641128    False
641129    False
641130    False
641131    False
641132    False
641133    False
641134    False
641135    False
641136    False
641137    False
Name: final_cc_cname_DI, Length: 641138, dtype: bool

We can put the entire expression within our indices, so we can pick out just the rows involving France with:

In [41]:
france=df[df['final_cc_cname_DI']=="France"]

Stages of boolean indexing

  • select the column whose values you're interested in, e.g. df['nchapters']
  • evaluate it according to some test, e.g. df['nchapters']>0, which produces a boolean series
  • use the boolean series to select the rows from the dataframe
    • df[df['nchapters']>0] results in a dataframe with only the rows that meet the test

Reflect on this confusing mess.
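The three stages can be sketched on a toy dataframe (the data here is made up, but the column name matches the real dataset):

```python
import pandas as pd

# Toy stand-in for the course dataframe
df = pd.DataFrame({'nchapters': [0, 3, 0, 12, 5]})

# Stage 1: select the column
col = df['nchapters']

# Stage 2: evaluate a test, producing a boolean Series
mask = col > 0

# Stage 3: use the boolean Series to select rows
active = df[mask]
print(active)
```

The one-liner `df[df['nchapters']>0]` is just these three stages collapsed into a single expression.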

This pattern is much bigger than Pandas: building and combining boolean masks is a standard piece of good Python.

Let's explore the demographic data a bit more.

In [42]:
years=df['YoB'].value_counts()
In [43]:
years
Out[43]:
1990    38524
1992    37846
1991    37544
1989    36831
1988    34436
1993    33999
1987    31313
1986    28725
1985    25539
1994    23921
1984    22138
1983    18869
1982    16867
1981    14477
1980    13806
...
1939    86
1938    74
1937    64
2013    61
2002    45
1936    43
1935    36
2011    34
2010    17
2003    10
2008    10
2009     8
1931     7
2007     6
1934     5
Length: 78, dtype: int64
In [44]:
years.plot(kind="bar")
Out[44]:
<matplotlib.axes.AxesSubplot at 0x7ff7294fd0d0>

Looks reasonably cool, but wait: the table has the years out of order. We need to tell .value_counts not to sort!

Excuse me while I google how to do that.

In [45]:
years=df['YoB'].value_counts(sort=False)
In [46]:
years
Out[46]:
1931      7
1934      5
1935     36
1936     43
1937     64
1938     74
1939     86
1940     92
1941     96
1942    196
1943    223
1944    221
1945    216
1946    331
1947    460
...
1996    7754
1997    4082
1998    1935
1999     952
2000     340
2001     142
2002      45
2003      10
2007       6
2008      10
2009       8
2010      17
2011      34
2012     472
2013      61
Length: 78, dtype: int64
In [47]:
years.plot(kind="bar", title="Individual course enrollments by birth year")
Out[47]:
<matplotlib.axes.AxesSubplot at 0x7ff729ecb890>
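A caution: the behavior of `value_counts(sort=False)` has varied across pandas versions (in newer releases it preserves order of appearance rather than sorting by index). Calling `.sort_index()` on the counts is the version-proof way to get chronological order for a bar chart. A minimal sketch:

```python
import pandas as pd

# Toy birth years; the real column is df['YoB']
yob = pd.Series([1990, 1985, 1990, 1972, 1985, 1990])

# Count, then sort by the index (the years) rather than by frequency
years = yob.value_counts().sort_index()
print(years)
```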

What does the chart look like for India, rather than everywhere?

In [48]:
india=df['final_cc_cname_DI']=="India"

india_years=df[india]['YoB'].value_counts(sort=False)
In [49]:
india_years.plot(kind='bar', title="Indian enrollments by birth year")
Out[49]:
<matplotlib.axes.AxesSubplot at 0x7ff72d81a550>

Different levels of completion of course:

Documentation explains the different levels of completion of course.

"explored: administrative, 0/1; anyone who accessed at least half of the chapters in the courseware (chapters are the highest level on the “courseware” menu housing course content)."

In [50]:
explored=df[df['explored']==1]
In [51]:
len(explored)
Out[51]:
39686
In [52]:
years_explorers=explored['YoB'].value_counts(sort=False)
In [53]:
years.plot(kind="bar")
Out[53]:
<matplotlib.axes.AxesSubplot at 0x7ff72d64b8d0>
In [54]:
years_explorers.plot(kind='bar', title='Age of those who explored half the chapters')
Out[54]:
<matplotlib.axes.AxesSubplot at 0x7ff72a807090>

Are there geographic differences in the amount of completion of the courses?

In [55]:
france_certified=france[df['certified']!=0]  # mask built from full df, hence the warning; france[france['certified']!=0] avoids it
france_certified[:10]
/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py:1712: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  "DataFrame index.", UserWarning)
Out[55]:
course_id registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters
4111 HarvardX/CS50x/2012 1 1 1 1 France NaN NaN NaN 1 2012-09-26 2013-05-11 115 25 NaN 12
6318 HarvardX/CS50x/2012 1 1 1 1 France NaN NaN NaN 1 2012-07-26 2013-05-10 434 47 NaN 12
8138 HarvardX/CS50x/2012 1 1 1 1 France NaN NaN NaN 1 2012-07-28 2013-05-10 47 26 NaN 12
8343 HarvardX/PH207x/2012_Fall 1 1 1 1 France NaN NaN NaN 0.9 2012-07-25 2013-07-15 5497 53 1165 16
10974 HarvardX/CS50x/2012 1 1 1 1 France NaN NaN NaN 1 2012-07-29 2013-05-24 311 23 NaN 12
13590 HarvardX/CS50x/2012 1 1 1 1 France NaN NaN NaN 1 2012-07-24 2013-05-12 683 50 NaN 12
17323 HarvardX/PH278x/2013_Spring 1 1 1 1 France NaN NaN NaN 0.7 2012-12-25 2013-08-07 1416 29 48 9
53864 HarvardX/CS50x/2012 1 1 1 1 France Bachelor's 1992 m 1 2012-08-17 2013-08-09 342 73 NaN 12
73021 HarvardX/CS50x/2012 1 1 1 1 France Master's 1986 m 1 2012-08-30 2013-05-12 543 41 NaN 12
84967 HarvardX/ER22x/2013_Spring 1 1 1 1 France Bachelor's 1990 f 0.84 2012-12-21 2013-07-31 1198 16 NaN 30
In [56]:
len(france)  #how many?
Out[56]:
4700

What proportion of the French users finish the course by getting certified?

In [57]:
len(france_certified) 
Out[57]:
204
In [58]:
len(france_certified)/len(france)
Out[58]:
0

Zut!

Ça ne marche pas!

In [59]:
float(len(france_certified))/len(france)
Out[59]:
0.04340425531914894

So about 4.3% of self-reported French enrollees earn a certificate.
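The zero above is Python 2's integer division at work: `/` between two ints floors the result, so we had to coerce one side to float. In Python 3, `/` is always true division and `//` is floor division, as this sketch shows:

```python
# Python 3 semantics: / is true division even between ints
certified, enrolled = 204, 4700

rate = certified / enrolled    # true division: ~0.0434
floor = certified // enrolled  # floor division reproduces the Python 2 surprise: 0
print(rate, floor)
```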

We could think about this as a probability.

We could write this:

P(that a French enrollee earns a certificate)=.043

Or,

P(Some person completes a certificate given that the person is French)

If A means "some person completes a certificate" and B means "someone is French", then we write

$P(A|B)=.043$

The vertical bar (|) means "given that."
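In count terms, P(A|B) is the number of people with both A and B, divided by the number of people with B. A sketch on toy data (the column names mirror the real dataset, but the rows are invented):

```python
import pandas as pd

# Toy data: three French enrollees, one of whom certified
df = pd.DataFrame({
    'final_cc_cname_DI': ['France', 'France', 'France', 'India'],
    'certified':         [1,        0,        0,        1],
})

is_french = df['final_cc_cname_DI'] == 'France'    # event B
n_b = is_french.sum()                              # count of B
n_ab = (is_french & (df['certified'] == 1)).sum()  # count of A and B
p_a_given_b = n_ab / n_b
print(p_a_given_b)
```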

Can we figure out the probabilities for each country?

Can we do it without computing each individually as we did for France?

Oui, for pandas will let us do vector operations directly!

For France, we divided the number of certified enrollees by the total number of enrollees.

How can we do an analogous division, explorers over enrollees, for every country all at once?

Let's compute the number of explorers and the number of enrollees for each country.

In [60]:
explored_by_country=explored['final_cc_cname_DI'].value_counts()
In [61]:
explored_by_country
Out[61]:
United States                             10132
India                                      7097
Other Europe                               3973
United Kingdom                             1759
Other Africa                               1621
Spain                                      1415
Russian Federation                         1135
Brazil                                      975
Unknown/Other                               965
Other Middle East/Central Asia              875
Canada                                      874
Other South Asia                            854
Germany                                     823
Poland                                      702
Other South America                         666
Greece                                      518
Ukraine                                     469
Australia                                   443
Nigeria                                     438
Pakistan                                    437
France                                      436
Colombia                                    418
Egypt                                       414
Mexico                                      375
Other North & Central Amer., Caribbean      313
Other East Asia                             304
Portugal                                    225
Indonesia                                   223
Philippines                                 219
China                                       176
Bangladesh                                  143
Morocco                                     132
Japan                                       112
Other Oceania                                25
dtype: int64
In [62]:
enrolled_by_country=df['final_cc_cname_DI'].value_counts()
In [63]:
enrolled_by_country
Out[63]:
United States                             184240
India                                      88696
Unknown/Other                              82029
Other Europe                               40377
Other Africa                               23897
United Kingdom                             22131
Brazil                                     17856
Other Middle East/Central Asia             17325
Other South Asia                           12992
Canada                                     12738
Pakistan                                   10824
Russian Federation                         10432
Spain                                      10003
Other South America                         9916
Egypt                                       9286
Germany                                     8074
Nigeria                                     7483
Other East Asia                             6446
Australia                                   6419
Mexico                                      5638
Philippines                                 5374
Poland                                      5226
China                                       5170
Greece                                      5162
Colombia                                    4803
France                                      4700
Other North & Central Amer., Caribbean      4434
Ukraine                                     4100
Morocco                                     3966
Indonesia                                   3410
Bangladesh                                  3182
Japan                                       2270
Portugal                                    2193
Other Oceania                                346
dtype: int64

And now to divide each country's exploration number by its enrollment number. We could do it with a for loop over all the countries. But we can vectorize.

In [64]:
explored_by_country/enrolled_by_country
Out[64]:
Australia                                 0.069014
Bangladesh                                0.044940
Brazil                                    0.054603
Canada                                    0.068614
China                                     0.034043
Colombia                                  0.087029
Egypt                                     0.044583
France                                    0.092766
Germany                                   0.101932
Greece                                    0.100349
India                                     0.080015
Indonesia                                 0.065396
Japan                                     0.049339
Mexico                                    0.066513
Morocco                                   0.033283
Nigeria                                   0.058533
Other Africa                              0.067833
Other East Asia                           0.047161
Other Europe                              0.098398
Other Middle East/Central Asia            0.050505
Other North & Central Amer., Caribbean    0.070591
Other Oceania                             0.072254
Other South America                       0.067164
Other South Asia                          0.065733
Pakistan                                  0.040373
Philippines                               0.040752
Poland                                    0.134328
Portugal                                  0.102599
Russian Federation                        0.108800
Spain                                     0.141458
Ukraine                                   0.114390
United Kingdom                            0.079481
United States                             0.054993
Unknown/Other                             0.011764
dtype: float64

Cool beans!

Think about what that division means: for each country, pick out the value in the explored vector for that country:

$A = \begin{pmatrix} exploration_{US} \\ exploration_{china} \\ exploration_{france} \\ \vdots \end{pmatrix} $

and then pick out the value in the enrolled vector for that country:

$B = \begin{pmatrix} enrollment_{US} \\ enrollment_{china} \\ enrollment_{france} \\ \vdots \end{pmatrix} $

Earlier we talked about different kinds of operations on vectors. Here we want corresponding terms to be divided:

$\begin{pmatrix} exploration_{US}/enrollment_{US} \\ exploration_{china}/enrollment_{china} \\ exploration_{france}/enrollment_{france} \\ \vdots \end{pmatrix}$
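Crucially, pandas matches corresponding terms by index label, not by position, so the two Series don't even need to list the countries in the same order. A minimal sketch (counts borrowed from the tables above):

```python
import pandas as pd

# Same labels, different order: pandas aligns on the label
explored = pd.Series({'France': 436, 'India': 7097})
enrolled = pd.Series({'India': 88696, 'France': 4700, 'Japan': 2270})

rate = explored / enrolled
print(rate['France'])   # 436/4700

# A label present on only one side yields NaN
print(rate['Japan'])
```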

In [65]:
diligence_in_exploration=(explored_by_country/enrolled_by_country)
#Can we plot it? Yes!
In [66]:
diligence_in_exploration.plot(kind='bar', title='Diligence: Exploration per enrollment')
Out[66]:
<matplotlib.axes.AxesSubplot at 0x7ff72a5e1750>

Some nations appear to have a much higher rate of diligence in exploring chapters.

And there appears to be a positive correlation of diligence and being between 'Po' and 'Uk' in the alphabet.

Crucial Principle of Data Analysis

Correlation equals causation

In [67]:
from IPython.display import Image
Image("https://sslimgs.xkcd.com/comics/correlation.png")
    
Out[67]:
In [68]:
certified=df[df['certified']==1]
In [69]:
len(certified)
Out[69]:
17687
In [70]:
certified_by_country=certified['final_cc_cname_DI'].value_counts()
In [71]:
diligence=(certified_by_country/enrolled_by_country)
diligence.plot(kind='bar', title='Diligence redux: certification per enrollment')
Out[71]:
<matplotlib.axes.AxesSubplot at 0x7ff72a2b0510>

What if we compared completion by gender? Or by age?

In [72]:
certified_male=certified[certified['gender']=='m']
certified_male_by_country=certified_male['final_cc_cname_DI'].value_counts()
In [73]:
enrolled_male=df[df['gender']=='m']
In [74]:
enrolled_male_by_country=enrolled_male['final_cc_cname_DI'].value_counts()
In [75]:
diligence_male=(certified_male_by_country/enrolled_male_by_country)
In [76]:
diligence_male.plot(kind='bar', title='Diligence: Certification per enrollment(dudes)')
Out[76]:
<matplotlib.axes.AxesSubplot at 0x7ff729f74f10>
In [77]:
certified_female=certified[certified['gender']=='f']
certified_female_by_country=certified_female['final_cc_cname_DI'].value_counts()
enrolled_female=df[df['gender']=='f']
enrolled_female_by_country=enrolled_female['final_cc_cname_DI'].value_counts()
diligence_female=(certified_female_by_country/enrolled_female_by_country)
diligence_female.plot(kind='bar', title='Diligence: Certification per enrollment(lasses)')
Out[77]:
<matplotlib.axes.AxesSubplot at 0x7ff729ceded0>
In [78]:
diligence_genders=pd.DataFrame([diligence_female, diligence_male], index=["female","male"])
In [79]:
diligence_genders
Out[79]:
Australia Bangladesh Brazil Canada China Colombia Egypt France Germany Greece ... Pakistan Philippines Poland Portugal Russian Federation Spain Ukraine United Kingdom United States Unknown/Other
female 0.040091 0.040984 0.021733 0.027979 0.022312 0.058757 0.019417 0.062500 0.076320 0.065415 ... 0.022563 0.010365 0.055306 0.084211 0.031308 0.090592 0.033624 0.044515 0.024095 0.000907
male 0.026940 0.009324 0.024028 0.022749 0.009840 0.032022 0.010848 0.039514 0.052922 0.055712 ... 0.012190 0.019212 0.084135 0.047189 0.060886 0.077693 0.055294 0.031626 0.019549 0.000643

2 rows × 34 columns

In [80]:
diligence_genders.T
Out[80]:
female male
Australia 0.040091 0.026940
Bangladesh 0.040984 0.009324
Brazil 0.021733 0.024028
Canada 0.027979 0.022749
China 0.022312 0.009840
Colombia 0.058757 0.032022
Egypt 0.019417 0.010848
France 0.062500 0.039514
Germany 0.076320 0.052922
Greece 0.065415 0.055712
India 0.045643 0.033067
Indonesia 0.041284 0.022642
Japan 0.008811 0.013754
Mexico 0.040107 0.021894
Morocco 0.015075 0.006401
Nigeria 0.035941 0.016881
Other Africa 0.041943 0.021750
Other East Asia 0.022238 0.019379
Other Europe 0.049466 0.041956
Other Middle East/Central Asia 0.027838 0.015398
Other North & Central Amer., Caribbean 0.046016 0.032689
Other Oceania 0.023810 NaN
Other South America 0.045977 0.023626
Other South Asia 0.035273 0.030398
Pakistan 0.022563 0.012190
Philippines 0.010365 0.019212
Poland 0.055306 0.084135
Portugal 0.084211 0.047189
Russian Federation 0.031308 0.060886
Spain 0.090592 0.077693
Ukraine 0.033624 0.055294
United Kingdom 0.044515 0.031626
United States 0.024095 0.019549
Unknown/Other 0.000907 0.000643
In [81]:
diligence_genders.T[["female","male"]].plot(kind="bar", title="Diligence: Certification per enrollment (lasses and dudes)")
Out[81]:
<matplotlib.axes.AxesSubplot at 0x7ff72a2a3fd0>
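All the per-gender, per-country slicing above can also be condensed into one groupby: since `certified` is a 0/1 column, its mean within each group is exactly the certification rate. A sketch on toy data (the real dataframe has the same columns):

```python
import pandas as pd

df = pd.DataFrame({
    'final_cc_cname_DI': ['France', 'France', 'France', 'India', 'India'],
    'gender':            ['f',      'f',      'm',      'm',     'm'],
    'certified':         [1,        0,        1,        0,       1],
})

# Mean of a 0/1 column per group == certification rate per group
rates = df.groupby(['final_cc_cname_DI', 'gender'])['certified'].mean()

# Countries as rows, genders as columns, like diligence_genders.T above
table = rates.unstack('gender')
print(table)
```

Note that a missing combination (here, no Indian women) shows up as NaN, just like the male 'Other Oceania' entry in the table above.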