# Render our plots inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
#this presumes you have the data set locally
You can find the dataset and documentation at http://harvardx.harvard.edu/dataset
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"
. . . proceeding via a ‘dustbowl’ empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.
- Einhorn, “Alchemy in the Behavioral Sciences,” 1972
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.
- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"
Pandas: first-line python tool for EDA
Pandas: charismatic megafauna
import pandas as pd
Consider a toy example: average house prices by year, adjusted by the CPI.
If we want to understand fluctuations in house prices over time, we can use a new datatype called a Series.
CPI={"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957} #http://www.bls.gov/cpi/home.htm
The Series constructor converts this dict to a pandas series:
CPI_series=pd.Series(CPI)
CPI_series
2010    218.056
2011    224.939
2012    229.594
2013    232.957
dtype: float64
House_sale_mean={"2010": 100000, "2011": 100000, "2012": 100000, "2013": 100000}
House_sale_series=pd.Series(House_sale_mean)
House_sale_series
2010    100000
2011    100000
2012    100000
2013    100000
dtype: int64
The CPI provides "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." A higher number means it costs more to buy the same goods. It was set to 100 in 1982-4.
We can thus use it to measure the effects of inflation on the value of houses in our toy example.
(House_sale_series/CPI_series)
2010    458.597791
2011    444.564971
2012    435.551452
2013    429.263770
dtype: float64
What's going on here: pandas has taken the two series and divided the values that share the same index.
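This matching-by-label is called index alignment, and it also tells you what happens when the indices don't fully overlap: labels present in only one series come out as NaN. A minimal sketch (the CPI values are from the table above; the prices are hypothetical):

```python
import pandas as pd

# Partially overlapping indices: prices are hypothetical, CPI values from above
prices = pd.Series({"2010": 100000, "2011": 110000, "2012": 120000})
cpi = pd.Series({"2011": 224.939, "2012": 229.594, "2013": 232.957})

ratio = prices / cpi  # values are matched by index label before dividing
print(ratio)  # "2010" and "2013" exist in only one series, so they come out NaN
```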
#If we multiply each one by 100, we'll get the value of our houses in 1982-4 dollars.
(House_sale_series/CPI_series)*100
2010    45859.779139
2011    44456.497095
2012    43555.145169
2013    42926.376971
dtype: float64
inflation_adjusted=(House_sale_series/CPI_series)*100
#we can perform calculations on individual elements
inflation_adjusted['2013']/inflation_adjusted['2010']
0.93603540567572574
#plotting is as easy as pie
inflation_adjusted.plot()
<matplotlib.axes.AxesSubplot at 0x7ff736b76a50>
(House_sale_series/CPI_series).plot(title="Sorry, kids. Blame X, where X is current politician we don't like.")
<matplotlib.axes.AxesSubplot at 0x7ff736b76f50>
What's super deceptive about this plot?
The library pandas provides a powerful overlay that lets us use matrices while always keeping their row and column names: a spreadsheet on speed. Its central datatype, the DataFrame, keeps track of values and their names for us, and it lets us perform many operations on slices of a dataframe without writing for loops and the like. This is both more convenient and faster.
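As a tiny illustration of that labelled-matrix idea before we load the real data (toy prices; the CPI figures are the ones used above):

```python
import pandas as pd

# A tiny labelled table: rows and columns keep their names through every operation
toy = pd.DataFrame(
    {"price": [100000, 110000], "cpi": [218.056, 224.939]},
    index=["2010", "2011"],
)

# One vectorized expression instead of a for loop over rows
toy["real_price"] = toy["price"] / toy["cpi"] * 100
print(toy)
```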
#Let's start with yet another way to read csv files, this time from `pandas`
df=pd.read_csv('./HMXPC13_DI_v2_5-14-14.csv', sep=",")
#take a look--it's a biggie
df
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HarvardX/CB22x/2013_Spring | MHxPC130442623 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | 2013-11-17 | NaN | 9 | NaN | NaN | 0 | NaN | 1 |
1 | HarvardX/CS50x/2012 | MHxPC130442623 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | 9 | NaN | 1 | 0 | NaN | 1 |
2 | HarvardX/CB22x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-08 | 2013-11-17 | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
3 | HarvardX/CS50x/2012 | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-17 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
4 | HarvardX/ER22x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
5 | HarvardX/PH207x/2012_Fall | MHxPC130275857 | 1 | 1 | 1 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-17 | 2013-05-23 | 502 | 16 | 50 | 12 | 0 | NaN | NaN |
6 | HarvardX/PH278x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-08 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
7 | HarvardX/CB22x/2013_Spring | MHxPC130539455 | 1 | 1 | 0 | 0 | France | NaN | NaN | NaN | 0 | 2013-01-01 | 2013-05-14 | 42 | 6 | NaN | 3 | 0 | NaN | NaN |
8 | HarvardX/CB22x/2013_Spring | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-18 | 2013-03-17 | 70 | 3 | NaN | 3 | 0 | NaN | NaN |
9 | HarvardX/CS50x/2012 | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-20 | NaN | NaN | 12 | NaN | 3 | 0 | NaN | 1 |
10 | HarvardX/ER22x/2013_Spring | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-23 | 2013-06-14 | 17 | 2 | NaN | 2 | 0 | NaN | NaN |
11 | HarvardX/ER22x/2013_Spring | MHxPC130198098 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-06-17 | 2013-06-17 | 32 | 1 | NaN | 3 | 0 | NaN | NaN |
12 | HarvardX/CB22x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.07 | 2013-01-24 | 2013-08-03 | 175 | 9 | NaN | 7 | 0 | NaN | NaN |
13 | HarvardX/CS50x/2012 | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-06-27 | NaN | NaN | 2 | NaN | 2 | 0 | NaN | 1 |
14 | HarvardX/ER22x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | 2013-08-17 | 78 | 5 | NaN | 4 | 0 | NaN | NaN |
15 | HarvardX/PH207x/2012_Fall | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-07-26 | 2013-01-16 | 75 | 14 | 5 | 2 | 0 | NaN | NaN |
16 | HarvardX/PH278x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-07-30 | 2013-08-27 | 11 | 2 | 2 | 1 | 0 | NaN | NaN |
17 | HarvardX/CS50x/2012 | MHxPC130080986 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | 11 | NaN | 1 | 0 | NaN | 1 |
18 | HarvardX/PH207x/2012_Fall | MHxPC130080986 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-25 | 2012-12-04 | 56 | 11 | 1 | 2 | 1 | NaN | NaN |
19 | HarvardX/CS50x/2012 | MHxPC130063375 | 1 | 1 | 0 | 0 | Unknown/Other | NaN | NaN | NaN | 0 | 2012-10-19 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
20 | HarvardX/CS50x/2012 | MHxPC130094371 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-03-03 | 2013-03-03 | 7 | 1 | NaN | 2 | 0 | NaN | NaN |
21 | HarvardX/CS50x/2012 | MHxPC130229084 | 1 | 1 | 0 | 0 | Mexico | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
22 | HarvardX/CS50x/2012 | MHxPC130300925 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-24 | NaN | NaN | 2 | NaN | 1 | 0 | NaN | 1 |
23 | HarvardX/ER22x/2013_Spring | MHxPC130300925 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-20 | 2013-05-18 | 15 | 2 | NaN | 2 | 0 | NaN | NaN |
24 | HarvardX/CS50x/2012 | MHxPC130417650 | 1 | 1 | 0 | 0 | Australia | NaN | NaN | NaN | 0 | 2012-10-29 | 2013-03-04 | 1 | 1 | NaN | 2 | 0 | NaN | NaN |
25 | HarvardX/CS50x/2012 | MHxPC130506580 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-04 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
26 | HarvardX/CS50x/2012 | MHxPC130298257 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-05 | NaN | NaN | NaN | NaN | 3 | 0 | NaN | 1 |
27 | HarvardX/CS50x/2012 | MHxPC130500569 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-22 | 2013-03-30 | 6 | 1 | NaN | 5 | 0 | NaN | NaN |
28 | HarvardX/CS50x/2012 | MHxPC130466479 | 1 | 1 | 0 | 0 | Unknown/Other | NaN | NaN | NaN | 0 | 2013-01-07 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
29 | HarvardX/CB22x/2013_Spring | MHxPC130340959 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.05 | 2013-02-11 | 2013-04-06 | 285 | 8 | NaN | 4 | 0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
641108 | MITx/6.002x/2013_Spring | MHxPC130140735 | 1 | 1 | 0 | 0 | United States | Bachelor's | 1991 | m | NaN | 2013-09-07 | 2013-09-07 | 59 | 1 | 5 | 3 | 0 | NaN | NaN |
641109 | MITx/6.00x/2013_Spring | MHxPC130493130 | 1 | 0 | 0 | 0 | United Kingdom | Master's | 1977 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
641110 | MITx/6.00x/2013_Spring | MHxPC130400592 | 1 | 1 | 0 | 0 | Other Europe | Secondary | 1992 | m | NaN | 2013-09-07 | 2013-09-07 | 395 | 1 | 51 | 4 | 0 | NaN | NaN |
641111 | MITx/6.00x/2013_Spring | MHxPC130109892 | 1 | 1 | 0 | 0 | India | Secondary | 1995 | m | NaN | 2013-09-07 | 2013-09-07 | 49 | 1 | 14 | 2 | 0 | NaN | NaN |
641112 | MITx/14.73x/2013_Spring | MHxPC130183007 | 1 | 0 | 0 | 0 | India | Master's | 1985 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641113 | MITx/8.MReV/2013_Summer | MHxPC130261281 | 1 | 1 | 0 | 0 | India | Secondary | 1994 | m | 0 | 2013-09-07 | 2013-09-07 | 8 | 1 | NaN | 1 | 0 | NaN | NaN |
641114 | MITx/6.00x/2013_Spring | MHxPC130481990 | 1 | 1 | 0 | 0 | India | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 22 | 1 | 5 | 1 | 0 | NaN | NaN |
641115 | MITx/6.00x/2013_Spring | MHxPC130528581 | 1 | 0 | 0 | 0 | United States | Bachelor's | 1990 | f | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 3 | 0 | NaN | NaN |
641116 | MITx/14.73x/2013_Spring | MHxPC130555418 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641117 | MITx/6.002x/2013_Spring | MHxPC130408810 | 1 | 0 | 0 | 0 | India | Secondary | 1993 | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 3 | 0 | NaN | NaN |
641118 | MITx/6.00x/2013_Spring | MHxPC130040184 | 1 | 0 | 0 | 0 | United States | Secondary | 1991 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641119 | MITx/6.002x/2013_Spring | MHxPC130566049 | 1 | 0 | 0 | 0 | Other Europe | Master's | 1982 | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 2 | 0 | NaN | NaN |
641120 | MITx/8.MReV/2013_Summer | MHxPC130374105 | 1 | 1 | 0 | 0 | India | Bachelor's | 1992 | m | 0 | 2013-09-07 | 2013-09-07 | 49 | 1 | NaN | 1 | 0 | NaN | NaN |
641121 | MITx/6.00x/2013_Spring | MHxPC130282999 | 1 | 0 | 0 | 0 | Other Europe | Master's | 1979 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | 7 | 0 | NaN | 1 |
641122 | MITx/8.MReV/2013_Summer | MHxPC130556398 | 1 | 0 | 0 | 0 | India | Bachelor's | 1985 | m | 0 | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641123 | MITx/6.00x/2013_Spring | MHxPC130573334 | 1 | 0 | 0 | 0 | Spain | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641124 | MITx/6.00x/2013_Spring | MHxPC130505931 | 1 | 1 | 0 | 0 | India | Secondary | 1995 | m | NaN | 2013-09-07 | 2013-09-07 | 59 | 1 | NaN | 2 | 0 | NaN | NaN |
641125 | MITx/6.002x/2013_Spring | MHxPC130280976 | 1 | 0 | 0 | 0 | United States | Bachelor's | NaN | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | NaN | 0 | NaN | NaN |
641126 | MITx/6.00x/2013_Spring | MHxPC130137331 | 1 | 1 | 0 | 0 | United States | Secondary | 1992 | m | NaN | 2013-09-07 | 2013-09-07 | 251 | 1 | 77 | 4 | 0 | NaN | NaN |
641127 | MITx/6.002x/2013_Spring | MHxPC130271624 | 1 | 0 | 0 | 0 | India | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641128 | MITx/14.73x/2013_Spring | MHxPC130256541 | 1 | 1 | 0 | 0 | United States | Master's | 1982 | m | NaN | 2013-09-07 | 2013-09-07 | 51 | 1 | 1 | 1 | 0 | NaN | NaN |
641129 | MITx/6.00x/2013_Spring | MHxPC130021638 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641130 | MITx/14.73x/2013_Spring | MHxPC130591057 | 1 | 0 | 0 | 0 | Canada | Bachelor's | NaN | f | NaN | 2013-09-07 | 2013-09-07 | 6 | 1 | NaN | NaN | 0 | NaN | NaN |
641131 | MITx/8.02x/2013_Spring | MHxPC130226305 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | 2013-09-07 | 11 | 1 | NaN | 2 | 0 | NaN | NaN |
641132 | MITx/6.002x/2013_Spring | MHxPC130030805 | 1 | 1 | 0 | 0 | Pakistan | Master's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 29 | 1 | NaN | 1 | 0 | NaN | NaN |
641133 | MITx/6.00x/2013_Spring | MHxPC130184108 | 1 | 1 | 0 | 0 | Canada | Bachelor's | 1991 | m | NaN | 2013-09-07 | 2013-09-07 | 97 | 1 | 4 | 2 | 0 | NaN | NaN |
641134 | MITx/6.00x/2013_Spring | MHxPC130359782 | 1 | 0 | 0 | 0 | Other Europe | Bachelor's | 1991 | f | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641135 | MITx/6.002x/2013_Spring | MHxPC130098513 | 1 | 0 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641136 | MITx/6.00x/2013_Spring | MHxPC130098513 | 1 | 1 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | 2013-09-07 | 74 | 1 | 14 | 1 | 0 | NaN | NaN |
641137 | MITx/8.02x/2013_Spring | MHxPC130098513 | 1 | 0 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | NaN | NaN | 1 | NaN | NaN | 0 | NaN | 1 |
641138 rows × 20 columns
Note at the bottom that the display tells us how many rows and columns we're dealing with.
As a general rule, pandas dataframe objects default to slicing by column, using a syntax you'll know from dicts, as in df["course_id"].
df["course_id"]
0 HarvardX/CB22x/2013_Spring 1 HarvardX/CS50x/2012 2 HarvardX/CB22x/2013_Spring 3 HarvardX/CS50x/2012 4 HarvardX/ER22x/2013_Spring 5 HarvardX/PH207x/2012_Fall 6 HarvardX/PH278x/2013_Spring 7 HarvardX/CB22x/2013_Spring 8 HarvardX/CB22x/2013_Spring 9 HarvardX/CS50x/2012 10 HarvardX/ER22x/2013_Spring 11 HarvardX/ER22x/2013_Spring 12 HarvardX/CB22x/2013_Spring 13 HarvardX/CS50x/2012 14 HarvardX/ER22x/2013_Spring ... 641123 MITx/6.00x/2013_Spring 641124 MITx/6.00x/2013_Spring 641125 MITx/6.002x/2013_Spring 641126 MITx/6.00x/2013_Spring 641127 MITx/6.002x/2013_Spring 641128 MITx/14.73x/2013_Spring 641129 MITx/6.00x/2013_Spring 641130 MITx/14.73x/2013_Spring 641131 MITx/8.02x/2013_Spring 641132 MITx/6.002x/2013_Spring 641133 MITx/6.00x/2013_Spring 641134 MITx/6.00x/2013_Spring 641135 MITx/6.002x/2013_Spring 641136 MITx/6.00x/2013_Spring 641137 MITx/8.02x/2013_Spring Name: course_id, Length: 641138, dtype: object
We can select some subset of rows as you might expect:
df["course_id"][3340:3350]
3340            HarvardX/CS50x/2012
3341     HarvardX/ER22x/2013_Spring
3342    HarvardX/PH278x/2013_Spring
3343            HarvardX/CS50x/2012
3344            HarvardX/CS50x/2012
3345     HarvardX/ER22x/2013_Spring
3346            HarvardX/CS50x/2012
3347     HarvardX/CB22x/2013_Spring
3348            HarvardX/CS50x/2012
3349            HarvardX/CS50x/2012
Name: course_id, dtype: object
df[3340:3350]
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3340 | HarvardX/CS50x/2012 | MHxPC130386705 | 1 | 1 | 0 | 0 | Russian Federation | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
3341 | HarvardX/ER22x/2013_Spring | MHxPC130432757 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | NaN | 2013-09-05 | 2013-09-05 | 16 | 1 | NaN | 2 | 0 | NaN | NaN |
3342 | HarvardX/PH278x/2013_Spring | MHxPC130432757 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | NaN | 2012-12-25 | NaN | NaN | 1 | NaN | NaN | 0 | NaN | 1 |
3343 | HarvardX/CS50x/2012 | MHxPC130382204 | 1 | 1 | 0 | 0 | Ukraine | NaN | NaN | NaN | 0 | 2012-11-30 | NaN | NaN | NaN | NaN | 5 | 0 | NaN | 1 |
3344 | HarvardX/CS50x/2012 | MHxPC130142047 | 1 | 1 | 0 | 0 | Spain | NaN | NaN | NaN | 0.0 | 2013-07-12 | 2013-07-12 | 8 | 1 | NaN | 1 | 0 | NaN | NaN |
3345 | HarvardX/ER22x/2013_Spring | MHxPC130191600 | 1 | 0 | 0 | 0 | India | NaN | NaN | NaN | 0 | 2012-12-23 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
3346 | HarvardX/CS50x/2012 | MHxPC130079233 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
3347 | HarvardX/CB22x/2013_Spring | MHxPC130277592 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.04 | 2013-01-23 | 2013-04-04 | 333 | 8 | NaN | 4 | 0 | NaN | NaN |
3348 | HarvardX/CS50x/2012 | MHxPC130429812 | 1 | 1 | 0 | 0 | United Kingdom | NaN | NaN | NaN | 0 | 2012-08-18 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
3349 | HarvardX/CS50x/2012 | MHxPC130503405 | 1 | 0 | 0 | 0 | Russian Federation | NaN | NaN | NaN | 0.0 | 2012-09-03 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
df[666]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-342ddc6d1cab> in <module>()
----> 1 df[666]
...
KeyError: 666
Huh? I have a row 666! pandas doesn't let you select a single row with df[666]: plain brackets mean column selection. Instead, use the .ix method.
df.ix[666]
course_id            HarvardX/CS50x/2012
userid_DI                 MHxPC130297337
registered                             1
viewed                                 0
explored                               0
certified                              0
final_cc_cname_DI         United Kingdom
LoE_DI                               NaN
YoB                                  NaN
gender                               NaN
grade                                  0
start_time_DI                 2012-08-17
last_event_DI                        NaN
nevents                              NaN
ndays_act                            NaN
nplay_video                          NaN
nchapters                            NaN
nforum_posts                           0
roles                                NaN
incomplete_flag                      NaN
Name: 666, dtype: object
Why? A good question. Now try passing a list:
df.ix[[666]]
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
666 | HarvardX/CS50x/2012 | MHxPC130297337 | 1 | 0 | 0 | 0 | United Kingdom | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
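A side note for readers on newer pandas: the .ix indexer used above was deprecated and later removed. Its modern replacements are .loc (select by index label) and .iloc (select by integer position). A minimal sketch on a toy frame, not the MOOC data:

```python
import pandas as pd

# Toy frame whose index labels happen to be integers, like df's row numbers
toy = pd.DataFrame({"a": [10, 20, 30]}, index=[664, 665, 666])

row_by_label = toy.loc[666]    # label-based: what df.ix[666] did above
row_by_position = toy.iloc[2]  # position-based: the third row
print(row_by_label)
```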
What types of data did pandas import from the csv?
df.dtypes
course_id             object
userid_DI             object
registered             int64
viewed                 int64
explored               int64
certified              int64
final_cc_cname_DI     object
LoE_DI                object
YoB                  float64
gender                object
grade                 object
start_time_DI         object
last_event_DI         object
nevents              float64
ndays_act            float64
nplay_video          float64
nchapters            float64
nforum_posts           int64
roles                float64
incomplete_flag      float64
dtype: object
When importing a CSV, pandas parses each column and attempts to discern what sort of data is within. It's good but not infallible.
Let's refine our reading of the CSV to parse the dates.
df=pd.read_csv('HMXPC13_DI_v2_5-14-14.csv', sep="," , parse_dates=['start_time_DI', 'last_event_DI'])
We pass a list of the columns to consider as dates.
df.dtypes
course_id                    object
userid_DI                    object
registered                    int64
viewed                        int64
explored                      int64
certified                     int64
final_cc_cname_DI            object
LoE_DI                       object
YoB                         float64
gender                       object
grade                        object
start_time_DI        datetime64[ns]
last_event_DI        datetime64[ns]
nevents                     float64
ndays_act                   float64
nplay_video                 float64
nchapters                   float64
nforum_posts                  int64
roles                       float64
incomplete_flag             float64
dtype: object
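If you want to see parse_dates in action without the 600k-row file, here is a self-contained sketch (the two rows are hypothetical stand-ins built from values seen above):

```python
import io
import pandas as pd

# A two-row stand-in for the real file (rows are hypothetical)
csv_text = (
    "course_id,start_time_DI\n"
    "HarvardX/CS50x/2012,2012-08-17\n"
    "MITx/6.00x/2013_Spring,2013-09-07\n"
)

small = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_time_DI"])
print(small.dtypes)  # start_time_DI comes back as datetime64[ns]
```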
We don't need all those fields. We can cull them by name using the drop method.
df=df.drop(['userid_DI', 'roles', 'incomplete_flag', 'nforum_posts'], axis=1) # axis=1 means the column names, not the rows
We can pick out columns using their names--just like with dicts:
df['final_cc_cname_DI'][100:110]
100         United States
101         United States
102         United States
103         United States
104    Russian Federation
105    Russian Federation
106         United States
107         United States
108         United States
109         United States
Name: final_cc_cname_DI, dtype: object
We can easily produce new dataframes that retain the relevant indices.
df[['gender', 'ndays_act','nplay_video']][1781:1787]
#note the double [[]]--you're providing a list [x, y, ...] of the columns you want
gender | ndays_act | nplay_video | |
---|---|---|---|
1781 | NaN | NaN | NaN |
1782 | NaN | 1 | NaN |
1783 | NaN | 2 | NaN |
1784 | NaN | 3 | NaN |
1785 | NaN | 1 | NaN |
1786 | NaN | 1 | NaN |
"NaN" is not a English grandmother. It is an unreported or empty value.
Often we might treat it as a zero, but we need to be careful.
Would it matter in the case of nplay_video? How about gender?
You've been writing functions and methods for the past couple of weeks. You'll recall functions like $f(x)$ and $g(x)$ from your previous math training.
Functions can be defined on vector and matrix quantities.
For our example, just consider the vector from the intersection of the x and y axes to a point (a,b), which we might write $\bar{v}=[a,b]$.
Some take a vector and return a non-vectoral number, called a scalar.
For example, if we have the point (2,2) on the x and y axes, the length of the vector [2,2] from the origin to that point can be found with a standard distance function
$d([x,y])=\sqrt{x^2+y^2}$
$d([2,2])=\sqrt{2^2+2^2}=\sqrt{8}$
Or, more simply, we might compute the mean of the elements of a vector.
Some functions take a vector and return the same kind of vector.
$f([x,y])=[2x,3y]$
$f([1,2])=[2,6]$
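Both cases can be sketched with numpy (the vectors are the toy ones above; `f` is the component-scaling function just defined):

```python
import numpy as np

v = np.array([2, 2])

# A scalar-valued function of a vector: Euclidean length d([x, y])
length = np.sqrt(np.sum(v ** 2))  # equivalently np.linalg.norm(v)

# A vector-valued function: f([x, y]) = [2x, 3y]
def f(w):
    return np.array([2 * w[0], 3 * w[1]])

print(length)                # sqrt(8)
print(f(np.array([1, 2])))   # [2 6]
```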
For our purposes, it's pretty easy to think of vectors in 2 or 3 dimensions. In much data analysis, we'll use many more dimensions. Consider just one of the rows in our dataframe today:
Pandas makes it supremely easy to plot time series
df['start_time_DI'] # This produces a series of start times
0 2012-12-19 1 2012-10-15 2 2013-02-08 3 2012-09-17 4 2012-12-19 5 2012-09-17 6 2013-02-08 7 2013-01-01 8 2013-02-18 9 2012-10-20 10 2013-02-23 11 2013-06-17 12 2013-01-24 13 2013-06-27 14 2012-12-19 ... 641123 2013-09-07 641124 2013-09-07 641125 2013-09-07 641126 2013-09-07 641127 2013-09-07 641128 2013-09-07 641129 2013-09-07 641130 2013-09-07 641131 2013-09-07 641132 2013-09-07 641133 2013-09-07 641134 2013-09-07 641135 2013-09-07 641136 2013-09-07 641137 2013-09-07 Name: start_time_DI, Length: 641138, dtype: datetime64[ns]
What if we wanted to figure out how many people started each day? We could run a for loop and compute it ourselves. But pandas has got this covered.
We use the .value_counts() method. It counts the number of times each given value occurs in a series.
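The idea in miniature, on a hypothetical four-entry column of start dates:

```python
import pandas as pd

# Hypothetical mini-column of start dates
dates = pd.Series(["2012-08-17", "2012-08-17", "2013-01-23", "2012-08-17"])

counts = dates.value_counts()  # one entry per distinct value, most frequent first
print(counts)
```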
startdates=df['start_time_DI'].value_counts()
startdates
2012-08-17 10165 2013-01-23 8368 2012-10-15 6766 2012-08-16 6369 2012-12-20 5858 2013-02-14 5810 2012-12-21 5809 2012-08-18 5531 2012-08-13 5247 2013-03-03 5053 2012-10-16 4639 2012-07-24 4635 2013-02-15 4436 2013-01-22 4263 2012-08-20 4107 ... 2013-07-15 396 2013-07-18 390 2013-07-10 386 2013-07-20 378 2013-07-09 374 2013-07-08 365 2013-07-04 357 2013-07-12 334 2013-07-05 307 2013-07-14 279 2013-07-13 275 2013-07-06 274 2013-07-07 273 2012-07-23 5 2013-09-08 1 Length: 413
startdates.plot()
<matplotlib.axes.AxesSubplot at 0x7ff72dce9350>
df['nplay_video'].dropna().plot()
<matplotlib.axes.AxesSubplot at 0x7ff72dbb63d0>
Who are the frickin' overachievers? We'll investigate in just a sec.
How about: how many people enrolled per country?
Work with your BFF for 2 minutes to figure out:
1. how to pick out the field including country
2. how to count them
country=df['final_cc_cname_DI'].value_counts()
country
United States                             184240
India                                      88696
Unknown/Other                              82029
Other Europe                               40377
Other Africa                               23897
United Kingdom                             22131
Brazil                                     17856
Other Middle East/Central Asia             17325
Other South Asia                           12992
Canada                                     12738
Pakistan                                   10824
Russian Federation                         10432
Spain                                      10003
Other South America                         9916
Egypt                                       9286
Germany                                     8074
Nigeria                                     7483
Other East Asia                             6446
Australia                                   6419
Mexico                                      5638
Philippines                                 5374
Poland                                      5226
China                                       5170
Greece                                      5162
Colombia                                    4803
France                                      4700
Other North & Central Amer., Caribbean      4434
Ukraine                                     4100
Morocco                                     3966
Indonesia                                   3410
Bangladesh                                  3182
Japan                                       2270
Portugal                                    2193
Other Oceania                                346
dtype: int64
country[:15].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x7ff728c11a90>
Often we want to pick out all the rows that have certain attributes from the entire dataframe: those belonging to French people, or all the self-identified women, or the overachievers.
To do this, we create a vector where each element indicates whether the corresponding row satisfies the condition. It looks like a long list of Trues and Falses, so it's called a boolean vector.
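The same two-step move on a self-contained toy frame (three hypothetical rows), before we apply it to the real data below:

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({"country": ["France", "India", "France"], "n": [1, 2, 3]})

mask = toy["country"] == "France"  # boolean vector: one True/False per row
french = toy[mask]                 # keeps only the rows where mask is True
print(french)
```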
#pick out all the people who watched over 10000[!] minutes of video
df['nplay_video']>10000
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False ... 641123 False 641124 False 641125 False 641126 False 641127 False 641128 False 641129 False 641130 False 641131 False 641132 False 641133 False 641134 False 641135 False 641136 False 641137 False Name: nplay_video, Length: 641138, dtype: bool
#use that long vector of trues and falses to pick out just those rows
df[df['nplay_video']>10000]
course_id | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
83088 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Other Africa | NaN | 1986 | m | 0.93 | 2012-09-07 | 2013-01-18 | 15182 | 34 | 10668 | 15 |
130094 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Egypt | Bachelor's | 1988 | f | 0.92 | 2012-10-13 | 2013-05-19 | 33820 | 98 | 12983 | 16 |
136405 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Portugal | Master's | 1986 | m | 0.94 | 2012-10-15 | 2013-01-20 | 37440 | 71 | 17314 | 16 |
170632 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 0 | India | Master's | 1970 | m | 0 | 2012-11-11 | 2013-02-15 | 43880 | 61 | 34596 | 16 |
184827 | HarvardX/PH207x/2012_Fall | 1 | 1 | 0 | 0 | Other Africa | NaN | 1986 | m | 0 | 2012-12-05 | 2012-12-26 | 15079 | 8 | 14523 | 2 |
245432 | HarvardX/PH278x/2013_Spring | 1 | 1 | 1 | 0 | Other Africa | Bachelor's | 1982 | m | NaN | 2013-02-16 | 2013-06-17 | 11916 | 12 | 11491 | 6 |
339149 | MITx/3.091x/2012_Fall | 1 | 1 | 1 | 1 | United States | NaN | NaN | NaN | 0.9 | 2012-07-24 | 2013-07-23 | 61376 | 131 | 19179 | 16 |
339150 | MITx/8.02x/2013_Spring | 1 | 1 | 1 | 0 | United States | NaN | NaN | NaN | 0.12 | 2013-01-17 | 2013-08-11 | 45660 | 139 | 15487 | 18 |
353229 | MITx/6.00x/2012_Fall | 1 | 1 | 0 | 0 | Other South America | NaN | NaN | NaN | 0 | 2012-08-31 | 2012-12-19 | 197757 | 19 | 98517 | 5 |
440339 | MITx/6.002x/2012_Fall | 1 | 1 | 1 | 0 | Other Africa | Bachelor's | 1983 | m | 0.03 | 2012-09-12 | 2012-12-24 | 28383 | 43 | 24564 | 16 |
473105 | MITx/6.002x/2012_Fall | 1 | 1 | 1 | 1 | Spain | Bachelor's | 1993 | m | 0.71 | 2012-10-08 | 2013-01-03 | 16649 | 48 | 11458 | 15 |
Unit 353229 is deeply suspicious! 98517 video plays but only 5 chapters and 19 days of activity!
Clear as mud? Let's try another example. Can we pick out all the French?
df['final_cc_cname_DI']=="France"
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 True 8 False 9 False 10 False 11 False 12 False 13 False 14 False ... 641123 False 641124 False 641125 False 641126 False 641127 False 641128 False 641129 False 641130 False 641131 False 641132 False 641133 False 641134 False 641135 False 641136 False 641137 False Name: final_cc_cname_DI, Length: 641138, dtype: bool
We can put the entire expression within our indices, so we can pick out just the rows involving France with:
france=df[df['final_cc_cname_DI']=="France"]
years=df['YoB'].value_counts()
years
1990 38524 1992 37846 1991 37544 1989 36831 1988 34436 1993 33999 1987 31313 1986 28725 1985 25539 1994 23921 1984 22138 1983 18869 1982 16867 1981 14477 1980 13806 ... 1939 86 1938 74 1937 64 2013 61 2002 45 1936 43 1935 36 2011 34 2010 17 2003 10 2008 10 2009 8 1931 7 2007 6 1934 5 Length: 78, dtype: int64
years.plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x7ff7294fd0d0>
Looks reasonably cool, but wait: the table has the years out of order. We need to tell .value_counts not to sort!
Excuse me while I google how to do that.
years=df['YoB'].value_counts(sort=False)
years
1931 7 1934 5 1935 36 1936 43 1937 64 1938 74 1939 86 1940 92 1941 96 1942 196 1943 223 1944 221 1945 216 1946 331 1947 460 ... 1996 7754 1997 4082 1998 1935 1999 952 2000 340 2001 142 2002 45 2003 10 2007 6 2008 10 2009 8 2010 17 2011 34 2012 472 2013 61 Length: 78, dtype: int64
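A note for newer pandas versions: sort=False avoids frequency sorting, but the explicit way to order counts by the index (here, birth year) is .sort_index(). A toy sketch with hypothetical birth years:

```python
import pandas as pd

# Hypothetical birth years
years = pd.Series([1990, 1985, 1990, 1990, 1985, 2000])

# Count occurrences, then order the result by year rather than by frequency
by_year = years.value_counts().sort_index()
print(by_year)
```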
years.plot(kind="bar", title="Individual course enrollments by birth year")
<matplotlib.axes.AxesSubplot at 0x7ff729ecb890>
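An equivalent, arguably more explicit, idiom is to sort the counts by their index after the fact with `.sort_index()`, rather than relying on `sort=False`. A small sketch:

```python
import pandas as pd

s = pd.Series([1990, 1985, 1990, 1977, 1985, 1990])
# value_counts(sort=False) merely skips the sort-by-frequency step;
# sort_index() states the by-year ordering explicitly
counts = s.value_counts().sort_index()
print(counts)  # index 1977, 1985, 1990 -> counts 1, 2, 3
```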
What does the chart look like for India, rather than everywhere?
india=df['final_cc_cname_DI']=="India"
india_years=df[india]['YoB'].value_counts(sort=False)
india_years.plot(kind='bar', title="Indian enrollments by birth year")
<matplotlib.axes.AxesSubplot at 0x7ff72d81a550>
The documentation explains the different levels of completion of a course.
"explored: administrative, 0/1; anyone who accessed at least half of the chapters in the courseware (chapters are the highest level on the “courseware” menu housing course content)."
explored=df[df['explored']==1]
len(explored)
39686
years_explorers=explored['YoB'].value_counts(sort=False)
years_explorers.plot(kind='bar', title='Age of those who explored half the chapters')
<matplotlib.axes.AxesSubplot at 0x7ff72a807090>
Are there geographic differences in the amount of completion of the courses?
france_certified=france[france['certified']!=0] #index france with its own column, not df's, to avoid a reindexing warning
france_certified[:10]
course_id | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4111 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-09-26 | 2013-05-11 | 115 | 25 | NaN | 12 |
6318 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-26 | 2013-05-10 | 434 | 47 | NaN | 12 |
8138 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-28 | 2013-05-10 | 47 | 26 | NaN | 12 |
8343 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 0.9 | 2012-07-25 | 2013-07-15 | 5497 | 53 | 1165 | 16 |
10974 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-29 | 2013-05-24 | 311 | 23 | NaN | 12 |
13590 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-24 | 2013-05-12 | 683 | 50 | NaN | 12 |
17323 | HarvardX/PH278x/2013_Spring | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 0.7 | 2012-12-25 | 2013-08-07 | 1416 | 29 | 48 | 9 |
53864 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | Bachelor's | 1992 | m | 1 | 2012-08-17 | 2013-08-09 | 342 | 73 | NaN | 12 |
73021 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | Master's | 1986 | m | 1 | 2012-08-30 | 2013-05-12 | 543 | 41 | NaN | 12 |
84967 | HarvardX/ER22x/2013_Spring | 1 | 1 | 1 | 1 | France | Bachelor's | 1990 | f | 0.84 | 2012-12-21 | 2013-07-31 | 1198 | 16 | NaN | 30 |
len(france) #how many?
4700
What proportion of the French users finish the course by getting certified?
len(france_certified)
204
len(france_certified)/len(france)
0
Ça ne marche pas! (That doesn't work!) In Python 2, dividing one integer by another truncates to an integer, so 204/4700 gives 0. Coerce one operand to a float first:
float(len(france_certified))/len(france)
0.04340425531914894
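The truncation is easy to reproduce and fix with plain numbers, no dataset needed. (In Python 2, `from __future__ import division` is another way to get true division.)

```python
# In Python 2, int / int floors toward zero: 204 / 4700 == 0.
# Casting one operand to float restores the true ratio.
rate = float(204) / 4700
print(round(rate, 4))  # -> 0.0434
```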
So about 4.3% of self-reported French enrollees earn a certificate.
We could write this:
P(that a French enrollee earns a certificate)=.043
Or,
P(Some person completes a certificate given that the person is French)
If A means "some person completes a certificate" and B means "that person is French", then we write
$P(A|B)=.043$
The vertical bar (|) means "given that."
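The bar notation can be unpacked with the usual definition $P(A|B)=P(A \cap B)/P(B)$. A quick check using the counts we just computed (with N = 641138, the length of df):

```python
N = 641138        # total course enrollments in the dataset
n_france = 4700   # enrollments from France (event B)
n_both = 204      # French enrollments that earned a certificate (A and B)

p_B = n_france / float(N)
p_A_and_B = n_both / float(N)
p_A_given_B = p_A_and_B / p_B    # the Ns cancel: this is just n_both / n_france
print(round(p_A_given_B, 4))     # -> 0.0434
```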
Can we do it without computing each individually as we did for France?
Oui, for pandas will let us do vector operations directly!
For France, we divided the number of certified users by the total number of enrollees.
How can we do that division for every country all at once?
Let's compute the number of explorers and the number of enrollees for each country.
explored_by_country=explored['final_cc_cname_DI'].value_counts()
explored_by_country
United States 10132 India 7097 Other Europe 3973 United Kingdom 1759 Other Africa 1621 Spain 1415 Russian Federation 1135 Brazil 975 Unknown/Other 965 Other Middle East/Central Asia 875 Canada 874 Other South Asia 854 Germany 823 Poland 702 Other South America 666 Greece 518 Ukraine 469 Australia 443 Nigeria 438 Pakistan 437 France 436 Colombia 418 Egypt 414 Mexico 375 Other North & Central Amer., Caribbean 313 Other East Asia 304 Portugal 225 Indonesia 223 Philippines 219 China 176 Bangladesh 143 Morocco 132 Japan 112 Other Oceania 25 dtype: int64
enrolled_by_country=df['final_cc_cname_DI'].value_counts()
enrolled_by_country
United States 184240 India 88696 Unknown/Other 82029 Other Europe 40377 Other Africa 23897 United Kingdom 22131 Brazil 17856 Other Middle East/Central Asia 17325 Other South Asia 12992 Canada 12738 Pakistan 10824 Russian Federation 10432 Spain 10003 Other South America 9916 Egypt 9286 Germany 8074 Nigeria 7483 Other East Asia 6446 Australia 6419 Mexico 5638 Philippines 5374 Poland 5226 China 5170 Greece 5162 Colombia 4803 France 4700 Other North & Central Amer., Caribbean 4434 Ukraine 4100 Morocco 3966 Indonesia 3410 Bangladesh 3182 Japan 2270 Portugal 2193 Other Oceania 346 dtype: int64
And now to divide each country's exploration number by its enrollment number. We could do it with a for loop over all the countries. But we can vectorize.
explored_by_country/enrolled_by_country
Australia 0.069014 Bangladesh 0.044940 Brazil 0.054603 Canada 0.068614 China 0.034043 Colombia 0.087029 Egypt 0.044583 France 0.092766 Germany 0.101932 Greece 0.100349 India 0.080015 Indonesia 0.065396 Japan 0.049339 Mexico 0.066513 Morocco 0.033283 Nigeria 0.058533 Other Africa 0.067833 Other East Asia 0.047161 Other Europe 0.098398 Other Middle East/Central Asia 0.050505 Other North & Central Amer., Caribbean 0.070591 Other Oceania 0.072254 Other South America 0.067164 Other South Asia 0.065733 Pakistan 0.040373 Philippines 0.040752 Poland 0.134328 Portugal 0.102599 Russian Federation 0.108800 Spain 0.141458 Ukraine 0.114390 United Kingdom 0.079481 United States 0.054993 Unknown/Other 0.011764 dtype: float64
Cool beans!
Think about what that division means: for each country, pick out the value in the explored vector for that country:
$A = \begin{pmatrix} \text{exploration}_{\text{US}} \\ \text{exploration}_{\text{China}} \\ \text{exploration}_{\text{France}} \\ \vdots \end{pmatrix} $
and then pick out the value in the enrolled vector for that country:
$B = \begin{pmatrix} \text{enrollment}_{\text{US}} \\ \text{enrollment}_{\text{China}} \\ \text{enrollment}_{\text{France}} \\ \vdots \end{pmatrix} $
Before we talked about different kinds of functions on vectors. Here we want corresponding terms to be divided:
$A/B = \begin{pmatrix} \text{exploration}_{\text{US}}/\text{enrollment}_{\text{US}} \\ \text{exploration}_{\text{China}}/\text{enrollment}_{\text{China}} \\ \text{exploration}_{\text{France}}/\text{enrollment}_{\text{France}} \\ \vdots \end{pmatrix}$
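One subtlety: pandas matches the two Series by index label, not by position, so the countries don't need to appear in the same order, and a label missing from either side yields NaN. A toy sketch (numbers taken from the outputs above, with Japan's explorer count deliberately left out to show the NaN behavior):

```python
import pandas as pd

explored = pd.Series({"France": 436, "Spain": 1415})
enrolled = pd.Series({"Spain": 10003, "France": 4700, "Japan": 2270})

# Division aligns on index labels: France/France, Spain/Spain;
# Japan has no matching explorer count, so its entry is NaN.
rates = explored / enrolled
print(rates)
```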
diligence_in_exploration=(explored_by_country/enrolled_by_country)
#Can we plot it? Yes!
diligence_in_exploration.plot(kind='bar', title='Diligence: Exploration per enrollment')
<matplotlib.axes.AxesSubplot at 0x7ff72a5e1750>
Some nations appear to have a much higher rate of diligence in exploring chapters.
And there appears to be a positive correlation between diligence and having a country name between 'Po' and 'Uk' in the alphabet.
from IPython.display import Image
Image("https://sslimgs.xkcd.com/comics/correlation.png")
certified=df[df['certified']==1]
len(certified)
17687
certified_by_country=certified['final_cc_cname_DI'].value_counts()
diligence=(certified_by_country/enrolled_by_country)
diligence.plot(kind='bar', title='Diligence redux: certification per enrollment')
<matplotlib.axes.AxesSubplot at 0x7ff72a2b0510>
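Since `certified` is coded 0/1, the per-country certification rate is just the mean of that column, so `df.groupby('final_cc_cname_DI')['certified'].mean()` should reproduce the two-step ratio above in one line. A sketch on toy data:

```python
import pandas as pd

# Tiny stand-in for df with the same column names
toy = pd.DataFrame({"final_cc_cname_DI": ["France", "France", "Spain", "Spain"],
                    "certified": [1, 0, 0, 0]})

# Mean of a 0/1 column within each group == fraction certified
rates = toy.groupby("final_cc_cname_DI")["certified"].mean()
print(rates)  # France -> 0.5, Spain -> 0.0
```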
What if we compared completion by gender? Or by age?
certified_male=certified[certified['gender']=='m']
certified_male_by_country=certified_male['final_cc_cname_DI'].value_counts()
enrolled_male=df[df['gender']=='m']
enrolled_male_by_country=enrolled_male['final_cc_cname_DI'].value_counts()
diligence_male=(certified_male_by_country/enrolled_male_by_country)
diligence_male.plot(kind='bar', title='Diligence: Certification per enrollment(dudes)')
<matplotlib.axes.AxesSubplot at 0x7ff729f74f10>
certified_female=certified[certified['gender']=='f']
certified_female_by_country=certified_female['final_cc_cname_DI'].value_counts()
enrolled_female=df[df['gender']=='f']
enrolled_female_by_country=enrolled_female['final_cc_cname_DI'].value_counts()
diligence_female=(certified_female_by_country/enrolled_female_by_country)
diligence_female.plot(kind='bar', title='Diligence: Certification per enrollment(lasses)')
<matplotlib.axes.AxesSubplot at 0x7ff729ceded0>
diligence_genders=pd.DataFrame([diligence_female, diligence_male], index=["female","male"])
diligence_genders
Australia | Bangladesh | Brazil | Canada | China | Colombia | Egypt | France | Germany | Greece | ... | Pakistan | Philippines | Poland | Portugal | Russian Federation | Spain | Ukraine | United Kingdom | United States | Unknown/Other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
female | 0.040091 | 0.040984 | 0.021733 | 0.027979 | 0.022312 | 0.058757 | 0.019417 | 0.062500 | 0.076320 | 0.065415 | ... | 0.022563 | 0.010365 | 0.055306 | 0.084211 | 0.031308 | 0.090592 | 0.033624 | 0.044515 | 0.024095 | 0.000907 |
male | 0.026940 | 0.009324 | 0.024028 | 0.022749 | 0.009840 | 0.032022 | 0.010848 | 0.039514 | 0.052922 | 0.055712 | ... | 0.012190 | 0.019212 | 0.084135 | 0.047189 | 0.060886 | 0.077693 | 0.055294 | 0.031626 | 0.019549 | 0.000643 |
2 rows × 34 columns
diligence_genders.T
female | male | |
---|---|---|
Australia | 0.040091 | 0.026940 |
Bangladesh | 0.040984 | 0.009324 |
Brazil | 0.021733 | 0.024028 |
Canada | 0.027979 | 0.022749 |
China | 0.022312 | 0.009840 |
Colombia | 0.058757 | 0.032022 |
Egypt | 0.019417 | 0.010848 |
France | 0.062500 | 0.039514 |
Germany | 0.076320 | 0.052922 |
Greece | 0.065415 | 0.055712 |
India | 0.045643 | 0.033067 |
Indonesia | 0.041284 | 0.022642 |
Japan | 0.008811 | 0.013754 |
Mexico | 0.040107 | 0.021894 |
Morocco | 0.015075 | 0.006401 |
Nigeria | 0.035941 | 0.016881 |
Other Africa | 0.041943 | 0.021750 |
Other East Asia | 0.022238 | 0.019379 |
Other Europe | 0.049466 | 0.041956 |
Other Middle East/Central Asia | 0.027838 | 0.015398 |
Other North & Central Amer., Caribbean | 0.046016 | 0.032689 |
Other Oceania | 0.023810 | NaN |
Other South America | 0.045977 | 0.023626 |
Other South Asia | 0.035273 | 0.030398 |
Pakistan | 0.022563 | 0.012190 |
Philippines | 0.010365 | 0.019212 |
Poland | 0.055306 | 0.084135 |
Portugal | 0.084211 | 0.047189 |
Russian Federation | 0.031308 | 0.060886 |
Spain | 0.090592 | 0.077693 |
Ukraine | 0.033624 | 0.055294 |
United Kingdom | 0.044515 | 0.031626 |
United States | 0.024095 | 0.019549 |
Unknown/Other | 0.000907 | 0.000643 |
diligence_genders.T[["female","male"]].plot(kind="bar", title="Diligence: Certification per enrollment (lasses and dudes)")
<matplotlib.axes.AxesSubplot at 0x7ff72a2a3fd0>
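The whole two-Series, two-plot construction above can also be collapsed into a single groupby over country and gender followed by `unstack`, which pivots gender into columns. A sketch on toy data (column names as in the real df):

```python
import pandas as pd

toy = pd.DataFrame({
    "final_cc_cname_DI": ["France", "France", "Spain", "Spain"],
    "gender": ["f", "m", "f", "m"],
    "certified": [1, 0, 1, 1],
})

# Group by (country, gender), take the mean of the 0/1 column,
# then unstack the gender level into columns
table = toy.groupby(["final_cc_cname_DI", "gender"])["certified"].mean().unstack()
print(table)
```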