# Render our plots inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)
#this presumes you have the data set locally
You can find the dataset and documentation at http://harvardx.harvard.edu/dataset
Exploratory data analysis (EDA) seeks to reveal structure, or simple descriptions, in data. We look at numbers and graphs and try to find patterns.
- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"
. . . proceeding via a ‘dustbowl’ empiricism is dangerous at worst and foolish at best . . . . The purely empirical approach is particularly dangerous in an age when computers and packaged programs are readily available, since there is temptation to substitute immediate empirical analysis for more analytic thought and theory building.
- Einhorn, “Alchemy in the Behavioral Sciences,” 1972
. . . we can view the techniques of EDA as a ritual designed to reveal patterns in a data set. Thus, we may believe that naturally occurring data sets contain structure, that EDA is a useful vehicle for revealing the structure. . . . If we make no attempt to check whether the structure could have arisen by chance, and tend to accept the findings as gospel, then the ritual comes close to magical thinking. ... a controlled form of magical thinking--in the guise of 'working hypothesis'--is a basic ingredient of scientific progress.
- Persi Diaconis, "Theories of Data Analysis: From Magical Thinking Through Classical Statistics"
Pandas: first-line python tool for EDA
Pandas: charismatic megafauna
import pandas as pd
Consider a toy example: average house prices by year, adjusted by the CPI.
If we want to understand fluctuations in house prices over time, we can use a new datatype called a Series.
CPI={"2010": 218.056, "2011": 224.939, "2012": 229.594, "2013": 232.957} #http://www.bls.gov/cpi/home.htm
The Series constructor converts this dict to a pandas series:
CPI_series=pd.Series(CPI)
CPI_series
2010    218.056
2011    224.939
2012    229.594
2013    232.957
dtype: float64
House_sale_mean={"2010": 100000, "2011": 100000, "2012": 100000, "2013": 100000}
House_sale_series=pd.Series(House_sale_mean)
House_sale_series
2010    100000
2011    100000
2012    100000
2013    100000
dtype: int64
The CPI provides "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services." A higher number means it costs more to buy the same goods. It was set to 100 in 1982-4.
We can thus use it to measure the effects of inflation on the value of houses in our toy example.
(House_sale_series/CPI_series)
2010    458.597791
2011    444.564971
2012    435.551452
2013    429.263770
dtype: float64
What's going on here: pandas has taken the two series and divided the values that share the same index.
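This matching-by-label is called index alignment, and it also tells you what happens when the indices don't fully overlap: labels present in only one series come out as NaN. A minimal sketch (the CPI values are from the table above; the prices are hypothetical):

```python
import pandas as pd

# Partially overlapping indices: prices are hypothetical, CPI values from above
prices = pd.Series({"2010": 100000, "2011": 110000, "2012": 120000})
cpi = pd.Series({"2011": 224.939, "2012": 229.594, "2013": 232.957})

ratio = prices / cpi  # values are matched by index label before dividing
print(ratio)  # "2010" and "2013" exist in only one series, so they come out NaN
```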
#If we multiply each one by 100, we'll get the value of our houses in 1982-4 dollars.
(House_sale_series/CPI_series)*100
2010    45859.779139
2011    44456.497095
2012    43555.145169
2013    42926.376971
dtype: float64
inflation_adjusted=(House_sale_series/CPI_series)*100
#we can perform calculations on individual elements
inflation_adjusted['2013']/inflation_adjusted['2010']
0.93603540567572574
#plotting is as easy as pie
inflation_adjusted.plot()
<matplotlib.axes.AxesSubplot at 0x7ff736b76a50>
(House_sale_series/CPI_series).plot(title="Sorry, kids. Blame X, where X is current politician we don't like.")
<matplotlib.axes.AxesSubplot at 0x7ff736b76f50>
What's super deceptive about this plot?
The library pandas provides a powerful overlay that lets us use matrices while always keeping their row and column names: a spreadsheet on speed. Its central datatype, the DataFrame, keeps track of values and their names for us, and it lets us perform many operations on slices of a dataframe without writing for loops and the like. This is both more convenient and faster.
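As a tiny illustration of that labelled-matrix idea before we load the real data (toy prices; the CPI figures are the ones used above):

```python
import pandas as pd

# A tiny labelled table: rows and columns keep their names through every operation
toy = pd.DataFrame(
    {"price": [100000, 110000], "cpi": [218.056, 224.939]},
    index=["2010", "2011"],
)

# One vectorized expression instead of a for loop over rows
toy["real_price"] = toy["price"] / toy["cpi"] * 100
print(toy)
```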
#Let's start with yet another way to read csv files, this time from `pandas`
df=pd.read_csv('./HMXPC13_DI_v2_5-14-14.csv', sep=",")
#take a look--it's a biggie
df
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HarvardX/CB22x/2013_Spring | MHxPC130442623 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | 2013-11-17 | NaN | 9 | NaN | NaN | 0 | NaN | 1 |
1 | HarvardX/CS50x/2012 | MHxPC130442623 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | 9 | NaN | 1 | 0 | NaN | 1 |
2 | HarvardX/CB22x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-08 | 2013-11-17 | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
3 | HarvardX/CS50x/2012 | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-17 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
4 | HarvardX/ER22x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
5 | HarvardX/PH207x/2012_Fall | MHxPC130275857 | 1 | 1 | 1 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-17 | 2013-05-23 | 502 | 16 | 50 | 12 | 0 | NaN | NaN |
6 | HarvardX/PH278x/2013_Spring | MHxPC130275857 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-08 | NaN | NaN | 16 | NaN | NaN | 0 | NaN | 1 |
7 | HarvardX/CB22x/2013_Spring | MHxPC130539455 | 1 | 1 | 0 | 0 | France | NaN | NaN | NaN | 0 | 2013-01-01 | 2013-05-14 | 42 | 6 | NaN | 3 | 0 | NaN | NaN |
8 | HarvardX/CB22x/2013_Spring | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-18 | 2013-03-17 | 70 | 3 | NaN | 3 | 0 | NaN | NaN |
9 | HarvardX/CS50x/2012 | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-20 | NaN | NaN | 12 | NaN | 3 | 0 | NaN | 1 |
10 | HarvardX/ER22x/2013_Spring | MHxPC130088379 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-02-23 | 2013-06-14 | 17 | 2 | NaN | 2 | 0 | NaN | NaN |
11 | HarvardX/ER22x/2013_Spring | MHxPC130198098 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-06-17 | 2013-06-17 | 32 | 1 | NaN | 3 | 0 | NaN | NaN |
12 | HarvardX/CB22x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.07 | 2013-01-24 | 2013-08-03 | 175 | 9 | NaN | 7 | 0 | NaN | NaN |
13 | HarvardX/CS50x/2012 | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-06-27 | NaN | NaN | 2 | NaN | 2 | 0 | NaN | 1 |
14 | HarvardX/ER22x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-19 | 2013-08-17 | 78 | 5 | NaN | 4 | 0 | NaN | NaN |
15 | HarvardX/PH207x/2012_Fall | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-07-26 | 2013-01-16 | 75 | 14 | 5 | 2 | 0 | NaN | NaN |
16 | HarvardX/PH278x/2013_Spring | MHxPC130024894 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-07-30 | 2013-08-27 | 11 | 2 | 2 | 1 | 0 | NaN | NaN |
17 | HarvardX/CS50x/2012 | MHxPC130080986 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | 11 | NaN | 1 | 0 | NaN | 1 |
18 | HarvardX/PH207x/2012_Fall | MHxPC130080986 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-25 | 2012-12-04 | 56 | 11 | 1 | 2 | 1 | NaN | NaN |
19 | HarvardX/CS50x/2012 | MHxPC130063375 | 1 | 1 | 0 | 0 | Unknown/Other | NaN | NaN | NaN | 0 | 2012-10-19 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
20 | HarvardX/CS50x/2012 | MHxPC130094371 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2013-03-03 | 2013-03-03 | 7 | 1 | NaN | 2 | 0 | NaN | NaN |
21 | HarvardX/CS50x/2012 | MHxPC130229084 | 1 | 1 | 0 | 0 | Mexico | NaN | NaN | NaN | 0 | 2012-10-15 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
22 | HarvardX/CS50x/2012 | MHxPC130300925 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-24 | NaN | NaN | 2 | NaN | 1 | 0 | NaN | 1 |
23 | HarvardX/ER22x/2013_Spring | MHxPC130300925 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-12-20 | 2013-05-18 | 15 | 2 | NaN | 2 | 0 | NaN | NaN |
24 | HarvardX/CS50x/2012 | MHxPC130417650 | 1 | 1 | 0 | 0 | Australia | NaN | NaN | NaN | 0 | 2012-10-29 | 2013-03-04 | 1 | 1 | NaN | 2 | 0 | NaN | NaN |
25 | HarvardX/CS50x/2012 | MHxPC130506580 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-04 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
26 | HarvardX/CS50x/2012 | MHxPC130298257 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-09-05 | NaN | NaN | NaN | NaN | 3 | 0 | NaN | 1 |
27 | HarvardX/CS50x/2012 | MHxPC130500569 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-10-22 | 2013-03-30 | 6 | 1 | NaN | 5 | 0 | NaN | NaN |
28 | HarvardX/CS50x/2012 | MHxPC130466479 | 1 | 1 | 0 | 0 | Unknown/Other | NaN | NaN | NaN | 0 | 2013-01-07 | NaN | NaN | NaN | NaN | 1 | 0 | NaN | 1 |
29 | HarvardX/CB22x/2013_Spring | MHxPC130340959 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.05 | 2013-02-11 | 2013-04-06 | 285 | 8 | NaN | 4 | 0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
641108 | MITx/6.002x/2013_Spring | MHxPC130140735 | 1 | 1 | 0 | 0 | United States | Bachelor's | 1991 | m | NaN | 2013-09-07 | 2013-09-07 | 59 | 1 | 5 | 3 | 0 | NaN | NaN |
641109 | MITx/6.00x/2013_Spring | MHxPC130493130 | 1 | 0 | 0 | 0 | United Kingdom | Master's | 1977 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
641110 | MITx/6.00x/2013_Spring | MHxPC130400592 | 1 | 1 | 0 | 0 | Other Europe | Secondary | 1992 | m | NaN | 2013-09-07 | 2013-09-07 | 395 | 1 | 51 | 4 | 0 | NaN | NaN |
641111 | MITx/6.00x/2013_Spring | MHxPC130109892 | 1 | 1 | 0 | 0 | India | Secondary | 1995 | m | NaN | 2013-09-07 | 2013-09-07 | 49 | 1 | 14 | 2 | 0 | NaN | NaN |
641112 | MITx/14.73x/2013_Spring | MHxPC130183007 | 1 | 0 | 0 | 0 | India | Master's | 1985 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641113 | MITx/8.MReV/2013_Summer | MHxPC130261281 | 1 | 1 | 0 | 0 | India | Secondary | 1994 | m | 0 | 2013-09-07 | 2013-09-07 | 8 | 1 | NaN | 1 | 0 | NaN | NaN |
641114 | MITx/6.00x/2013_Spring | MHxPC130481990 | 1 | 1 | 0 | 0 | India | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 22 | 1 | 5 | 1 | 0 | NaN | NaN |
641115 | MITx/6.00x/2013_Spring | MHxPC130528581 | 1 | 0 | 0 | 0 | United States | Bachelor's | 1990 | f | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 3 | 0 | NaN | NaN |
641116 | MITx/14.73x/2013_Spring | MHxPC130555418 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641117 | MITx/6.002x/2013_Spring | MHxPC130408810 | 1 | 0 | 0 | 0 | India | Secondary | 1993 | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 3 | 0 | NaN | NaN |
641118 | MITx/6.00x/2013_Spring | MHxPC130040184 | 1 | 0 | 0 | 0 | United States | Secondary | 1991 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641119 | MITx/6.002x/2013_Spring | MHxPC130566049 | 1 | 0 | 0 | 0 | Other Europe | Master's | 1982 | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | 2 | 0 | NaN | NaN |
641120 | MITx/8.MReV/2013_Summer | MHxPC130374105 | 1 | 1 | 0 | 0 | India | Bachelor's | 1992 | m | 0 | 2013-09-07 | 2013-09-07 | 49 | 1 | NaN | 1 | 0 | NaN | NaN |
641121 | MITx/6.00x/2013_Spring | MHxPC130282999 | 1 | 0 | 0 | 0 | Other Europe | Master's | 1979 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | 7 | 0 | NaN | 1 |
641122 | MITx/8.MReV/2013_Summer | MHxPC130556398 | 1 | 0 | 0 | 0 | India | Bachelor's | 1985 | m | 0 | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641123 | MITx/6.00x/2013_Spring | MHxPC130573334 | 1 | 0 | 0 | 0 | Spain | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641124 | MITx/6.00x/2013_Spring | MHxPC130505931 | 1 | 1 | 0 | 0 | India | Secondary | 1995 | m | NaN | 2013-09-07 | 2013-09-07 | 59 | 1 | NaN | 2 | 0 | NaN | NaN |
641125 | MITx/6.002x/2013_Spring | MHxPC130280976 | 1 | 0 | 0 | 0 | United States | Bachelor's | NaN | m | NaN | 2013-09-07 | 2013-09-07 | 2 | 1 | NaN | NaN | 0 | NaN | NaN |
641126 | MITx/6.00x/2013_Spring | MHxPC130137331 | 1 | 1 | 0 | 0 | United States | Secondary | 1992 | m | NaN | 2013-09-07 | 2013-09-07 | 251 | 1 | 77 | 4 | 0 | NaN | NaN |
641127 | MITx/6.002x/2013_Spring | MHxPC130271624 | 1 | 0 | 0 | 0 | India | Bachelor's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641128 | MITx/14.73x/2013_Spring | MHxPC130256541 | 1 | 1 | 0 | 0 | United States | Master's | 1982 | m | NaN | 2013-09-07 | 2013-09-07 | 51 | 1 | 1 | 1 | 0 | NaN | NaN |
641129 | MITx/6.00x/2013_Spring | MHxPC130021638 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
641130 | MITx/14.73x/2013_Spring | MHxPC130591057 | 1 | 0 | 0 | 0 | Canada | Bachelor's | NaN | f | NaN | 2013-09-07 | 2013-09-07 | 6 | 1 | NaN | NaN | 0 | NaN | NaN |
641131 | MITx/8.02x/2013_Spring | MHxPC130226305 | 1 | 0 | 0 | 0 | Unknown/Other | Bachelor's | 1988 | m | NaN | 2013-09-07 | 2013-09-07 | 11 | 1 | NaN | 2 | 0 | NaN | NaN |
641132 | MITx/6.002x/2013_Spring | MHxPC130030805 | 1 | 1 | 0 | 0 | Pakistan | Master's | 1989 | m | NaN | 2013-09-07 | 2013-09-07 | 29 | 1 | NaN | 1 | 0 | NaN | NaN |
641133 | MITx/6.00x/2013_Spring | MHxPC130184108 | 1 | 1 | 0 | 0 | Canada | Bachelor's | 1991 | m | NaN | 2013-09-07 | 2013-09-07 | 97 | 1 | 4 | 2 | 0 | NaN | NaN |
641134 | MITx/6.00x/2013_Spring | MHxPC130359782 | 1 | 0 | 0 | 0 | Other Europe | Bachelor's | 1991 | f | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641135 | MITx/6.002x/2013_Spring | MHxPC130098513 | 1 | 0 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | 2013-09-07 | 1 | 1 | NaN | NaN | 0 | NaN | NaN |
641136 | MITx/6.00x/2013_Spring | MHxPC130098513 | 1 | 1 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | 2013-09-07 | 74 | 1 | 14 | 1 | 0 | NaN | NaN |
641137 | MITx/8.02x/2013_Spring | MHxPC130098513 | 1 | 0 | 0 | 0 | United States | Doctorate | 1979 | m | NaN | 2013-09-07 | NaN | NaN | 1 | NaN | NaN | 0 | NaN | 1 |
641138 rows × 20 columns
Note at the bottom that the display tells us how many rows and columns we're dealing with.
As a general rule, pandas dataframe objects default to slicing by column, using a syntax you'll know from dicts, as in df["course_id"].
df["course_id"]
0 HarvardX/CB22x/2013_Spring 1 HarvardX/CS50x/2012 2 HarvardX/CB22x/2013_Spring 3 HarvardX/CS50x/2012 4 HarvardX/ER22x/2013_Spring 5 HarvardX/PH207x/2012_Fall 6 HarvardX/PH278x/2013_Spring 7 HarvardX/CB22x/2013_Spring 8 HarvardX/CB22x/2013_Spring 9 HarvardX/CS50x/2012 10 HarvardX/ER22x/2013_Spring 11 HarvardX/ER22x/2013_Spring 12 HarvardX/CB22x/2013_Spring 13 HarvardX/CS50x/2012 14 HarvardX/ER22x/2013_Spring ... 641123 MITx/6.00x/2013_Spring 641124 MITx/6.00x/2013_Spring 641125 MITx/6.002x/2013_Spring 641126 MITx/6.00x/2013_Spring 641127 MITx/6.002x/2013_Spring 641128 MITx/14.73x/2013_Spring 641129 MITx/6.00x/2013_Spring 641130 MITx/14.73x/2013_Spring 641131 MITx/8.02x/2013_Spring 641132 MITx/6.002x/2013_Spring 641133 MITx/6.00x/2013_Spring 641134 MITx/6.00x/2013_Spring 641135 MITx/6.002x/2013_Spring 641136 MITx/6.00x/2013_Spring 641137 MITx/8.02x/2013_Spring Name: course_id, Length: 641138, dtype: object
We can select some subset of rows as you might expect:
df["course_id"][3340:3350]
3340            HarvardX/CS50x/2012
3341     HarvardX/ER22x/2013_Spring
3342    HarvardX/PH278x/2013_Spring
3343            HarvardX/CS50x/2012
3344            HarvardX/CS50x/2012
3345     HarvardX/ER22x/2013_Spring
3346            HarvardX/CS50x/2012
3347     HarvardX/CB22x/2013_Spring
3348            HarvardX/CS50x/2012
3349            HarvardX/CS50x/2012
Name: course_id, dtype: object
df[3340:3350]
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3340 | HarvardX/CS50x/2012 | MHxPC130386705 | 1 | 1 | 0 | 0 | Russian Federation | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
3341 | HarvardX/ER22x/2013_Spring | MHxPC130432757 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | NaN | 2013-09-05 | 2013-09-05 | 16 | 1 | NaN | 2 | 0 | NaN | NaN |
3342 | HarvardX/PH278x/2013_Spring | MHxPC130432757 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | NaN | 2012-12-25 | NaN | NaN | 1 | NaN | NaN | 0 | NaN | 1 |
3343 | HarvardX/CS50x/2012 | MHxPC130382204 | 1 | 1 | 0 | 0 | Ukraine | NaN | NaN | NaN | 0 | 2012-11-30 | NaN | NaN | NaN | NaN | 5 | 0 | NaN | 1 |
3344 | HarvardX/CS50x/2012 | MHxPC130142047 | 1 | 1 | 0 | 0 | Spain | NaN | NaN | NaN | 0.0 | 2013-07-12 | 2013-07-12 | 8 | 1 | NaN | 1 | 0 | NaN | NaN |
3345 | HarvardX/ER22x/2013_Spring | MHxPC130191600 | 1 | 0 | 0 | 0 | India | NaN | NaN | NaN | 0 | 2012-12-23 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
3346 | HarvardX/CS50x/2012 | MHxPC130079233 | 1 | 0 | 0 | 0 | United States | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
3347 | HarvardX/CB22x/2013_Spring | MHxPC130277592 | 1 | 1 | 0 | 0 | United States | NaN | NaN | NaN | 0.04 | 2013-01-23 | 2013-04-04 | 333 | 8 | NaN | 4 | 0 | NaN | NaN |
3348 | HarvardX/CS50x/2012 | MHxPC130429812 | 1 | 1 | 0 | 0 | United Kingdom | NaN | NaN | NaN | 0 | 2012-08-18 | NaN | NaN | NaN | NaN | 2 | 0 | NaN | 1 |
3349 | HarvardX/CS50x/2012 | MHxPC130503405 | 1 | 0 | 0 | 0 | Russian Federation | NaN | NaN | NaN | 0.0 | 2012-09-03 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
df[666]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-342ddc6d1cab> in <module>()
----> 1 df[666]
...
KeyError: 666
Huh? I have a row 666! pandas doesn't let you select a single row with df[666]: plain brackets mean column selection. Instead, use the .ix method.
df.ix[666]
course_id            HarvardX/CS50x/2012
userid_DI                 MHxPC130297337
registered                             1
viewed                                 0
explored                               0
certified                              0
final_cc_cname_DI         United Kingdom
LoE_DI                               NaN
YoB                                  NaN
gender                               NaN
grade                                  0
start_time_DI                 2012-08-17
last_event_DI                        NaN
nevents                              NaN
ndays_act                            NaN
nplay_video                          NaN
nchapters                            NaN
nforum_posts                           0
roles                                NaN
incomplete_flag                      NaN
Name: 666, dtype: object
Why? A good question. Now try passing a list:
df.ix[[666]]
course_id | userid_DI | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | nforum_posts | roles | incomplete_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
666 | HarvardX/CS50x/2012 | MHxPC130297337 | 1 | 0 | 0 | 0 | United Kingdom | NaN | NaN | NaN | 0 | 2012-08-17 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN |
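A side note for readers on newer pandas: the .ix indexer used above was deprecated and later removed. Its modern replacements are .loc (select by index label) and .iloc (select by integer position). A minimal sketch on a toy frame, not the MOOC data:

```python
import pandas as pd

# Toy frame whose index labels happen to be integers, like df's row numbers
toy = pd.DataFrame({"a": [10, 20, 30]}, index=[664, 665, 666])

row_by_label = toy.loc[666]    # label-based: what df.ix[666] did above
row_by_position = toy.iloc[2]  # position-based: the third row
print(row_by_label)
```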
What types of data did pandas import from the csv?
df.dtypes
course_id             object
userid_DI             object
registered             int64
viewed                 int64
explored               int64
certified              int64
final_cc_cname_DI     object
LoE_DI                object
YoB                  float64
gender                object
grade                 object
start_time_DI         object
last_event_DI         object
nevents              float64
ndays_act            float64
nplay_video          float64
nchapters            float64
nforum_posts           int64
roles                float64
incomplete_flag      float64
dtype: object
When importing a CSV, pandas parses each column and attempts to discern what sort of data is within. It's good but not infallible.
Let's refine our reading of the CSV to parse the dates.
df=pd.read_csv('HMXPC13_DI_v2_5-14-14.csv', sep="," , parse_dates=['start_time_DI', 'last_event_DI'])
We pass a list of the columns to consider as dates.
df.dtypes
course_id                    object
userid_DI                    object
registered                    int64
viewed                        int64
explored                      int64
certified                     int64
final_cc_cname_DI            object
LoE_DI                       object
YoB                         float64
gender                       object
grade                        object
start_time_DI        datetime64[ns]
last_event_DI        datetime64[ns]
nevents                     float64
ndays_act                   float64
nplay_video                 float64
nchapters                   float64
nforum_posts                  int64
roles                       float64
incomplete_flag             float64
dtype: object
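If you want to see parse_dates in action without the 600k-row file, here is a self-contained sketch (the two rows are hypothetical stand-ins built from values seen above):

```python
import io
import pandas as pd

# A two-row stand-in for the real file (rows are hypothetical)
csv_text = (
    "course_id,start_time_DI\n"
    "HarvardX/CS50x/2012,2012-08-17\n"
    "MITx/6.00x/2013_Spring,2013-09-07\n"
)

small = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_time_DI"])
print(small.dtypes)  # start_time_DI comes back as datetime64[ns]
```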
We don't need all those fields. We can cull them by name using the drop method.
df=df.drop(['userid_DI', 'roles', 'incomplete_flag', 'nforum_posts'], axis=1) # axis=1 means the column names, not the rows
We can pick out columns using their names--just like with dicts:
df['final_cc_cname_DI'][100:110]
100         United States
101         United States
102         United States
103         United States
104    Russian Federation
105    Russian Federation
106         United States
107         United States
108         United States
109         United States
Name: final_cc_cname_DI, dtype: object
We can easily produce new dataframes that retain the relevant indices.
df[['gender', 'ndays_act','nplay_video']][1781:1787]
#note the double [[]]--you're providing a list [x, y, ...] of the columns you want
gender | ndays_act | nplay_video | |
---|---|---|---|
1781 | NaN | NaN | NaN |
1782 | NaN | 1 | NaN |
1783 | NaN | 2 | NaN |
1784 | NaN | 3 | NaN |
1785 | NaN | 1 | NaN |
1786 | NaN | 1 | NaN |
"NaN" is not a English grandmother. It is an unreported or empty value.
Often we might treat it as a zero, but we need to be careful.
Would it matter in the case of nplay_video? How about gender?
You've been writing functions and methods for the past couple of weeks. You'll recall functions like $f(x)$ and $g(x)$ from your previous math training.
Functions can be defined on vector and matrix quantities.
For our example, just consider the vector from the intersection of the x and y axes to a point (a,b), which we might write $\bar{v}=[a,b]$.
Some take a vector and return a non-vectoral number, called a scalar.
For example, if we have the point (2,2) on the x and y axes, the length of the vector [2,2] from the origin to that point can be found with a standard distance function
$d([x,y])=\sqrt{x^2+y^2}$
$d([2,2])=\sqrt{2^2+2^2}=\sqrt{8}$
Or, more simply, we might compute the mean of the elements of a vector.
Some functions take a vector and return the same kind of vector.
$f([x,y])=[2x,3y]$
$f([1,2])=[2,6]$
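Both cases can be sketched with numpy (the vectors are the toy ones above; `f` is the component-scaling function just defined):

```python
import numpy as np

v = np.array([2, 2])

# A scalar-valued function of a vector: Euclidean length d([x, y])
length = np.sqrt(np.sum(v ** 2))  # equivalently np.linalg.norm(v)

# A vector-valued function: f([x, y]) = [2x, 3y]
def f(w):
    return np.array([2 * w[0], 3 * w[1]])

print(length)                # sqrt(8)
print(f(np.array([1, 2])))   # [2 6]
```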
For our purposes, it's pretty easy to think of vectors in 2 or 3 dimensions. In much data analysis, we'll use many more dimensions. Consider just one of the rows in our dataframe today:
Pandas makes it supremely easy to plot time series
df['start_time_DI'] # This produces a series of start times
0 2012-12-19 1 2012-10-15 2 2013-02-08 3 2012-09-17 4 2012-12-19 5 2012-09-17 6 2013-02-08 7 2013-01-01 8 2013-02-18 9 2012-10-20 10 2013-02-23 11 2013-06-17 12 2013-01-24 13 2013-06-27 14 2012-12-19 ... 641123 2013-09-07 641124 2013-09-07 641125 2013-09-07 641126 2013-09-07 641127 2013-09-07 641128 2013-09-07 641129 2013-09-07 641130 2013-09-07 641131 2013-09-07 641132 2013-09-07 641133 2013-09-07 641134 2013-09-07 641135 2013-09-07 641136 2013-09-07 641137 2013-09-07 Name: start_time_DI, Length: 641138, dtype: datetime64[ns]
What if we wanted to figure out how many people started each day? We could run a for loop and compute it ourselves. But pandas has got this covered.
We use the .value_counts() method. It counts the number of times each given value occurs in a series.
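The idea in miniature, on a hypothetical four-entry column of start dates:

```python
import pandas as pd

# Hypothetical mini-column of start dates
dates = pd.Series(["2012-08-17", "2012-08-17", "2013-01-23", "2012-08-17"])

counts = dates.value_counts()  # one entry per distinct value, most frequent first
print(counts)
```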
startdates=df['start_time_DI'].value_counts()
startdates
2012-08-17 10165 2013-01-23 8368 2012-10-15 6766 2012-08-16 6369 2012-12-20 5858 2013-02-14 5810 2012-12-21 5809 2012-08-18 5531 2012-08-13 5247 2013-03-03 5053 2012-10-16 4639 2012-07-24 4635 2013-02-15 4436 2013-01-22 4263 2012-08-20 4107 ... 2013-07-15 396 2013-07-18 390 2013-07-10 386 2013-07-20 378 2013-07-09 374 2013-07-08 365 2013-07-04 357 2013-07-12 334 2013-07-05 307 2013-07-14 279 2013-07-13 275 2013-07-06 274 2013-07-07 273 2012-07-23 5 2013-09-08 1 Length: 413
startdates.plot()
<matplotlib.axes.AxesSubplot at 0x7ff72dce9350>
df['nplay_video'].dropna().plot()
<matplotlib.axes.AxesSubplot at 0x7ff72dbb63d0>
Who are the frickin' overachievers? We'll investigate in just a sec.
How about: how many people enrolled per country?
Work with your BFF for 2 minutes to figure out:
1. how to pick out the field including country
2. how to count them
country=df['final_cc_cname_DI'].value_counts()
country
United States                             184240
India                                      88696
Unknown/Other                              82029
Other Europe                               40377
Other Africa                               23897
United Kingdom                             22131
Brazil                                     17856
Other Middle East/Central Asia             17325
Other South Asia                           12992
Canada                                     12738
Pakistan                                   10824
Russian Federation                         10432
Spain                                      10003
Other South America                         9916
Egypt                                       9286
Germany                                     8074
Nigeria                                     7483
Other East Asia                             6446
Australia                                   6419
Mexico                                      5638
Philippines                                 5374
Poland                                      5226
China                                       5170
Greece                                      5162
Colombia                                    4803
France                                      4700
Other North & Central Amer., Caribbean      4434
Ukraine                                     4100
Morocco                                     3966
Indonesia                                   3410
Bangladesh                                  3182
Japan                                       2270
Portugal                                    2193
Other Oceania                                346
dtype: int64
country[:15].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x7ff728c11a90>
Often we want to pick out all the rows that have certain attributes from the entire dataframe: those belonging to French people, or all the self-identified women, or the overachievers.
To do this, we create a vector where each element indicates whether the corresponding row satisfies the condition. It looks like a long list of Trues and Falses, so it's called a boolean vector.
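The same two-step move on a self-contained toy frame (three hypothetical rows), before we apply it to the real data below:

```python
import pandas as pd

# Hypothetical three-row frame
toy = pd.DataFrame({"country": ["France", "India", "France"], "n": [1, 2, 3]})

mask = toy["country"] == "France"  # boolean vector: one True/False per row
french = toy[mask]                 # keeps only the rows where mask is True
print(french)
```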
#pick out all the people who watched over 10000[!] minutes of video
df['nplay_video']>10000
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False ... 641123 False 641124 False 641125 False 641126 False 641127 False 641128 False 641129 False 641130 False 641131 False 641132 False 641133 False 641134 False 641135 False 641136 False 641137 False Name: nplay_video, Length: 641138, dtype: bool
#use that long vector of trues and falses to pick out just those rows
df[df['nplay_video']>10000]
course_id | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
83088 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Other Africa | NaN | 1986 | m | 0.93 | 2012-09-07 | 2013-01-18 | 15182 | 34 | 10668 | 15 |
130094 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Egypt | Bachelor's | 1988 | f | 0.92 | 2012-10-13 | 2013-05-19 | 33820 | 98 | 12983 | 16 |
136405 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | Portugal | Master's | 1986 | m | 0.94 | 2012-10-15 | 2013-01-20 | 37440 | 71 | 17314 | 16 |
170632 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 0 | India | Master's | 1970 | m | 0 | 2012-11-11 | 2013-02-15 | 43880 | 61 | 34596 | 16 |
184827 | HarvardX/PH207x/2012_Fall | 1 | 1 | 0 | 0 | Other Africa | NaN | 1986 | m | 0 | 2012-12-05 | 2012-12-26 | 15079 | 8 | 14523 | 2 |
245432 | HarvardX/PH278x/2013_Spring | 1 | 1 | 1 | 0 | Other Africa | Bachelor's | 1982 | m | NaN | 2013-02-16 | 2013-06-17 | 11916 | 12 | 11491 | 6 |
339149 | MITx/3.091x/2012_Fall | 1 | 1 | 1 | 1 | United States | NaN | NaN | NaN | 0.9 | 2012-07-24 | 2013-07-23 | 61376 | 131 | 19179 | 16 |
339150 | MITx/8.02x/2013_Spring | 1 | 1 | 1 | 0 | United States | NaN | NaN | NaN | 0.12 | 2013-01-17 | 2013-08-11 | 45660 | 139 | 15487 | 18 |
353229 | MITx/6.00x/2012_Fall | 1 | 1 | 0 | 0 | Other South America | NaN | NaN | NaN | 0 | 2012-08-31 | 2012-12-19 | 197757 | 19 | 98517 | 5 |
440339 | MITx/6.002x/2012_Fall | 1 | 1 | 1 | 0 | Other Africa | Bachelor's | 1983 | m | 0.03 | 2012-09-12 | 2012-12-24 | 28383 | 43 | 24564 | 16 |
473105 | MITx/6.002x/2012_Fall | 1 | 1 | 1 | 1 | Spain | Bachelor's | 1993 | m | 0.71 | 2012-10-08 | 2013-01-03 | 16649 | 48 | 11458 | 15 |
Unit 353229 is deeply suspicious! 98517 video plays but only 5 chapters and 19 days of activity!
Clear as mud? Let's try another example. Can we pick out all the French?
df['final_cc_cname_DI']=="France"
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 True 8 False 9 False 10 False 11 False 12 False 13 False 14 False ... 641123 False 641124 False 641125 False 641126 False 641127 False 641128 False 641129 False 641130 False 641131 False 641132 False 641133 False 641134 False 641135 False 641136 False 641137 False Name: final_cc_cname_DI, Length: 641138, dtype: bool
We can put the entire expression within our indices, so we can pick out just the rows involving France with:
france=df[df['final_cc_cname_DI']=="France"]
years=df['YoB'].value_counts()
years
1990 38524 1992 37846 1991 37544 1989 36831 1988 34436 1993 33999 1987 31313 1986 28725 1985 25539 1994 23921 1984 22138 1983 18869 1982 16867 1981 14477 1980 13806 ... 1939 86 1938 74 1937 64 2013 61 2002 45 1936 43 1935 36 2011 34 2010 17 2003 10 2008 10 2009 8 1931 7 2007 6 1934 5 Length: 78, dtype: int64
years.plot(kind="bar")
<matplotlib.axes.AxesSubplot at 0x7ff7294fd0d0>
Looks reasonably cool, but wait: the table has the years out of order. We need to tell .value_counts not to sort!
Excuse me while I google how to do that.
years=df['YoB'].value_counts(sort=False)
years
1931 7 1934 5 1935 36 1936 43 1937 64 1938 74 1939 86 1940 92 1941 96 1942 196 1943 223 1944 221 1945 216 1946 331 1947 460 ... 1996 7754 1997 4082 1998 1935 1999 952 2000 340 2001 142 2002 45 2003 10 2007 6 2008 10 2009 8 2010 17 2011 34 2012 472 2013 61 Length: 78, dtype: int64
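A note for newer pandas versions: sort=False avoids frequency sorting, but the explicit way to order counts by the index (here, birth year) is .sort_index(). A toy sketch with hypothetical birth years:

```python
import pandas as pd

# Hypothetical birth years
years = pd.Series([1990, 1985, 1990, 1990, 1985, 2000])

# Count occurrences, then order the result by year rather than by frequency
by_year = years.value_counts().sort_index()
print(by_year)
```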
years.plot(kind="bar", title="Individual course enrollments by birth year")
<matplotlib.axes.AxesSubplot at 0x7ff729ecb890>
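An equivalent, arguably more explicit, idiom is to sort the counts by their index after the fact with `.sort_index()`, rather than relying on `sort=False`. A small sketch:

```python
import pandas as pd

s = pd.Series([1990, 1985, 1990, 1977, 1985, 1990])
# value_counts(sort=False) merely skips the sort-by-frequency step;
# sort_index() states the by-year ordering explicitly
counts = s.value_counts().sort_index()
print(counts)  # index 1977, 1985, 1990 -> counts 1, 2, 3
```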
What does the chart look like for India, rather than everywhere?
india=df['final_cc_cname_DI']=="India"
india_years=df[india]['YoB'].value_counts(sort=False)
india_years.plot(kind='bar', title="Indian enrollments by birth year")
<matplotlib.axes.AxesSubplot at 0x7ff72d81a550>
The documentation explains the different levels of completion of a course.
"explored: administrative, 0/1; anyone who accessed at least half of the chapters in the courseware (chapters are the highest level on the “courseware” menu housing course content)."
explored=df[df['explored']==1]
len(explored)
39686
years_explorers=explored['YoB'].value_counts(sort=False)
years_explorers.plot(kind='bar', title='Age of those who explored half the chapters')
<matplotlib.axes.AxesSubplot at 0x7ff72a807090>
Are there geographic differences in the amount of completion of the courses?
france_certified=france[france['certified']!=0] #index france with its own column, not df's, to avoid a reindexing warning
france_certified[:10]
course_id | registered | viewed | explored | certified | final_cc_cname_DI | LoE_DI | YoB | gender | grade | start_time_DI | last_event_DI | nevents | ndays_act | nplay_video | nchapters | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4111 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-09-26 | 2013-05-11 | 115 | 25 | NaN | 12 |
6318 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-26 | 2013-05-10 | 434 | 47 | NaN | 12 |
8138 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-28 | 2013-05-10 | 47 | 26 | NaN | 12 |
8343 | HarvardX/PH207x/2012_Fall | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 0.9 | 2012-07-25 | 2013-07-15 | 5497 | 53 | 1165 | 16 |
10974 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-29 | 2013-05-24 | 311 | 23 | NaN | 12 |
13590 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 1 | 2012-07-24 | 2013-05-12 | 683 | 50 | NaN | 12 |
17323 | HarvardX/PH278x/2013_Spring | 1 | 1 | 1 | 1 | France | NaN | NaN | NaN | 0.7 | 2012-12-25 | 2013-08-07 | 1416 | 29 | 48 | 9 |
53864 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | Bachelor's | 1992 | m | 1 | 2012-08-17 | 2013-08-09 | 342 | 73 | NaN | 12 |
73021 | HarvardX/CS50x/2012 | 1 | 1 | 1 | 1 | France | Master's | 1986 | m | 1 | 2012-08-30 | 2013-05-12 | 543 | 41 | NaN | 12 |
84967 | HarvardX/ER22x/2013_Spring | 1 | 1 | 1 | 1 | France | Bachelor's | 1990 | f | 0.84 | 2012-12-21 | 2013-07-31 | 1198 | 16 | NaN | 30 |
len(france) #how many?
4700
What proportion of the French users finish the course by getting certified?
len(france_certified)
204
len(france_certified)/len(france)
0
Ça ne marche pas! (That doesn't work!) In Python 2, dividing one integer by another truncates to an integer, so 204/4700 gives 0. Coerce one operand to a float first:
float(len(france_certified))/len(france)
0.04340425531914894
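The truncation is easy to reproduce and fix with plain numbers, no dataset needed. (In Python 2, `from __future__ import division` is another way to get true division.)

```python
# In Python 2, int / int floors toward zero: 204 / 4700 == 0.
# Casting one operand to float restores the true ratio.
rate = float(204) / 4700
print(round(rate, 4))  # -> 0.0434
```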
So about 4.3% of self-reported French enrollees earn a certificate.
We could write this:
P(that a French enrollee earns a certificate)=.043
Or,
P(Some person completes a certificate given that the person is French)
If A means "some person completes a certificate" and B means "that person is French", then we write
$P(A|B)=.043$
The vertical bar (|) means "given that."
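The bar notation can be unpacked with the usual definition $P(A|B)=P(A \cap B)/P(B)$. A quick check using the counts we just computed (with N = 641138, the length of df):

```python
N = 641138        # total course enrollments in the dataset
n_france = 4700   # enrollments from France (event B)
n_both = 204      # French enrollments that earned a certificate (A and B)

p_B = n_france / float(N)
p_A_and_B = n_both / float(N)
p_A_given_B = p_A_and_B / p_B    # the Ns cancel: this is just n_both / n_france
print(round(p_A_given_B, 4))     # -> 0.0434
```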
Can we do it without computing each individually as we did for France?
Oui, for pandas will let us do vector operations directly!
For France, we divided the number of certified users by the total number of enrollees.
How can we do that division for every country all at once?
Let's compute the number of explorers and the number of enrollees for each country.
explored_by_country=explored['final_cc_cname_DI'].value_counts()
explored_by_country
United States 10132 India 7097 Other Europe 3973 United Kingdom 1759 Other Africa 1621 Spain 1415 Russian Federation 1135 Brazil 975 Unknown/Other 965 Other Middle East/Central Asia 875 Canada 874 Other South Asia 854 Germany 823 Poland 702 Other South America 666 Greece 518 Ukraine 469 Australia 443 Nigeria 438 Pakistan 437 France 436 Colombia 418 Egypt 414 Mexico 375 Other North & Central Amer., Caribbean 313 Other East Asia 304 Portugal 225 Indonesia 223 Philippines 219 China 176 Bangladesh 143 Morocco 132 Japan 112 Other Oceania 25 dtype: int64
enrolled_by_country=df['final_cc_cname_DI'].value_counts()
enrolled_by_country
United States 184240 India 88696 Unknown/Other 82029 Other Europe 40377 Other Africa 23897 United Kingdom 22131 Brazil 17856 Other Middle East/Central Asia 17325 Other South Asia 12992 Canada 12738 Pakistan 10824 Russian Federation 10432 Spain 10003 Other South America 9916 Egypt 9286 Germany 8074 Nigeria 7483 Other East Asia 6446 Australia 6419 Mexico 5638 Philippines 5374 Poland 5226 China 5170 Greece 5162 Colombia 4803 France 4700 Other North & Central Amer., Caribbean 4434 Ukraine 4100 Morocco 3966 Indonesia 3410 Bangladesh 3182 Japan 2270 Portugal 2193 Other Oceania 346 dtype: int64
And now to divide each country's exploration number by its enrollment number. We could do it with a for loop over all the countries. But we can vectorize.
explored_by_country/enrolled_by_country
Australia 0.069014 Bangladesh 0.044940 Brazil 0.054603 Canada 0.068614 China 0.034043 Colombia 0.087029 Egypt 0.044583 France 0.092766 Germany 0.101932 Greece 0.100349 India 0.080015 Indonesia 0.065396 Japan 0.049339 Mexico 0.066513 Morocco 0.033283 Nigeria 0.058533 Other Africa 0.067833 Other East Asia 0.047161 Other Europe 0.098398 Other Middle East/Central Asia 0.050505 Other North & Central Amer., Caribbean 0.070591 Other Oceania 0.072254 Other South America 0.067164 Other South Asia 0.065733 Pakistan 0.040373 Philippines 0.040752 Poland 0.134328 Portugal 0.102599 Russian Federation 0.108800 Spain 0.141458 Ukraine 0.114390 United Kingdom 0.079481 United States 0.054993 Unknown/Other 0.011764 dtype: float64
Cool beans!
Think about what that division means: for each country, pick out the value in the explored vector for that country:
$A = \begin{pmatrix} \text{exploration}_{\text{US}} \\ \text{exploration}_{\text{China}} \\ \text{exploration}_{\text{France}} \\ \vdots \end{pmatrix} $
and then pick out the value in the enrolled vector for that country:
$B = \begin{pmatrix} \text{enrollment}_{\text{US}} \\ \text{enrollment}_{\text{China}} \\ \text{enrollment}_{\text{France}} \\ \vdots \end{pmatrix} $
Before we talked about different kinds of functions on vectors. Here we want corresponding terms to be divided:
$A/B = \begin{pmatrix} \text{exploration}_{\text{US}}/\text{enrollment}_{\text{US}} \\ \text{exploration}_{\text{China}}/\text{enrollment}_{\text{China}} \\ \text{exploration}_{\text{France}}/\text{enrollment}_{\text{France}} \\ \vdots \end{pmatrix}$
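One subtlety: pandas matches the two Series by index label, not by position, so the countries don't need to appear in the same order, and a label missing from either side yields NaN. A toy sketch (numbers taken from the outputs above, with Japan's explorer count deliberately left out to show the NaN behavior):

```python
import pandas as pd

explored = pd.Series({"France": 436, "Spain": 1415})
enrolled = pd.Series({"Spain": 10003, "France": 4700, "Japan": 2270})

# Division aligns on index labels: France/France, Spain/Spain;
# Japan has no matching explorer count, so its entry is NaN.
rates = explored / enrolled
print(rates)
```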
diligence_in_exploration=(explored_by_country/enrolled_by_country)
#Can we plot it? Yes!
diligence_in_exploration.plot(kind='bar', title='Diligence: Exploration per enrollment')
<matplotlib.axes.AxesSubplot at 0x7ff72a5e1750>
Some nations appear to have a much higher rate of diligence in exploring chapters.
And there appears to be a positive correlation between diligence and having a country name between 'Po' and 'Uk' in the alphabet.
from IPython.display import Image
Image("https://sslimgs.xkcd.com/comics/correlation.png")
certified=df[df['certified']==1]
len(certified)
17687
certified_by_country=certified['final_cc_cname_DI'].value_counts()
diligence=(certified_by_country/enrolled_by_country)
diligence.plot(kind='bar', title='Diligence redux: certification per enrollment')
<matplotlib.axes.AxesSubplot at 0x7ff72a2b0510>
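Since `certified` is coded 0/1, the per-country certification rate is just the mean of that column, so `df.groupby('final_cc_cname_DI')['certified'].mean()` should reproduce the two-step ratio above in one line. A sketch on toy data:

```python
import pandas as pd

# Tiny stand-in for df with the same column names
toy = pd.DataFrame({"final_cc_cname_DI": ["France", "France", "Spain", "Spain"],
                    "certified": [1, 0, 0, 0]})

# Mean of a 0/1 column within each group == fraction certified
rates = toy.groupby("final_cc_cname_DI")["certified"].mean()
print(rates)  # France -> 0.5, Spain -> 0.0
```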
What if we compared completion by gender? Or by age?
certified_male=certified[certified['gender']=='m']
certified_male_by_country=certified_male['final_cc_cname_DI'].value_counts()
enrolled_male=df[df['gender']=='m']
enrolled_male_by_country=enrolled_male['final_cc_cname_DI'].value_counts()
diligence_male=(certified_male_by_country/enrolled_male_by_country)
diligence_male.plot(kind='bar', title='Diligence: Certification per enrollment(dudes)')
<matplotlib.axes.AxesSubplot at 0x7ff729f74f10>
certified_female=certified[certified['gender']=='f']
certified_female_by_country=certified_female['final_cc_cname_DI'].value_counts()
enrolled_female=df[df['gender']=='f']
enrolled_female_by_country=enrolled_female['final_cc_cname_DI'].value_counts()
diligence_female=(certified_female_by_country/enrolled_female_by_country)
diligence_female.plot(kind='bar', title='Diligence: Certification per enrollment(lasses)')
<matplotlib.axes.AxesSubplot at 0x7ff729ceded0>
diligence_genders=pd.DataFrame([diligence_female, diligence_male], index=["female","male"])
diligence_genders
Australia | Bangladesh | Brazil | Canada | China | Colombia | Egypt | France | Germany | Greece | ... | Pakistan | Philippines | Poland | Portugal | Russian Federation | Spain | Ukraine | United Kingdom | United States | Unknown/Other | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
female | 0.040091 | 0.040984 | 0.021733 | 0.027979 | 0.022312 | 0.058757 | 0.019417 | 0.062500 | 0.076320 | 0.065415 | ... | 0.022563 | 0.010365 | 0.055306 | 0.084211 | 0.031308 | 0.090592 | 0.033624 | 0.044515 | 0.024095 | 0.000907 |
male | 0.026940 | 0.009324 | 0.024028 | 0.022749 | 0.009840 | 0.032022 | 0.010848 | 0.039514 | 0.052922 | 0.055712 | ... | 0.012190 | 0.019212 | 0.084135 | 0.047189 | 0.060886 | 0.077693 | 0.055294 | 0.031626 | 0.019549 | 0.000643 |
2 rows × 34 columns
diligence_genders.T
female | male | |
---|---|---|
Australia | 0.040091 | 0.026940 |
Bangladesh | 0.040984 | 0.009324 |
Brazil | 0.021733 | 0.024028 |
Canada | 0.027979 | 0.022749 |
China | 0.022312 | 0.009840 |
Colombia | 0.058757 | 0.032022 |
Egypt | 0.019417 | 0.010848 |
France | 0.062500 | 0.039514 |
Germany | 0.076320 | 0.052922 |
Greece | 0.065415 | 0.055712 |
India | 0.045643 | 0.033067 |
Indonesia | 0.041284 | 0.022642 |
Japan | 0.008811 | 0.013754 |
Mexico | 0.040107 | 0.021894 |
Morocco | 0.015075 | 0.006401 |
Nigeria | 0.035941 | 0.016881 |
Other Africa | 0.041943 | 0.021750 |
Other East Asia | 0.022238 | 0.019379 |
Other Europe | 0.049466 | 0.041956 |
Other Middle East/Central Asia | 0.027838 | 0.015398 |
Other North & Central Amer., Caribbean | 0.046016 | 0.032689 |
Other Oceania | 0.023810 | NaN |
Other South America | 0.045977 | 0.023626 |
Other South Asia | 0.035273 | 0.030398 |
Pakistan | 0.022563 | 0.012190 |
Philippines | 0.010365 | 0.019212 |
Poland | 0.055306 | 0.084135 |
Portugal | 0.084211 | 0.047189 |
Russian Federation | 0.031308 | 0.060886 |
Spain | 0.090592 | 0.077693 |
Ukraine | 0.033624 | 0.055294 |
United Kingdom | 0.044515 | 0.031626 |
United States | 0.024095 | 0.019549 |
Unknown/Other | 0.000907 | 0.000643 |
diligence_genders.T[["female","male"]].plot(kind="bar", title="Diligence: Certification per enrollment (lasses and dudes)")
<matplotlib.axes.AxesSubplot at 0x7ff72a2a3fd0>
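The whole two-Series, two-plot construction above can also be collapsed into a single groupby over country and gender followed by `unstack`, which pivots gender into columns. A sketch on toy data (column names as in the real df):

```python
import pandas as pd

toy = pd.DataFrame({
    "final_cc_cname_DI": ["France", "France", "Spain", "Spain"],
    "gender": ["f", "m", "f", "m"],
    "certified": [1, 0, 1, 1],
})

# Group by (country, gender), take the mean of the 0/1 column,
# then unstack the gender level into columns
table = toy.groupby(["final_cc_cname_DI", "gender"])["certified"].mean().unstack()
print(table)
```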