A Python version of Derren Barken's "R in a Hurry".
Prepared and presented by Jaime Fernandez (jaime.frio@gmail.com) at the 2015 San Diego Data Science Fun Conference.
The latest version of this notebook and accompanying data files can be found at https://github.com/jaimefrio/data_science_fun_conference_2015
Pandas (read the docs!) is the Python Data Analysis Library. It provides
"high-performance, easy-to-use data structures and data analysis tools for the Python programming language."
Before we get to use it, we need to import it.
import pandas as pd
And since we are at it let's import some more modules that we will be needing.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
Pandas has two main objects, the Series and the DataFrame.
Series Objects
A Series (read the docs!) is a glorified NumPy array:
a = pd.Series(np.random.randn(5), name='random numbers')
a
0 0.624188 1 0.818221 2 2.075770 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
# Series can be sliced
a[-3:]
2 2.075770 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
# Series can be indexed by position
a[3]
-0.98180410052244071
# Series can be indexed with arrays of indices
a[[1, 3, 2]]
1 0.818221 3 -0.981804 2 2.075770 Name: random numbers, dtype: float64
# Series can be indexed with boolean arrays
a[[True, False, False, True, True]]
0 0.624188 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
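In practice the boolean mask almost always comes from a comparison rather than a hand-typed list of True/False values. A minimal sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([0.5, -1.2, 2.0, -0.3, 0.9])
# Comparing a Series to a scalar yields a boolean Series of the same length,
# which we can use directly as a mask
positive = s[s > 0]
print(positive.index.tolist())  # the entries at positions 0, 2 and 4 survive
```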
# Series can have labels as indices
a.index = list('abcde') # list('abcde') --> ['a', 'b', 'c', 'd', 'e']
a
a 0.624188 b 0.818221 c 2.075770 d -0.981804 e 0.368029 Name: random numbers, dtype: float64
# Indexing with numbers (and slicing) still works
a[2]
2.0757697852523171
# But we can also index with labels...
a['c']
2.0757697852523171
# ...and arrays of labels
a[['c', 'b', 'a']]
c 2.075770 b 0.818221 a 0.624188 Name: random numbers, dtype: float64
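Once a Series has labels, plain integer indexing like a[2] is ambiguous (is 2 a label or a position?). Pandas provides explicit accessors to remove the ambiguity: .loc for labels and .iloc for positions. A small sketch with a hand-built Series:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=list('abc'))
by_label = s.loc['c']     # label-based: always means the label 'c'
by_position = s.iloc[2]   # position-based: always means the third element
```

In recent pandas versions, positional integer indexing on a labeled Series via s[2] is deprecated, so the explicit accessors are the safer habit.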
# Operations on Series align on the index (i)
a[-3:]
c 2.075770 d -0.981804 e 0.368029 Name: random numbers, dtype: float64
# Operations on Series align on the index (ii)
a[:3]
a 0.624188 b 0.818221 c 2.075770 Name: random numbers, dtype: float64
# Operations on Series align on the index (iii)
a[-3:] + a[:3]
a NaN b NaN c 4.15154 d NaN e NaN Name: random numbers, dtype: float64
# Compare to the same result for the underlying numpy array
a.values[-3:] + a.values[:3]
array([ 2.69995774, -0.16358345, 2.44379836])
# A Series makes a great look-up table
gp_lut = pd.Series(data=[4, 3, 2, 1, 0], index=list('ABCDF'))
gp_lut
A 4 B 3 C 2 D 1 F 0 dtype: int64
my_grades = ['A', 'B', 'B', 'A', 'A']
my_points = gp_lut[my_grades]
my_points
A 4 B 3 B 3 A 4 A 4 dtype: int64
# Series have lots of numerical evaluation methods
my_gpa = my_points.mean()
my_gpa
3.6000000000000001
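The same look-up can be written with Series.map, which accepts a dict or a Series and keeps the original index of the values being mapped. A sketch reproducing the GPA calculation:

```python
import pandas as pd

gp_lut = pd.Series([4, 3, 2, 1, 0], index=list('ABCDF'))
grades = pd.Series(['A', 'B', 'B', 'A', 'A'])
points = grades.map(gp_lut)  # look up each grade; index stays 0..4
gpa = points.mean()
```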
DataFrame Objects
A DataFrame (read the docs!) is a two-dimensional table of heterogeneous data. It is conceptually very similar to a dictionary of Series objects with a common index.
students = pd.DataFrame({'name' : ['Peter', 'Paul', 'Mary',
'Peter', 'Paul', 'Mary'],
'subject' : ['English', 'English', 'English',
'Math', 'Math', 'Math'],
'grade' : [85.0, 76.0, 92.0, 77.0, 68.0, 87.0]})
students
grade | name | subject | |
---|---|---|---|
0 | 85 | Peter | English |
1 | 76 | Paul | English |
2 | 92 | Mary | English |
3 | 77 | Peter | Math |
4 | 68 | Paul | Math |
5 | 87 | Mary | Math |
# We can get the column names
students.columns
Index(['grade', 'name', 'subject'], dtype='object')
# We can get a single Series indexing by column name...
students['name']
0 Peter 1 Paul 2 Mary 3 Peter 4 Paul 5 Mary Name: name, dtype: object
# ...or as an attribute
students.name
0 Peter 1 Paul 2 Mary 3 Peter 4 Paul 5 Mary Name: name, dtype: object
# We can get a subset of columns indexing with an array of labels
students[['subject', 'grade']]
subject | grade | |
---|---|---|
0 | English | 85 |
1 | English | 76 |
2 | English | 92 |
3 | Math | 77 |
4 | Math | 68 |
5 | Math | 87 |
# To get a subset of rows, slicing and indexing still work
students[-3:]
grade | name | subject | |
---|---|---|---|
3 | 77 | Peter | Math |
4 | 68 | Paul | Math |
5 | 87 | Mary | Math |
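As with Series, the .iloc accessor makes position-based row selection explicit, which reads more clearly than a bare slice. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
last_three = df.iloc[-3:]  # rows at positions 2, 3 and 4
```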
# But we can give boolean indexing an interesting twist
students[students.name == 'Mary']
grade | name | subject | |
---|---|---|---|
2 | 92 | Mary | English |
5 | 87 | Mary | Math |
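Multiple conditions combine with & (and) and | (or). The parentheses around each comparison are required, because & binds more tightly than == in Python. A sketch on a small hand-built frame:

```python
import pandas as pd

students = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary', 'Peter'],
                         'subject': ['English', 'Math', 'Math', 'Math'],
                         'grade': [85.0, 68.0, 87.0, 77.0]})
# Math rows with a grade of at least 80
good_math = students[(students.subject == 'Math') & (students.grade >= 80)]
```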
But enough of pandas in the void, let's look at our dataset.
STAR is California's Standardized Testing and Reporting Program. You can get more information from the STAR website or from Wikipedia.
We are specifically going to look at the 2013 data. You can download the raw data for this tutorial from here:
http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_1_csv_v3.zip
Getting that data to a usable state, augmenting it with additional information like income or spending per student by county, is far from trivial. If we were to do that now, it would consume the whole tutorial. So we have done the ugly job for you, and you can download the clean version of the data from here:
This cleaning up of the data, munging as data scientists refer to it, is an extremely important part of a data scientist's job. The process we followed to clean up this data is covered in the appendix. Not as sexy as fitting models to the data, but equally important.
But enough talking, let's start having fun with the cleaned-up data...
Importing a CSV file couldn't be easier with pandas' read_csv (read the docs!) function. It has dozens of optional configuration parameters, but the defaults work like a charm on properly formatted data.
data = pd.read_csv('star_2013_clean_wide.csv')
data
County.Code | County.Name | Test.Id | Test.Name | Grade | Students.Tested | Mean.Scale.Score | Count.Test.Grade | Pct.Test.Grade | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Alameda | 7 | CST English-Language Arts | 2 | 16814 | 372.1 | 464896 | 0.036167 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
1 | 1 | Alameda | 8 | CST Mathematics | 2 | 16802 | 398.4 | 464515 | 0.036171 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
2 | 1 | Alameda | 8 | CST Mathematics | 3 | 16198 | 411.1 | 443961 | 0.036485 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
3 | 1 | Alameda | 7 | CST English-Language Arts | 3 | 16126 | 357.2 | 441572 | 0.036520 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
4 | 1 | Alameda | 7 | CST English-Language Arts | 4 | 15390 | 387.0 | 428906 | 0.035882 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
5 | 1 | Alameda | 8 | CST Mathematics | 4 | 15532 | 405.8 | 433012 | 0.035870 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
6 | 1 | Alameda | 8 | CST Mathematics | 5 | 15146 | 411.6 | 432775 | 0.034997 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
7 | 1 | Alameda | 7 | CST English-Language Arts | 5 | 15077 | 378.4 | 429498 | 0.035104 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
8 | 1 | Alameda | 32 | CST Science - Grade 5, Grade 8, and Grade 10 L... | 5 | 15107 | 380.6 | 431142 | 0.035040 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
9 | 1 | Alameda | 7 | CST English-Language Arts | 6 | 15277 | 374.9 | 434374 | 0.035170 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
10 | 1 | Alameda | 8 | CST Mathematics | 6 | 15332 | 380.3 | 436563 | 0.035120 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
11 | 1 | Alameda | 7 | CST English-Language Arts | 7 | 14551 | 378.4 | 431187 | 0.033746 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
12 | 1 | Alameda | 9 | CST Algebra I | 7 | 2239 | 431.6 | 37803 | 0.059228 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
13 | 1 | Alameda | 8 | CST Mathematics | 7 | 12326 | 370.2 | 393811 | 0.031299 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
14 | 1 | Alameda | 9 | CST Algebra I | 8 | 11006 | 354.2 | 276039 | 0.039871 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
15 | 1 | Alameda | 13 | CST Algebra II | 8 | 31 | 458.9 | 680 | 0.045588 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
16 | 1 | Alameda | 7 | CST English-Language Arts | 8 | 14549 | 373.9 | 435491 | 0.033408 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
17 | 1 | Alameda | 29 | CST History - Social Science Grade 8 | 8 | 15421 | 363.3 | 459125 | 0.033588 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
18 | 1 | Alameda | 28 | CST General Mathematics | 8 | 2143 | 308.4 | 145549 | 0.014724 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
19 | 1 | Alameda | 11 | CST Geometry | 8 | 1732 | 428.0 | 29035 | 0.059652 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
20 | 1 | Alameda | 32 | CST Science - Grade 5, Grade 8, and Grade 10 L... | 8 | 14552 | 404.5 | 436071 | 0.033371 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
21 | 1 | Alameda | 10 | CST Integrated Math 1 | 9 | 121 | 298.8 | 980 | 0.123469 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
22 | 1 | Alameda | 20 | CST Biology | 9 | 9473 | 377.0 | 228962 | 0.041374 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
23 | 1 | Alameda | 21 | CST Chemistry | 9 | 186 | 364.3 | 5365 | 0.034669 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
24 | 1 | Alameda | 24 | CST Integrated/Coordinated Science 1 | 9 | 905 | 321.4 | 33944 | 0.026662 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
25 | 1 | Alameda | 22 | CST Earth Science | 9 | 1651 | 328.8 | 133961 | 0.012324 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
26 | 1 | Alameda | 23 | CST Physics | 9 | 754 | 312.8 | 14112 | 0.053430 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
27 | 1 | Alameda | 25 | CST Integrated/Coordinated Science 2 | 9 | 124 | 323.0 | 1351 | 0.091784 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
28 | 1 | Alameda | 18 | CST World History | 9 | 597 | 320.1 | 35087 | 0.017015 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
29 | 1 | Alameda | 11 | CST Geometry | 9 | 6613 | 353.4 | 144063 | 0.045904 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3700 | 58 | Yuba | 22 | CST Earth Science | 10 | 290 | 327.0 | 30649 | 0.009462 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3701 | 58 | Yuba | 15 | CST Summative High School Mathematics | 10 | 19 | 353.3 | 24757 | 0.000767 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3702 | 58 | Yuba | 21 | CST Chemistry | 10 | 166 | 358.8 | 136209 | 0.001219 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3703 | 58 | Yuba | 13 | CST Algebra II | 10 | 201 | 335.9 | 131448 | 0.001529 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3704 | 58 | Yuba | 9 | CST Algebra I | 10 | 259 | 285.9 | 104567 | 0.002477 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3705 | 58 | Yuba | 7 | CST English-Language Arts | 10 | 931 | 332.0 | 455362 | 0.002045 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3706 | 58 | Yuba | 20 | CST Biology | 10 | 238 | 343.3 | 228138 | 0.001043 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3707 | 58 | Yuba | 11 | CST Geometry | 10 | 274 | 296.1 | 156167 | 0.001755 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3708 | 58 | Yuba | 11 | CST Geometry | 11 | 151 | 287.1 | 78308 | 0.001928 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3709 | 58 | Yuba | 19 | CST U.S. History | 11 | 895 | 325.7 | 447386 | 0.002001 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3710 | 58 | Yuba | 20 | CST Biology | 11 | 186 | 330.3 | 99776 | 0.001864 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3711 | 58 | Yuba | 7 | CST English-Language Arts | 11 | 899 | 324.9 | 440115 | 0.002043 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3712 | 58 | Yuba | 9 | CST Algebra I | 11 | 187 | 281.3 | 49868 | 0.003750 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3713 | 58 | Yuba | 13 | CST Algebra II | 11 | 199 | 302.4 | 122079 | 0.001630 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3714 | 58 | Yuba | 21 | CST Chemistry | 11 | 121 | 342.4 | 133804 | 0.000904 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3715 | 58 | Yuba | 15 | CST Summative High School Mathematics | 11 | 153 | 322.6 | 124304 | 0.001231 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3716 | 58 | Yuba | 22 | CST Earth Science | 11 | 187 | 326.8 | 42331 | 0.004418 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3717 | 58 | Yuba | 23 | CST Physics | 11 | 91 | 363.8 | 56726 | 0.001604 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3718 | 58 | Yuba | 18 | CST World History | 11 | 40 | 300.3 | 16163 | 0.002475 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3719 | 58 | Yuba | 18 | CST World History | 13 | 874 | 329.1 | 474255 | 0.001843 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3720 | 58 | Yuba | 25 | CST Integrated/Coordinated Science 2 | 13 | 14 | 302.3 | 3742 | 0.003741 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3721 | 58 | Yuba | 20 | CST Biology | 13 | 876 | 340.6 | 556893 | 0.001573 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3722 | 58 | Yuba | 22 | CST Earth Science | 13 | 875 | 328.9 | 206991 | 0.004227 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3723 | 58 | Yuba | 23 | CST Physics | 13 | 92 | 363.6 | 80833 | 0.001138 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3724 | 58 | Yuba | 28 | CST General Mathematics | 13 | 585 | 299.5 | 190615 | 0.003069 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3725 | 58 | Yuba | 13 | CST Algebra II | 13 | 431 | 320.6 | 286737 | 0.001503 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3726 | 58 | Yuba | 21 | CST Chemistry | 13 | 287 | 351.9 | 275452 | 0.001042 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3727 | 58 | Yuba | 15 | CST Summative High School Mathematics | 13 | 172 | 326.0 | 149987 | 0.001147 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3728 | 58 | Yuba | 11 | CST Geometry | 13 | 695 | 307.8 | 407658 | 0.001705 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3729 | 58 | Yuba | 9 | CST Algebra I | 13 | 1495 | 319.4 | 711705 | 0.002101 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3730 rows × 16 columns
# A closer look at the available columns
data.columns
Index(['County.Code', 'County.Name', 'Test.Id', 'Test.Name', 'Grade', 'Students.Tested', 'Mean.Scale.Score', 'Count.Test.Grade', 'Pct.Test.Grade', 'Population', 'per.capita.income', 'median.household.income', 'median.family.income', 'Spend', 'ADAttend', 'Spend.Per.ADA'], dtype='object')
# What are the unique test names?
data['Test.Name'].unique()
array(['CST English-Language Arts', 'CST Mathematics', 'CST Science - Grade 5, Grade 8, and Grade 10 Life Science', 'CST Algebra I', 'CST Algebra II', 'CST History - Social Science Grade 8', 'CST General Mathematics', 'CST Geometry', 'CST Integrated Math 1', 'CST Biology', 'CST Chemistry', 'CST Integrated/Coordinated Science 1', 'CST Earth Science', 'CST Physics', 'CST Integrated/Coordinated Science 2', 'CST World History', 'CST Summative High School Mathematics', 'CST Integrated Math 2', 'CST U.S. History', 'CST Integrated/Coordinated Science 4', 'CST Integrated Math 3', 'CST Integrated/Coordinated Science 3'], dtype=object)
Which introduces us to the pivot_table (read the docs!) method of DataFrames.
data.pivot_table(values='Students.Tested', index='Test.Name',
columns='Grade', aggfunc='sum')
Grade | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|
Test.Name | |||||||||||
CST Algebra I | NaN | NaN | NaN | NaN | NaN | 37803 | 276039 | 243349 | 104555 | 49846 | 711671 |
CST Algebra II | NaN | NaN | NaN | NaN | NaN | NaN | 680 | 32400 | 131448 | 122079 | 286737 |
CST Biology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 228962 | 228138 | 99753 | 556867 |
CST Chemistry | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5365 | 136209 | 133804 | 275452 |
CST Earth Science | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 133961 | 30649 | 42316 | 206974 |
CST English-Language Arts | 464896 | 441572 | 428906 | 429498 | 434374 | 431187 | 435491 | 463195 | 455237 | 439972 | NaN |
CST General Mathematics | NaN | NaN | NaN | NaN | NaN | NaN | 145549 | 45052 | NaN | NaN | 190615 |
CST Geometry | NaN | NaN | NaN | NaN | NaN | NaN | 29035 | 144063 | 156167 | 78308 | 407658 |
CST History - Social Science Grade 8 | NaN | NaN | NaN | NaN | NaN | NaN | 459125 | NaN | NaN | NaN | NaN |
CST Integrated Math 1 | NaN | NaN | NaN | NaN | NaN | NaN | 125 | 980 | 5965 | 6318 | 13486 |
CST Integrated Math 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 236 | 1432 | 3364 | 5102 |
CST Integrated Math 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 | 153 | 530 | 769 |
CST Integrated/Coordinated Science 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 33944 | 4401 | 7963 | 46386 |
CST Integrated/Coordinated Science 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1351 | 1413 | 955 | 3742 |
CST Integrated/Coordinated Science 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11 | 46 | 874 | 952 |
CST Integrated/Coordinated Science 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18 | NaN | NaN | 43 |
CST Mathematics | 464515 | 443961 | 433012 | 432775 | 436563 | 393811 | NaN | NaN | NaN | NaN | NaN |
CST Physics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14112 | 9902 | 56726 | 80833 |
CST Science - Grade 5, Grade 8, and Grade 10 Life Science | NaN | NaN | NaN | 431142 | NaN | NaN | 436071 | NaN | 451253 | NaN | NaN |
CST Summative High School Mathematics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 809 | 24757 | 124304 | 149987 |
CST U.S. History | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 447253 | NaN |
CST World History | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35087 | 422893 | 16163 | 474237 |
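The mechanics of pivot_table are easier to see on a tiny synthetic frame (the column names below are made up for illustration): values picks the column to aggregate, index and columns pick the row and column labels of the result, and aggfunc says how to combine rows that land in the same cell.

```python
import pandas as pd

scores = pd.DataFrame({'test': ['ELA', 'ELA', 'Math', 'Math', 'Math'],
                       'grade': [2, 3, 2, 3, 3],
                       'tested': [100, 120, 90, 60, 50]})
table = scores.pivot_table(values='tested', index='test',
                           columns='grade', aggfunc='sum')
# The two Math/grade-3 rows are summed into one cell: 60 + 50 = 110
```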
We are going to use a very powerful feature of DataFrames, the groupby (read the docs!) method.
The section of the docs on the split-apply-combine philosophy of Group By operations (read the docs!) is well worth a careful read.
# First, keep the English test rows only (.copy() avoids a SettingWithCopyWarning
# when we add a column to this subset below)
english = data[data['Test.Name'] == 'CST English-Language Arts'].copy()
english_grade = english.groupby('Grade')['Mean.Scale.Score'].aggregate(np.mean)
english_grade
Grade 2 355.771930 3 343.361404 4 370.684211 5 363.074138 6 361.592982 7 366.333333 8 362.998276 9 360.000000 10 344.621053 11 340.910526 Name: Mean.Scale.Score, dtype: float64
english_grade.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x109dc0400>
Wait a minute! Was that calculation correct?
Not really! What we have is the average across counties, not across students. Let's try to fix that.
# We add a new calculated column to our dataframe
english['Weighted.Score'] = english['Mean.Scale.Score'] * english['Pct.Test.Grade']
# Then group by grade and plot the aggregated sum of this column
english.groupby('Grade')['Weighted.Score'].aggregate(np.sum).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10a1c0f60>
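An equivalent way to get a student-weighted mean per group is np.average with weights inside a groupby apply. A sketch on a toy frame whose column names mirror the real file:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Grade': [2, 2, 3, 3],
                   'Mean.Scale.Score': [350.0, 370.0, 340.0, 360.0],
                   'Students.Tested': [100, 300, 200, 200]})
# For each grade, weight each county's mean score by how many students it tested
weighted = df.groupby('Grade')[['Mean.Scale.Score', 'Students.Tested']].apply(
    lambda g: np.average(g['Mean.Scale.Score'], weights=g['Students.Tested']))
# grade 2: (350*100 + 370*300) / 400 = 365.0; grade 3: 350.0
```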
# Keep the relevant rows only
math_7th = data[(data['Test.Name'] == 'CST Mathematics') &
(data['Grade'] == 7)]
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10a2d6940>
What's with the outlier? Let's try to figure it out...
# Let's peek at the offending row
math_7th[math_7th['Mean.Scale.Score'] < 300]
County.Code | County.Name | Test.Id | Test.Name | Grade | Students.Tested | Mean.Scale.Score | Count.Test.Grade | Pct.Test.Grade | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2918 | 46 | Sierra | 8 | CST Mathematics | 7 | 27 | 259.5 | 393811 | 0.000069 | 3277 | 26137 | 50308 | 56469 | 4739373 | 354 | 13391 |
Only 27 students tested, that doesn't seem like very many. Let's see how it compares to other counties.
math_7th.plot(x='Students.Tested', y='Mean.Scale.Score',
kind='scatter', logx=True)
<matplotlib.axes._subplots.AxesSubplot at 0x10a231710>
So Sierra seems to be a legit outlier. Let's remove it from our analysis.
math_7th = math_7th[math_7th['County.Name'] != 'Sierra']
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10a8ed9b0>
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th['median.family.income'] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.298 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.285 |
Method: | Least Squares | F-statistic: | 22.93 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 1.35e-05 |
Time: | 22:58:08 | Log-Likelihood: | -212.31 |
No. Observations: | 56 | AIC: | 428.6 |
Df Residuals: | 54 | BIC: | 432.7 |
Df Model: | 1 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 332.0775 | 6.059 | 54.808 | 0.000 | 319.930 344.225 |
median.family.income | 0.0004 | 8.87e-05 | 4.788 | 0.000 | 0.000 0.001 |
Omnibus: | 0.749 | Durbin-Watson: | 1.990 |
---|---|---|---|
Prob(Omnibus): | 0.688 | Jarque-Bera (JB): | 0.280 |
Skew: | 0.142 | Prob(JB): | 0.869 |
Kurtosis: | 3.197 | Cond. No. | 2.84e+05 |
# Let's plot the regression line on top of the data
x_ = np.array([X.min(), X.max()])
y_ = est.predict(x_)
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
plt.plot(x_[:, 1], y_, 'r-')
[<matplotlib.lines.Line2D at 0x10ad8d320>]
math_7th.plot(x='Spend.Per.ADA', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10b1e3748>
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th['Spend.Per.ADA'] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.064 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.047 |
Method: | Least Squares | F-statistic: | 3.691 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 0.0600 |
Time: | 22:58:09 | Log-Likelihood: | -220.37 |
No. Observations: | 56 | AIC: | 444.7 |
Df Residuals: | 54 | BIC: | 448.8 |
Df Model: | 1 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 336.2436 | 12.602 | 26.682 | 0.000 | 310.979 361.509 |
Spend.Per.ADA | 0.0026 | 0.001 | 1.921 | 0.060 | -0.000 0.005 |
Omnibus: | 0.563 | Durbin-Watson: | 2.081 |
---|---|---|---|
Prob(Omnibus): | 0.755 | Jarque-Bera (JB): | 0.697 |
Skew: | 0.142 | Prob(JB): | 0.706 |
Kurtosis: | 2.533 | Cond. No. | 6.92e+04 |
# Let's plot the regression line on top of the data
x_ = np.array([X.min(), X.max()])
y_ = est.predict(x_)
math_7th.plot(x='Spend.Per.ADA', y='Mean.Scale.Score', kind='scatter')
plt.plot(x_[:, 1], y_, 'r-')
[<matplotlib.lines.Line2D at 0x10b184a20>]
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th[['median.family.income', 'Spend.Per.ADA']] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.359 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.335 |
Method: | Least Squares | F-statistic: | 14.82 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 7.71e-06 |
Time: | 22:58:09 | Log-Likelihood: | -209.78 |
No. Observations: | 56 | AIC: | 425.6 |
Df Residuals: | 53 | BIC: | 431.6 |
Df Model: | 2 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 308.8693 | 11.900 | 25.955 | 0.000 | 285.000 332.738 |
median.family.income | 0.0004 | 8.56e-05 | 4.935 | 0.000 | 0.000 0.001 |
Spend.Per.ADA | 0.0025 | 0.001 | 2.239 | 0.029 | 0.000 0.005 |
Omnibus: | 0.769 | Durbin-Watson: | 1.849 |
---|---|---|---|
Prob(Omnibus): | 0.681 | Jarque-Bera (JB): | 0.865 |
Skew: | -0.189 | Prob(JB): | 0.649 |
Kurtosis: | 2.523 | Cond. No. | 5.82e+05 |
Now that you have completed the tutorial, you are ready to fly on your own... For the rest of the session, you are welcome to try out one of the following projects, and ask for help from the volunteers. The projects are roughly ordered by increasing difficulty level. Please be aware that the harder projects are very unlikely to be completed in an hour (unless you happen to be Jeff Dean).
If you are new to Pandas, going over the tutorial we have just completed at your own pace may already be enough of a challenge.
The data file we used for the tutorial had already been cleaned for you. The code to do that is explained in the appendix. Going over it is our second proposal.
There is a treasure trove of additional information breaking down test scores by interesting covariates like gender, ethnicity, parental education, and more.
This data is coded by using an additional column called “subgroup”. The key for what the subgroup numbers mean (for 2013) appears here:
http://star.cde.ca.gov/star2013/research_fixfileformat.aspx
(Scroll down, starting at “Gender”).
In the tutorial we created a graph showing English scores over time (grades 2-11). Using the subgroup data, can you make a similar chart, but with two lines, one for boys and one for girls? What other interesting insights can you glean from the subgroup data?
It’s SoCal vs NoCal... Every district has a latitude. Is latitude a useful predictor? Are there any differences between SoCal and NoCal schools? If you find an effect, can you demonstrate that it is valid even when you control for covariates such as income which might be correlated to latitude?
All of the data analyzed in the tutorial was at the county level. However, the actual STAR data is broken down to the district and school levels.
Do counties with smaller schools get better scores than counties with larger schools? Do counties with higher income tend to have smaller or larger schools? How would you measure school size within a county? Average? Median?
Can you find district level data for income or spending online? Can you merge it with the district level data for student scores?
If you can find (and process/merge) district level data for income or spending, when you see a relationship that seems to hold at the county level (e.g. income to test scores), does that hold uniformly within all the counties? Looking at districts within counties, are there some counties where the relationship holds and some where it doesn’t?
All of the data analyzed in the tutorial was from 2013. There were test scores from multiple grades, but each of those grades was tested in 2013. However, there is more data available at the STAR website. Specifically, there are 16 years worth of data on the STAR website, from 1998-2013. You can access this data at http://star.cde.ca.gov/.
Because this data is available, if you combine this data appropriately, you can construct true longitudinal data, rather than cross sectional data. For example, the 11th graders of 2013 were the 10th graders of 2012, the 9th graders of 2011, etc...
Let’s look at an example. In the tutorial we made a slide with English scores by grade. That data was cross sectional. However, we could make the same type of plot with multiple lines – one per cohort. For example, there could be a line that connects scores for the 2nd graders of 1998, the 3rd graders of 1999, the 4th graders of 2000, etc... Another line could connect the scores for the 2nd graders of 1999, the 3rd graders of 2000, the 4th graders of 2001, etc...
Looking at the resulting plot, with multiple lines for multiple cohorts, one could get a global picture of whether English education in California, as judged by test scores across all grades, has been getting better or worse over the last 16 years. Are there any identifiable trends? What about other tests?
Bonus question: imagine you had a thousand tests – can you devise an automated way to look for trends in data like this? Does your automated test catch the trends you identified “by inspection”?
Download the 2013 California Statewide research file, All Students, comma delimited file: ca2013_1_csv_v3.zip
Unzip its contents. You will now have two files: the test data, ca2013_1_csv_v3.txt, and some information on the names of counties, districts and schools, ca2013entities_csv.txt.
Let's load the test data and start playing with it:
raw_data = pd.read_csv('ca2013_1_csv_v3.txt')
# Let's list the columns in the loaded data frame
raw_data.columns
Index(['County Code', 'District Code', 'School Code', 'Charter Number', 'Test Year', 'Subgroup ID', 'Test Type', 'CAPA Assessment Level', 'Total STAR Enrollment', 'Total Tested At Entity Level', 'Total Tested At Subgroup Level', 'Grade', 'Test Id', 'STAR Reported Enrollment/CAPA Eligible', 'Students Tested', 'Percent Tested', 'Mean Scale Score', 'Percentage Advanced', 'Percentage Proficient', 'Percentage At Or Above Proficient', 'Percentage Basic', 'Percentage Below Basic', 'Percentage Far Below Basic', 'Students with Scores'], dtype='object')
raw_data = raw_data[['County Code', 'District Code', 'Grade', 'Test Id', 'Students Tested', 'Mean Scale Score']]
When 'District Code' is zero, that row holds data aggregated for all districts. To get data aggregated at the county level we need to get rid of the rows where 'County Code' is zero (those hold aggregated data for all of California), as well as all rows where 'District Code' is not zero.
raw_data = raw_data[(raw_data['County Code'] != 0) & (raw_data['District Code'] == 0)]
We can now drop the 'District Code' column, which we will not be needing any more.
del raw_data['District Code']
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | |
---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 |
197 | 1 | 2 | 8 | 17249 | 393.5 |
198 | 1 | 3 | 7 | 16276 | 357.2 |
199 | 1 | 3 | 8 | 16364 | 414.1 |
200 | 1 | 4 | 7 | 15936 | 385.9 |
entities = pd.read_csv('ca2013entities_csv.txt')
# Let's take a look at the columns here
entities.columns
Index(['County Code', 'District Code', 'School Code', 'Charter Number', 'Test Year', 'Type Id', 'County Name', 'District Name', 'School Name', 'Zip Code'], dtype='object')
# Let's keep what we need only
entities = entities[['County Code', 'District Code', 'County Name']]
# And let's get rid of the non county specific data
entities = entities[(entities['County Code'] != 0) & (entities['District Code'] == 0)]
entities[:5]
County Code | District Code | County Name | |
---|---|---|---|
1 | 1 | 0 | Alameda |
414 | 2 | 0 | Alpine |
420 | 3 | 0 | Amador |
438 | 4 | 0 | Butte |
548 | 5 | 0 | Calaveras |
# We can now get rid of the 'District Code' column
del entities['District Code']
# And make the 'County Code' column the index of the data frame
entities.set_index('County Code', inplace=True)
# Lastly, we use this as a LUT for the raw_data 'County Code'. Notice the use of
# .loc to ensure the indexing is label, not position, based
county_names = entities.loc[raw_data['County Code']]
# And we add this information as a new column...
county_names.index = raw_data.index
raw_data['County Name'] = county_names
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | |
---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda |
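The lookup pattern used above — index a small frame by the key column, pass the full key Series to `.loc`, then re-align the result's index with the target frame — generalizes to any lookup of this kind. A toy sketch with hypothetical names:

```python
import pandas as pd

# Small lookup table, indexed by its key column
lut = pd.DataFrame({'code': [1, 2],
                    'name': ['Alameda', 'Alpine']}).set_index('code')
keys = pd.Series([1, 1, 2])        # repeated keys, like 'County Code'
looked_up = lut['name'].loc[keys]  # one value per key, label-based lookup
looked_up.index = keys.index       # re-align with the target frame's index
looked_up.tolist()
```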
The 'Test Id' column identifies each test numerically, but we would rather have a descriptive name of the test. This information is a little hidden, but can be found in table C in the Research File Layout page of the STAR website. We need that table in a more readable format. You can find it as a CSV file in xxx, obtained by copy-pasting the HTML into a text editor, replacing tabs with commas, and enclosing in quotes the names that had commas in them.

test_name_lut = pd.read_csv('test_names.csv')
test_name_lut.set_index('Test ID', inplace=True)
test_names = test_name_lut.loc[raw_data['Test Id']]
test_names.index = raw_data.index
raw_data['Test Name'] = test_names
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | |
---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts |
Notice that 'Mean Scale Score' is not numeric. This is due to the use of * to indicate missing data. We could fix this by reloading the data using the na_values keyword, i.e. raw_data = pd.read_csv('ca2013_1_csv_v3.txt', na_values=['*']), but we can fix it without redoing all the previous work using the .convert_objects method.

raw_data['Mean Scale Score'] = raw_data['Mean Scale Score'].convert_objects(convert_numeric=True)
raw_data['Mean Scale Score'].dtype
dtype('float64')
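If you are following along with a newer pandas, note that `.convert_objects` has since been deprecated; the modern equivalent is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries such as `*` into NaN:

```python
import pandas as pd

s = pd.Series(['368.2', '*', '414.1'])       # '*' marks missing data
cleaned = pd.to_numeric(s, errors='coerce')  # '*' becomes NaN
cleaned.dtype
```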
Per-county income data is available in income.csv. We can similarly add the new columns, this time using the county name as the index for our lookup. To get this to work, we first have to get rid of the rows with 'County Code' set to 99 (the 'CA Education Authority' rows).

raw_data = raw_data[raw_data['County Code'] != 99]
income = pd.read_csv('income.csv')
income.set_index('County', inplace=True)
income = income.loc[raw_data['County Name']]
income.index = raw_data.index
raw_data = raw_data.join(income)
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | |
---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
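Note that `.join` aligns rows on the index, not on position; because we overwrote `income`'s index with `raw_data`'s, the join amounts to pasting the columns side by side. A minimal sketch showing the label alignment:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[10, 11])
right = pd.DataFrame({'b': [3, 4]}, index=[11, 10])  # deliberately reordered
joined = left.join(right)  # rows matched by index label, not position
joined['b'].tolist()
```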
Per-county expense data comes in currentexpense1213.xls, which we can load into Pandas with the read_excel command and some minimal tweaking.

expense_lut = pd.read_excel('currentexpense1213.xls',
                            sheetname='County Averages', skiprows=5)
expense_lut.set_index('County', inplace=True)
expenses = expense_lut.loc[raw_data['County Name']]
expenses.index = raw_data.index
raw_data = raw_data.join(expenses)
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | Expenditures (EDP 365) | Current Expense ADA | Current Expense Per ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
Finally, we add a 'Pct Test Grade' column like the one we used to compute the weighted average: for each row, it holds the percentage of the total population taking that test in that grade that the row represents.

grouped = raw_data.groupby(['Grade', 'Test Id'])['Students Tested']
raw_data['Pct Test Grade'] = grouped.apply(lambda x: x / x.sum())
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | Expenditures (EDP 365) | Current Expense ADA | Current Expense Per ADA | Pct Test Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036686 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036689 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036812 |
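The same per-group normalization can also be written with `.transform`, which is guaranteed to return a Series aligned with the original index — a small alternative sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'grade':  [2, 2, 3, 3],
                   'tested': [10, 30, 20, 20]})
# Per-group sums, broadcast back to one value per original row
totals = df.groupby('grade')['tested'].transform('sum')
df['pct'] = df['tested'] / totals
df['pct'].tolist()
```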
raw_data.rename(columns={'County Code': 'County.Code',
'Test Id': 'Test.Id',
'Students Tested': 'Students.Tested',
'Mean Scale Score': 'Mean.Scale.Score',
'County Name': 'County.Name',
'Test Name': 'Test.Name',
'Per capita income': 'per.capita.income',
'Median household income': 'median.household.income',
'Median family income': 'median.family.income',
'Expenditures (EDP 365)': 'Spend',
'Current Expense ADA': 'ADAttend',
'Current Expense Per ADA': 'Spend.Per.ADA',
'Pct Test Grade': 'Pct.Test.Grade'}, inplace=True)
raw_data[:5]
County.Code | Grade | Test.Id | Students.Tested | Mean.Scale.Score | County.Name | Test.Name | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | Pct.Test.Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036686 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036689 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036812 |
raw_data.to_csv('star_2013_cleaned_up.csv', index=False)