A Python version of Derren Barken's "R in a Hurry".
Prepared and presented by Jaime Fernandez (jaime.frio@gmail.com) at the 2015 San Diego Data Science Fun Conference.
The latest version of this notebook and accompanying data files can be found at https://github.com/jaimefrio/data_science_fun_conference_2015
Pandas (read the docs!) is the Python Data Analysis Library. It provides
"high-performance, easy-to-use data structures and data analysis tools for the Python programming language."
Before we get to use it, we need to import it.
import pandas as pd
And since we are at it let's import some more modules that we will be needing.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
Pandas has two main objects, the Series and the DataFrame.
Series Objects
A Series (read the docs!) is a glorified NumPy array:
a = pd.Series(np.random.randn(5), name='random numbers')
a
0 0.624188 1 0.818221 2 2.075770 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
# Series can be sliced
a[-3:]
2 2.075770 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
# Series can be indexed by position
a[3]
-0.98180410052244071
# Series can be indexed with arrays of indices
a[[1, 3, 2]]
1 0.818221 3 -0.981804 2 2.075770 Name: random numbers, dtype: float64
# Series can be indexed with boolean arrays
a[[True, False, False, True, True]]
0 0.624188 3 -0.981804 4 0.368029 Name: random numbers, dtype: float64
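In practice the boolean mask almost always comes from a comparison rather than a hand-typed list of True/False values. A minimal sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([0.5, -1.2, 2.0, -0.3, 0.9])
# Comparing a Series to a scalar yields a boolean Series of the same length,
# which we can use directly as a mask
positive = s[s > 0]
print(positive.index.tolist())  # the entries at positions 0, 2 and 4 survive
```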
# Series can have labels as indices
a.index = list('abcde') # list('abcde') --> ['a', 'b', 'c', 'd', 'e']
a
a 0.624188 b 0.818221 c 2.075770 d -0.981804 e 0.368029 Name: random numbers, dtype: float64
# Indexing with numbers (and slicing) still works
a[2]
2.0757697852523171
# But we can also index with labels...
a['c']
2.0757697852523171
# ...and arrays of labels
a[['c', 'b', 'a']]
c 2.075770 b 0.818221 a 0.624188 Name: random numbers, dtype: float64
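Once a Series has labels, plain integer indexing like a[2] is ambiguous (is 2 a label or a position?). Pandas provides explicit accessors to remove the ambiguity: .loc for labels and .iloc for positions. A small sketch with a hand-built Series:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=list('abc'))
by_label = s.loc['c']     # label-based: always means the label 'c'
by_position = s.iloc[2]   # position-based: always means the third element
```

In recent pandas versions, positional integer indexing on a labeled Series via s[2] is deprecated, so the explicit accessors are the safer habit.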
# Operations on Series align on the index (i)
a[-3:]
c 2.075770 d -0.981804 e 0.368029 Name: random numbers, dtype: float64
# Operations on Series align on the index (ii)
a[:3]
a 0.624188 b 0.818221 c 2.075770 Name: random numbers, dtype: float64
# Operations on Series align on the index (iii)
a[-3:] + a[:3]
a NaN b NaN c 4.15154 d NaN e NaN Name: random numbers, dtype: float64
# Compare to the same result for the underlying numpy array
a.values[-3:] + a.values[:3]
array([ 2.69995774, -0.16358345, 2.44379836])
# A Series makes a great look-up table
gp_lut = pd.Series(data=[4, 3, 2, 1, 0], index=list('ABCDF'))
gp_lut
A 4 B 3 C 2 D 1 F 0 dtype: int64
my_grades = ['A', 'B', 'B', 'A', 'A']
my_points = gp_lut[my_grades]
my_points
A 4 B 3 B 3 A 4 A 4 dtype: int64
# Series have lots of numerical evaluation methods
my_gpa = my_points.mean()
my_gpa
3.6000000000000001
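The same look-up can be written with Series.map, which accepts a dict or a Series and keeps the original index of the values being mapped. A sketch reproducing the GPA calculation:

```python
import pandas as pd

gp_lut = pd.Series([4, 3, 2, 1, 0], index=list('ABCDF'))
grades = pd.Series(['A', 'B', 'B', 'A', 'A'])
points = grades.map(gp_lut)  # look up each grade; index stays 0..4
gpa = points.mean()
```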
DataFrame Objects
A DataFrame (read the docs!) is a two-dimensional table of heterogeneous data. It is conceptually very similar to a dictionary of Series objects with a common index.
students = pd.DataFrame({'name' : ['Peter', 'Paul', 'Mary',
'Peter', 'Paul', 'Mary'],
'subject' : ['English', 'English', 'English',
'Math', 'Math', 'Math'],
'grade' : [85.0, 76.0, 92.0, 77.0, 68.0, 87.0]})
students
grade | name | subject | |
---|---|---|---|
0 | 85 | Peter | English |
1 | 76 | Paul | English |
2 | 92 | Mary | English |
3 | 77 | Peter | Math |
4 | 68 | Paul | Math |
5 | 87 | Mary | Math |
# We can get the column names
students.columns
Index(['grade', 'name', 'subject'], dtype='object')
# We can get a single Series indexing by column name...
students['name']
0 Peter 1 Paul 2 Mary 3 Peter 4 Paul 5 Mary Name: name, dtype: object
# ...or as an attribute
students.name
0 Peter 1 Paul 2 Mary 3 Peter 4 Paul 5 Mary Name: name, dtype: object
# We can get a subset of columns indexing with an array of labels
students[['subject', 'grade']]
subject | grade | |
---|---|---|
0 | English | 85 |
1 | English | 76 |
2 | English | 92 |
3 | Math | 77 |
4 | Math | 68 |
5 | Math | 87 |
# To get a subset of rows, slicing and indexing still work
students[-3:]
grade | name | subject | |
---|---|---|---|
3 | 77 | Peter | Math |
4 | 68 | Paul | Math |
5 | 87 | Mary | Math |
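As with Series, the .iloc accessor makes position-based row selection explicit, which reads more clearly than a bare slice. A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
last_three = df.iloc[-3:]  # rows at positions 2, 3 and 4
```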
# But we can give boolean indexing an interesting twist
students[students.name == 'Mary']
grade | name | subject | |
---|---|---|---|
2 | 92 | Mary | English |
5 | 87 | Mary | Math |
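Multiple conditions combine with & (and) and | (or). The parentheses around each comparison are required, because & binds more tightly than == in Python. A sketch on a small hand-built frame:

```python
import pandas as pd

students = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary', 'Peter'],
                         'subject': ['English', 'Math', 'Math', 'Math'],
                         'grade': [85.0, 68.0, 87.0, 77.0]})
# Math rows with a grade of at least 80
good_math = students[(students.subject == 'Math') & (students.grade >= 80)]
```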
But enough of pandas in the void, let's look at our dataset.
STAR is California's Standardized Testing and Reporting Program. You can get more information from the STAR website or from Wikipedia.
We are specifically going to look at the 2013 data. You can download the raw data for this tutorial from here:
http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_1_csv_v3.zip
Getting that data to a usable state, augmenting it with additional information like income or spending per student by county, is far from trivial. If we were to do that now, it would consume the whole tutorial. So we have done the ugly job for you, and you can download the clean version of the data from here:
This cleaning up of the data, munging as data scientists refer to it, is an extremely important part of a data scientist's job. The process we followed to clean up this data is covered in the appendix. Not as sexy as fitting models to the data, but equally important.
But enough talking, let's start having fun with the cleaned-up data...
Importing a CSV file couldn't be easier with pandas' read_csv (read the docs!) function. It has dozens of optional configuration parameters, but the defaults work like a charm on properly formatted data.
data = pd.read_csv('star_2013_clean_wide.csv')
data
County.Code | County.Name | Test.Id | Test.Name | Grade | Students.Tested | Mean.Scale.Score | Count.Test.Grade | Pct.Test.Grade | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Alameda | 7 | CST English-Language Arts | 2 | 16814 | 372.1 | 464896 | 0.036167 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
1 | 1 | Alameda | 8 | CST Mathematics | 2 | 16802 | 398.4 | 464515 | 0.036171 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
2 | 1 | Alameda | 8 | CST Mathematics | 3 | 16198 | 411.1 | 443961 | 0.036485 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
3 | 1 | Alameda | 7 | CST English-Language Arts | 3 | 16126 | 357.2 | 441572 | 0.036520 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
4 | 1 | Alameda | 7 | CST English-Language Arts | 4 | 15390 | 387.0 | 428906 | 0.035882 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
5 | 1 | Alameda | 8 | CST Mathematics | 4 | 15532 | 405.8 | 433012 | 0.035870 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
6 | 1 | Alameda | 8 | CST Mathematics | 5 | 15146 | 411.6 | 432775 | 0.034997 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
7 | 1 | Alameda | 7 | CST English-Language Arts | 5 | 15077 | 378.4 | 429498 | 0.035104 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
8 | 1 | Alameda | 32 | CST Science - Grade 5, Grade 8, and Grade 10 L... | 5 | 15107 | 380.6 | 431142 | 0.035040 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
9 | 1 | Alameda | 7 | CST English-Language Arts | 6 | 15277 | 374.9 | 434374 | 0.035170 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
10 | 1 | Alameda | 8 | CST Mathematics | 6 | 15332 | 380.3 | 436563 | 0.035120 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
11 | 1 | Alameda | 7 | CST English-Language Arts | 7 | 14551 | 378.4 | 431187 | 0.033746 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
12 | 1 | Alameda | 9 | CST Algebra I | 7 | 2239 | 431.6 | 37803 | 0.059228 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
13 | 1 | Alameda | 8 | CST Mathematics | 7 | 12326 | 370.2 | 393811 | 0.031299 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
14 | 1 | Alameda | 9 | CST Algebra I | 8 | 11006 | 354.2 | 276039 | 0.039871 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
15 | 1 | Alameda | 13 | CST Algebra II | 8 | 31 | 458.9 | 680 | 0.045588 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
16 | 1 | Alameda | 7 | CST English-Language Arts | 8 | 14549 | 373.9 | 435491 | 0.033408 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
17 | 1 | Alameda | 29 | CST History - Social Science Grade 8 | 8 | 15421 | 363.3 | 459125 | 0.033588 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
18 | 1 | Alameda | 28 | CST General Mathematics | 8 | 2143 | 308.4 | 145549 | 0.014724 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
19 | 1 | Alameda | 11 | CST Geometry | 8 | 1732 | 428.0 | 29035 | 0.059652 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
20 | 1 | Alameda | 32 | CST Science - Grade 5, Grade 8, and Grade 10 L... | 8 | 14552 | 404.5 | 436071 | 0.033371 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
21 | 1 | Alameda | 10 | CST Integrated Math 1 | 9 | 121 | 298.8 | 980 | 0.123469 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
22 | 1 | Alameda | 20 | CST Biology | 9 | 9473 | 377.0 | 228962 | 0.041374 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
23 | 1 | Alameda | 21 | CST Chemistry | 9 | 186 | 364.3 | 5365 | 0.034669 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
24 | 1 | Alameda | 24 | CST Integrated/Coordinated Science 1 | 9 | 905 | 321.4 | 33944 | 0.026662 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
25 | 1 | Alameda | 22 | CST Earth Science | 9 | 1651 | 328.8 | 133961 | 0.012324 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
26 | 1 | Alameda | 23 | CST Physics | 9 | 754 | 312.8 | 14112 | 0.053430 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
27 | 1 | Alameda | 25 | CST Integrated/Coordinated Science 2 | 9 | 124 | 323.0 | 1351 | 0.091784 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
28 | 1 | Alameda | 18 | CST World History | 9 | 597 | 320.1 | 35087 | 0.017015 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
29 | 1 | Alameda | 11 | CST Geometry | 9 | 6613 | 353.4 | 144063 | 0.045904 | 1494876 | 34937 | 70821 | 87012 | 1814932885 | 193906 | 9360 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3700 | 58 | Yuba | 22 | CST Earth Science | 10 | 290 | 327.0 | 30649 | 0.009462 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3701 | 58 | Yuba | 15 | CST Summative High School Mathematics | 10 | 19 | 353.3 | 24757 | 0.000767 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3702 | 58 | Yuba | 21 | CST Chemistry | 10 | 166 | 358.8 | 136209 | 0.001219 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3703 | 58 | Yuba | 13 | CST Algebra II | 10 | 201 | 335.9 | 131448 | 0.001529 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3704 | 58 | Yuba | 9 | CST Algebra I | 10 | 259 | 285.9 | 104567 | 0.002477 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3705 | 58 | Yuba | 7 | CST English-Language Arts | 10 | 931 | 332.0 | 455362 | 0.002045 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3706 | 58 | Yuba | 20 | CST Biology | 10 | 238 | 343.3 | 228138 | 0.001043 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3707 | 58 | Yuba | 11 | CST Geometry | 10 | 274 | 296.1 | 156167 | 0.001755 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3708 | 58 | Yuba | 11 | CST Geometry | 11 | 151 | 287.1 | 78308 | 0.001928 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3709 | 58 | Yuba | 19 | CST U.S. History | 11 | 895 | 325.7 | 447386 | 0.002001 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3710 | 58 | Yuba | 20 | CST Biology | 11 | 186 | 330.3 | 99776 | 0.001864 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3711 | 58 | Yuba | 7 | CST English-Language Arts | 11 | 899 | 324.9 | 440115 | 0.002043 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3712 | 58 | Yuba | 9 | CST Algebra I | 11 | 187 | 281.3 | 49868 | 0.003750 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3713 | 58 | Yuba | 13 | CST Algebra II | 11 | 199 | 302.4 | 122079 | 0.001630 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3714 | 58 | Yuba | 21 | CST Chemistry | 11 | 121 | 342.4 | 133804 | 0.000904 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3715 | 58 | Yuba | 15 | CST Summative High School Mathematics | 11 | 153 | 322.6 | 124304 | 0.001231 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3716 | 58 | Yuba | 22 | CST Earth Science | 11 | 187 | 326.8 | 42331 | 0.004418 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3717 | 58 | Yuba | 23 | CST Physics | 11 | 91 | 363.8 | 56726 | 0.001604 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3718 | 58 | Yuba | 18 | CST World History | 11 | 40 | 300.3 | 16163 | 0.002475 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3719 | 58 | Yuba | 18 | CST World History | 13 | 874 | 329.1 | 474255 | 0.001843 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3720 | 58 | Yuba | 25 | CST Integrated/Coordinated Science 2 | 13 | 14 | 302.3 | 3742 | 0.003741 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3721 | 58 | Yuba | 20 | CST Biology | 13 | 876 | 340.6 | 556893 | 0.001573 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3722 | 58 | Yuba | 22 | CST Earth Science | 13 | 875 | 328.9 | 206991 | 0.004227 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3723 | 58 | Yuba | 23 | CST Physics | 13 | 92 | 363.6 | 80833 | 0.001138 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3724 | 58 | Yuba | 28 | CST General Mathematics | 13 | 585 | 299.5 | 190615 | 0.003069 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3725 | 58 | Yuba | 13 | CST Algebra II | 13 | 431 | 320.6 | 286737 | 0.001503 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3726 | 58 | Yuba | 21 | CST Chemistry | 13 | 287 | 351.9 | 275452 | 0.001042 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3727 | 58 | Yuba | 15 | CST Summative High School Mathematics | 13 | 172 | 326.0 | 149987 | 0.001147 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3728 | 58 | Yuba | 11 | CST Geometry | 13 | 695 | 307.8 | 407658 | 0.001705 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3729 | 58 | Yuba | 9 | CST Algebra I | 13 | 1495 | 319.4 | 711705 | 0.002101 | 71817 | 20046 | 46617 | 52775 | 99237214 | 11710 | 8474 |
3730 rows × 16 columns
# A closer look at the available columns
data.columns
Index(['County.Code', 'County.Name', 'Test.Id', 'Test.Name', 'Grade', 'Students.Tested', 'Mean.Scale.Score', 'Count.Test.Grade', 'Pct.Test.Grade', 'Population', 'per.capita.income', 'median.household.income', 'median.family.income', 'Spend', 'ADAttend', 'Spend.Per.ADA'], dtype='object')
# What are the unique test names?
data['Test.Name'].unique()
array(['CST English-Language Arts', 'CST Mathematics', 'CST Science - Grade 5, Grade 8, and Grade 10 Life Science', 'CST Algebra I', 'CST Algebra II', 'CST History - Social Science Grade 8', 'CST General Mathematics', 'CST Geometry', 'CST Integrated Math 1', 'CST Biology', 'CST Chemistry', 'CST Integrated/Coordinated Science 1', 'CST Earth Science', 'CST Physics', 'CST Integrated/Coordinated Science 2', 'CST World History', 'CST Summative High School Mathematics', 'CST Integrated Math 2', 'CST U.S. History', 'CST Integrated/Coordinated Science 4', 'CST Integrated Math 3', 'CST Integrated/Coordinated Science 3'], dtype=object)
Which introduces us to the pivot_table (read the docs!) method of DataFrames.
data.pivot_table(values='Students.Tested', index='Test.Name',
columns='Grade', aggfunc='sum')
Grade | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|
Test.Name | |||||||||||
CST Algebra I | NaN | NaN | NaN | NaN | NaN | 37803 | 276039 | 243349 | 104555 | 49846 | 711671 |
CST Algebra II | NaN | NaN | NaN | NaN | NaN | NaN | 680 | 32400 | 131448 | 122079 | 286737 |
CST Biology | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 228962 | 228138 | 99753 | 556867 |
CST Chemistry | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5365 | 136209 | 133804 | 275452 |
CST Earth Science | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 133961 | 30649 | 42316 | 206974 |
CST English-Language Arts | 464896 | 441572 | 428906 | 429498 | 434374 | 431187 | 435491 | 463195 | 455237 | 439972 | NaN |
CST General Mathematics | NaN | NaN | NaN | NaN | NaN | NaN | 145549 | 45052 | NaN | NaN | 190615 |
CST Geometry | NaN | NaN | NaN | NaN | NaN | NaN | 29035 | 144063 | 156167 | 78308 | 407658 |
CST History - Social Science Grade 8 | NaN | NaN | NaN | NaN | NaN | NaN | 459125 | NaN | NaN | NaN | NaN |
CST Integrated Math 1 | NaN | NaN | NaN | NaN | NaN | NaN | 125 | 980 | 5965 | 6318 | 13486 |
CST Integrated Math 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 236 | 1432 | 3364 | 5102 |
CST Integrated Math 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 | 153 | 530 | 769 |
CST Integrated/Coordinated Science 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 33944 | 4401 | 7963 | 46386 |
CST Integrated/Coordinated Science 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1351 | 1413 | 955 | 3742 |
CST Integrated/Coordinated Science 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 11 | 46 | 874 | 952 |
CST Integrated/Coordinated Science 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 18 | NaN | NaN | 43 |
CST Mathematics | 464515 | 443961 | 433012 | 432775 | 436563 | 393811 | NaN | NaN | NaN | NaN | NaN |
CST Physics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14112 | 9902 | 56726 | 80833 |
CST Science - Grade 5, Grade 8, and Grade 10 Life Science | NaN | NaN | NaN | 431142 | NaN | NaN | 436071 | NaN | 451253 | NaN | NaN |
CST Summative High School Mathematics | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 809 | 24757 | 124304 | 149987 |
CST U.S. History | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 447253 | NaN |
CST World History | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 35087 | 422893 | 16163 | 474237 |
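The mechanics of pivot_table are easier to see on a tiny synthetic frame (the column names below are made up for illustration): values picks the column to aggregate, index and columns pick the row and column labels of the result, and aggfunc says how to combine rows that land in the same cell.

```python
import pandas as pd

scores = pd.DataFrame({'test': ['ELA', 'ELA', 'Math', 'Math', 'Math'],
                       'grade': [2, 3, 2, 3, 3],
                       'tested': [100, 120, 90, 60, 50]})
table = scores.pivot_table(values='tested', index='test',
                           columns='grade', aggfunc='sum')
# The two Math/grade-3 rows are summed into one cell: 60 + 50 = 110
```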
We are going to use a very powerful feature of DataFrames, the groupby (read the docs!) method.
The section of the docs on the split-apply-combine philosophy of Group By operations (read the docs!) is well worth a careful read.
# First, keep the English test rows only (.copy() avoids a SettingWithCopyWarning
# when we add a column to this subset below)
english = data[data['Test.Name'] == 'CST English-Language Arts'].copy()
english_grade = english.groupby('Grade')['Mean.Scale.Score'].aggregate(np.mean)
english_grade
Grade 2 355.771930 3 343.361404 4 370.684211 5 363.074138 6 361.592982 7 366.333333 8 362.998276 9 360.000000 10 344.621053 11 340.910526 Name: Mean.Scale.Score, dtype: float64
english_grade.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x109dc0400>
Wait a minute! Was that calculation correct?
Not really! What we have is the average across counties, not across students. Let's try to fix that.
# We add a new calculated column to our dataframe
english['Weighted.Score'] = english['Mean.Scale.Score'] * english['Pct.Test.Grade']
# Then group by grade and plot the aggregated sum of this column
english.groupby('Grade')['Weighted.Score'].aggregate(np.sum).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10a1c0f60>
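An equivalent way to get a student-weighted mean per group is np.average with weights inside a groupby apply. A sketch on a toy frame whose column names mirror the real file:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Grade': [2, 2, 3, 3],
                   'Mean.Scale.Score': [350.0, 370.0, 340.0, 360.0],
                   'Students.Tested': [100, 300, 200, 200]})
# For each grade, weight each county's mean score by how many students it tested
weighted = df.groupby('Grade')[['Mean.Scale.Score', 'Students.Tested']].apply(
    lambda g: np.average(g['Mean.Scale.Score'], weights=g['Students.Tested']))
# grade 2: (350*100 + 370*300) / 400 = 365.0; grade 3: 350.0
```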
# Keep the relevant rows only
math_7th = data[(data['Test.Name'] == 'CST Mathematics') &
(data['Grade'] == 7)]
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10a2d6940>
What's with the outlier? Let's try to figure it out...
# Let's peek at the offending row
math_7th[math_7th['Mean.Scale.Score'] < 300]
County.Code | County.Name | Test.Id | Test.Name | Grade | Students.Tested | Mean.Scale.Score | Count.Test.Grade | Pct.Test.Grade | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2918 | 46 | Sierra | 8 | CST Mathematics | 7 | 27 | 259.5 | 393811 | 0.000069 | 3277 | 26137 | 50308 | 56469 | 4739373 | 354 | 13391 |
Only 27 students tested, that doesn't seem like very many. Let's see how it compares to other counties.
math_7th.plot(x='Students.Tested', y='Mean.Scale.Score',
kind='scatter', logx=True)
<matplotlib.axes._subplots.AxesSubplot at 0x10a231710>
So Sierra seems to be a legit outlier. Let's remove it from our analysis.
math_7th = math_7th[math_7th['County.Name'] != 'Sierra']
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10a8ed9b0>
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th['median.family.income'] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.298 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.285 |
Method: | Least Squares | F-statistic: | 22.93 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 1.35e-05 |
Time: | 22:58:08 | Log-Likelihood: | -212.31 |
No. Observations: | 56 | AIC: | 428.6 |
Df Residuals: | 54 | BIC: | 432.7 |
Df Model: | 1 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 332.0775 | 6.059 | 54.808 | 0.000 | 319.930 344.225 |
median.family.income | 0.0004 | 8.87e-05 | 4.788 | 0.000 | 0.000 0.001 |
Omnibus: | 0.749 | Durbin-Watson: | 1.990 |
---|---|---|---|
Prob(Omnibus): | 0.688 | Jarque-Bera (JB): | 0.280 |
Skew: | 0.142 | Prob(JB): | 0.869 |
Kurtosis: | 3.197 | Cond. No. | 2.84e+05 |
# Let's plot the regression line on top of the data
x_ = np.array([X.min(), X.max()])
y_ = est.predict(x_)
math_7th.plot(x='median.family.income', y='Mean.Scale.Score', kind='scatter')
plt.plot(x_[:, 1], y_, 'r-')
[<matplotlib.lines.Line2D at 0x10ad8d320>]
math_7th.plot(x='Spend.Per.ADA', y='Mean.Scale.Score', kind='scatter')
<matplotlib.axes._subplots.AxesSubplot at 0x10b1e3748>
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th['Spend.Per.ADA'] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.064 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.047 |
Method: | Least Squares | F-statistic: | 3.691 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 0.0600 |
Time: | 22:58:09 | Log-Likelihood: | -220.37 |
No. Observations: | 56 | AIC: | 444.7 |
Df Residuals: | 54 | BIC: | 448.8 |
Df Model: | 1 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 336.2436 | 12.602 | 26.682 | 0.000 | 310.979 361.509 |
Spend.Per.ADA | 0.0026 | 0.001 | 1.921 | 0.060 | -0.000 0.005 |
Omnibus: | 0.563 | Durbin-Watson: | 2.081 |
---|---|---|---|
Prob(Omnibus): | 0.755 | Jarque-Bera (JB): | 0.697 |
Skew: | 0.142 | Prob(JB): | 0.706 |
Kurtosis: | 2.533 | Cond. No. | 6.92e+04 |
# Let's plot the regression line on top of the data
x_ = np.array([X.min(), X.max()])
y_ = est.predict(x_)
math_7th.plot(x='Spend.Per.ADA', y='Mean.Scale.Score', kind='scatter')
plt.plot(x_[:, 1], y_, 'r-')
[<matplotlib.lines.Line2D at 0x10b184a20>]
# Let's fit a linear model to our data
y = math_7th['Mean.Scale.Score'] # response
X = math_7th[['median.family.income', 'Spend.Per.ADA']] # predictor
X = sm.add_constant(X) # Add a constant term to the predictor
# The actual fitting happens here
est = sm.OLS(y, X)
est = est.fit()
est.summary()
Dep. Variable: | Mean.Scale.Score | R-squared: | 0.359 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.335 |
Method: | Least Squares | F-statistic: | 14.82 |
Date: | Thu, 26 Feb 2015 | Prob (F-statistic): | 7.71e-06 |
Time: | 22:58:09 | Log-Likelihood: | -209.78 |
No. Observations: | 56 | AIC: | 425.6 |
Df Residuals: | 53 | BIC: | 431.6 |
Df Model: | 2 |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
const | 308.8693 | 11.900 | 25.955 | 0.000 | 285.000 332.738 |
median.family.income | 0.0004 | 8.56e-05 | 4.935 | 0.000 | 0.000 0.001 |
Spend.Per.ADA | 0.0025 | 0.001 | 2.239 | 0.029 | 0.000 0.005 |
Omnibus: | 0.769 | Durbin-Watson: | 1.849 |
---|---|---|---|
Prob(Omnibus): | 0.681 | Jarque-Bera (JB): | 0.865 |
Skew: | -0.189 | Prob(JB): | 0.649 |
Kurtosis: | 2.523 | Cond. No. | 5.82e+05 |
Now that you have completed the tutorial, you are ready to fly on your own... For the rest of the session, you are welcome to try out one of the following projects, and ask for help from the volunteers. The projects are roughly ordered by increasing difficulty level. Please be aware that the harder projects are very unlikely to be completed in an hour (unless you happen to be Jeff Dean).
If you are new to Pandas, going over the tutorial we have just completed at your own pace may already be enough of a challenge.
The data file we used for the tutorial had already been cleaned for you. The code to do that is explained in the appendix. Going over it is our second proposal.
There is a treasure trove of additional information breaking down test scores by interesting covariates like gender, ethnicity, parental education, and more.
This data is coded by using an additional column called “subgroup”. The key for what the subgroup numbers mean (for 2013) appears here:
http://star.cde.ca.gov/star2013/research_fixfileformat.aspx
(Scroll down, starting at “Gender”).
In the tutorial we created a graph showing English scores over time (grades 2-11). Using the subgroup data, can you make a similar chart, but with two lines, one for boys and one for girls? What other interesting insights can you glean from the subgroup data?
It’s SoCal vs NoCal... Every district has a latitude. Is latitude a useful predictor? Are there any differences between SoCal and NoCal schools? If you find an effect, can you demonstrate that it is valid even when you control for covariates such as income which might be correlated to latitude?
All of the data analyzed in the tutorial was at the county level. However, the actual STAR data is broken down to the district and school levels.
Do counties with smaller schools get better scores than counties with larger schools? Do counties with higher income tend to have smaller or larger schools? How would you measure school size within a county? Average? Median?
Can you find district level data for income or spending online? Can you merge it with the district level data for student scores?
If you can find (and process/merge) district level data for income or spending, when you see a relationship that seems to hold at the county level (e.g. income to test scores), does that hold uniformly within all the counties? Looking at districts within counties, are there some counties where the relationship holds and some where it doesn’t?
All of the data analyzed in the tutorial was from 2013. There were test scores from multiple grades, but each of those grades was tested in 2013. However, there is more data available at the STAR website. Specifically, there are 16 years worth of data on the STAR website, from 1998-2013. You can access this data at http://star.cde.ca.gov/.
Because this data is available, if you combine this data appropriately, you can construct true longitudinal data, rather than cross sectional data. For example, the 11th graders of 2013 were the 10th graders of 2012, the 9th graders of 2011, etc...
Let’s look at an example. In the tutorial we made a slide with English scores by grade. That data was cross sectional. However, we could make the same type of plot with multiple lines – one per cohort. For example, there could be a line that connects scores for the 2nd graders of 1998, the 3rd graders of 1999, the 4th graders of 2000, etc... Another line could connect the scores for the 2nd graders of 1999, the 3rd graders of 2000, the 4th graders of 2001, etc...
Looking at the resulting plot, with multiple lines for multiple cohorts, one could get a global picture of whether English education in California, as judged by test scores across all grades, has been getting better or worse over the last 16 years. Are there any identifiable trends? What about other tests?
Bonus question: imagine you had a thousand tests – can you devise an automated way to look for trends in data like this? Does your automated test catch the trends you identified “by inspection”?
Download the 2013 California Statewide research file, All Students, comma delimited file: ca2013_1_csv_v3.zip
Unzip its contents. You will now have two files: the test data, ca2013_1_csv_v3.txt, and some information on the names of counties, districts and schools, ca2013entities_csv.txt.
Let's load the test data and start playing with it:
raw_data = pd.read_csv('ca2013_1_csv_v3.txt')
# Let's list the columns in the loaded data frame
raw_data.columns
Index(['County Code', 'District Code', 'School Code', 'Charter Number', 'Test Year', 'Subgroup ID', 'Test Type', 'CAPA Assessment Level', 'Total STAR Enrollment', 'Total Tested At Entity Level', 'Total Tested At Subgroup Level', 'Grade', 'Test Id', 'STAR Reported Enrollment/CAPA Eligible', 'Students Tested', 'Percent Tested', 'Mean Scale Score', 'Percentage Advanced', 'Percentage Proficient', 'Percentage At Or Above Proficient', 'Percentage Basic', 'Percentage Below Basic', 'Percentage Far Below Basic', 'Students with Scores'], dtype='object')
raw_data = raw_data[['County Code', 'District Code', 'Grade', 'Test Id', 'Students Tested', 'Mean Scale Score']]
When 'District Code' is zero, that row holds data aggregated for all districts. To get data aggregated at the county level we need to get rid of the rows where 'County Code' is zero (those hold aggregated data for all of California), as well as all rows where 'District Code' is not zero.
raw_data = raw_data[(raw_data['County Code'] != 0) & (raw_data['District Code'] == 0)]
We can now drop the 'District Code' column, which we will not be needing any more.
del raw_data['District Code']
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | |
---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 |
197 | 1 | 2 | 8 | 17249 | 393.5 |
198 | 1 | 3 | 7 | 16276 | 357.2 |
199 | 1 | 3 | 8 | 16364 | 414.1 |
200 | 1 | 4 | 7 | 15936 | 385.9 |
entities = pd.read_csv('ca2013entities_csv.txt')
# Let's take a look at the columns here
entities.columns
Index(['County Code', 'District Code', 'School Code', 'Charter Number', 'Test Year', 'Type Id', 'County Name', 'District Name', 'School Name', 'Zip Code'], dtype='object')
# Let's keep what we need only
entities = entities[['County Code', 'District Code', 'County Name']]
# And let's get rid of the non county specific data
entities = entities[(entities['County Code'] != 0) & (entities['District Code'] == 0)]
entities[:5]
County Code | District Code | County Name | |
---|---|---|---|
1 | 1 | 0 | Alameda |
414 | 2 | 0 | Alpine |
420 | 3 | 0 | Amador |
438 | 4 | 0 | Butte |
548 | 5 | 0 | Calaveras |
# We can now get rid of the 'District Code' column
del entities['District Code']
# And make the 'County Code' column the index of the data frame
entities.set_index('County Code', inplace=True)
# Lastly, we use this as a LUT for the raw_data 'County Code'. Notice the use of
# .loc to ensure the indexing is label, not position, based
county_names = entities.loc[raw_data['County Code']]
# And we add this information as a new column...
county_names.index = raw_data.index
raw_data['County Name'] = county_names
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | |
---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda |
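The lookup pattern used above — index a small frame by the key column, pass the full key Series to `.loc`, then re-align the result's index with the target frame — generalizes to any lookup of this kind. A toy sketch with hypothetical names:

```python
import pandas as pd

# Small lookup table, indexed by its key column
lut = pd.DataFrame({'code': [1, 2],
                    'name': ['Alameda', 'Alpine']}).set_index('code')
keys = pd.Series([1, 1, 2])        # repeated keys, like 'County Code'
looked_up = lut['name'].loc[keys]  # one value per key, label-based lookup
looked_up.index = keys.index       # re-align with the target frame's index
looked_up.tolist()
```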
The 'Test Id' column identifies each test numerically, but we would rather have a descriptive name of the test. This information is a little hidden, but can be found in table C in the Research File Layout page of the STAR website. We need that table in a more readable format. You can find it as a CSV file in xxx, obtained by copy-pasting the HTML into a text editor, replacing tabs with commas, and enclosing in quotes the names that had commas in them.

test_name_lut = pd.read_csv('test_names.csv')
test_name_lut.set_index('Test ID', inplace=True)
test_names = test_name_lut.loc[raw_data['Test Id']]
test_names.index = raw_data.index
raw_data['Test Name'] = test_names
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | |
---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts |
Notice that 'Mean Scale Score' is not numeric. This is due to the use of * to indicate missing data. We could fix this by reloading the data using the na_values keyword, i.e. raw_data = pd.read_csv('ca2013_1_csv_v3.txt', na_values=['*']), but we can fix it without redoing all the previous work using the .convert_objects method.

raw_data['Mean Scale Score'] = raw_data['Mean Scale Score'].convert_objects(convert_numeric=True)
raw_data['Mean Scale Score'].dtype
dtype('float64')
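If you are following along with a newer pandas, note that `.convert_objects` has since been deprecated; the modern equivalent is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries such as `*` into NaN:

```python
import pandas as pd

s = pd.Series(['368.2', '*', '414.1'])       # '*' marks missing data
cleaned = pd.to_numeric(s, errors='coerce')  # '*' becomes NaN
cleaned.dtype
```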
Per-county income data is available in income.csv. We can similarly add the new columns, this time using the county name as the index for our lookup. To get this to work, we first have to get rid of the rows with 'County Code' set to 99 (the 'CA Education Authority' rows).

raw_data = raw_data[raw_data['County Code'] != 99]
income = pd.read_csv('income.csv')
income.set_index('County', inplace=True)
income = income.loc[raw_data['County Name']]
income.index = raw_data.index
raw_data = raw_data.join(income)
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | |
---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 |
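Note that `.join` aligns rows on the index, not on position; because we overwrote `income`'s index with `raw_data`'s, the join amounts to pasting the columns side by side. A minimal sketch showing the label alignment:

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[10, 11])
right = pd.DataFrame({'b': [3, 4]}, index=[11, 10])  # deliberately reordered
joined = left.join(right)  # rows matched by index label, not position
joined['b'].tolist()
```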
Per-county expense data comes in currentexpense1213.xls, which we can load into Pandas with the read_excel command and some minimal tweaking.

expense_lut = pd.read_excel('currentexpense1213.xls',
                            sheetname='County Averages', skiprows=5)
expense_lut.set_index('County', inplace=True)
expenses = expense_lut.loc[raw_data['County Name']]
expenses.index = raw_data.index
raw_data = raw_data.join(expenses)
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | Expenditures (EDP 365) | Current Expense ADA | Current Expense Per ADA | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 |
Finally, we add a 'Pct Test Grade' column like the one we used to compute the weighted average: for each row, it holds the percentage of the total population taking that test in that grade that the row represents.

grouped = raw_data.groupby(['Grade', 'Test Id'])['Students Tested']
raw_data['Pct Test Grade'] = grouped.apply(lambda x: x / x.sum())
raw_data[:5]
County Code | Grade | Test Id | Students Tested | Mean Scale Score | County Name | Test Name | Population | Per capita income | Median household income | Median family income | Expenditures (EDP 365) | Current Expense ADA | Current Expense Per ADA | Pct Test Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036686 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036689 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036812 |
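The same per-group normalization can also be written with `.transform`, which is guaranteed to return a Series aligned with the original index — a small alternative sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'grade':  [2, 2, 3, 3],
                   'tested': [10, 30, 20, 20]})
# Per-group sums, broadcast back to one value per original row
totals = df.groupby('grade')['tested'].transform('sum')
df['pct'] = df['tested'] / totals
df['pct'].tolist()
```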
raw_data.rename(columns={'County Code': 'County.Code',
'Test Id': 'Test.Id',
'Students Tested': 'Students.Tested',
'Mean Scale Score': 'Mean.Scale.Score',
'County Name': 'County.Name',
'Test Name': 'Test.Name',
'Per capita income': 'per.capita.income',
'Median household income': 'median.household.income',
'Median family income': 'median.family.income',
'Expenditures (EDP 365)': 'Spend',
'Current Expense ADA': 'ADAttend',
'Current Expense Per ADA': 'Spend.Per.ADA',
'Pct Test Grade': 'Pct.Test.Grade'}, inplace=True)
raw_data[:5]
County.Code | Grade | Test.Id | Students.Tested | Mean.Scale.Score | County.Name | Test.Name | Population | per.capita.income | median.household.income | median.family.income | Spend | ADAttend | Spend.Per.ADA | Pct.Test.Grade | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
196 | 1 | 2 | 7 | 17260 | 368.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
197 | 1 | 2 | 8 | 17249 | 393.5 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036883 |
198 | 1 | 3 | 7 | 16276 | 357.2 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036686 |
199 | 1 | 3 | 8 | 16364 | 414.1 | Alameda | CST Mathematics | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036689 |
200 | 1 | 4 | 7 | 15936 | 385.9 | Alameda | CST English-Language Arts | 1494876 | 34937 | 70821 | 87012 | 1.716407e+09 | 191964.88 | 8941.253984 | 0.036812 |
raw_data.to_csv('star_2013_cleaned_up.csv', index=False)