It's always fun to play around with marks distributions. Let's load some dependencies...
import numpy
import pandas as pd
from pylab import *
Let's also ensure that what Pandas prints is fully displayed as HTML.
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 2000)
... and get started with Tamil Nadu SSLC results.
The data is provided in this structure:
print(open('D:/site/gramener.com/viz/edu/TN-10-2012/README.TXT').read()[2600:3700])
---------------------------------------------------------------------------
Structure for S.S.L.C DATABASE : NET10M_S.DBF    Records : 967420
----------------------------------------------------------------------------------
Field  Field Name  Type       Width
    1  DIST        Character      2  -> Edn.District Code
    2  REGNO       Character      6  -> Registration No.
    3  NAME        Character     29
    4  MARK1       Character      3  -> Language Mark
    5  MARK2       Character      3  -> English Mark
    6  MARK3       Character      3  -> Mathematics Mark
    7  MARK4       Character      3  -> Science Mark
    8  MARK5       Character      3  -> Social Science Mark
    9  TOTAL       Character      3  -> Total Mark
   10  PASS        Character      1  -> Pass "P" indicates passed in examination
   11  WITHHELD    Character      1  -> Withheld "W" indicates result is withheld
   12  BLANK       Character      2
-------------------------
It's a CSV file without headers. So let's load it.
tn = pd.read_csv('D:/site/gramener.com/viz/edu/TN-10-2012/net10m_s.csv', names=['DIST', 'REGNO', 'NAME', 'LANG', 'ENG', 'MATH', 'SCI', 'SOC', 'MARKS', 'PASS', 'WITHHELD'], index_col=False)
tn.head(10)
DIST | REGNO | NAME | LANG | ENG | MATH | SCI | SOC | MARKS | PASS | WITHHELD | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10001 | ANISH M | 060 | 036 | 047 | 039 | 036 | 218 | P | NaN |
1 | 1 | 10002 | AJITHKUMAR K | 070 | 048 | 076 | 050 | 061 | 305 | P | NaN |
2 | 1 | 10003 | ASHOK S | 072 | 045 | 070 | 052 | 048 | 287 | P | NaN |
3 | 1 | 10004 | AJITHSINGH J | 067 | 038 | 050 | 050 | 048 | 253 | P | NaN |
4 | 1 | 10005 | ARAVINTH A | 056 | 031 | 051 | 026 | 041 | 205 | NaN | NaN |
5 | 1 | 10006 | ANTO S | AAA | AAA | AAA | AAA | AAA | AAA | NaN | A |
6 | 1 | 10007 | DAVIDRAJ T | 053 | 028 | 036 | 037 | 022 | 176 | NaN | NaN |
7 | 1 | 10008 | JAGATHEES S | 051 | 035 | 035 | 044 | 035 | 200 | P | NaN |
8 | 1 | 10009 | MANIKANDAN S | 051 | 038 | 045 | 035 | 036 | 205 | P | NaN |
9 | 1 | 10010 | MATHEES N | 077 | 045 | 080 | 083 | 057 | 342 | P | NaN |
We have each student's marks in one row. The district and registration number together probably form the unique key. (It might just be the registration number alone, too.) Their marks in Language, English, etc. are given. We also know whether they passed (the PASS column is "P") or failed (it's NaN, i.e. empty). I've no idea what the WITHHELD column holds. Let's ignore it.
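A quick way to test that unique-key hypothesis is `DataFrame.duplicated`. A minimal sketch on a made-up miniature of the data (the values below are hypothetical):

```python
import pandas as pd

# Toy frame mimicking the TN layout (hypothetical values)
df = pd.DataFrame({
    'DIST':  [1, 1, 2],
    'REGNO': [10001, 10002, 10001],
})

# REGNO alone repeats across districts here, but (DIST, REGNO) is unique
regno_unique = not df.duplicated(subset=['REGNO']).any()
pair_unique = not df.duplicated(subset=['DIST', 'REGNO']).any()
print(regno_unique, pair_unique)  # False True
```

Running the same two checks on `tn` would settle which of the two candidate keys actually holds.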
As always, this contains messy data. The marks are not always numbers. You can see that Anto S in row 5 was absent for the exams, and the mark is shown as "AAA". We'd rather have that as a blank, i.e. NaN. So let's clean it up.
First, we need to check what other non-numeric values are there besides "AAA".
numpy.sort(tn['LANG'].unique())
array([nan, '001', '002', '003', '004', '005', '006', '007', '008', '009', '010', '011', '012', '013', '014', '015', '016', '017', '018', '019', '020', '021', '022', '023', '024', '025', '026', '027', '028', '029', '030', '031', '032', '033', '034', '035', '036', '037', '038', '039', '040', '041', '042', '043', '044', '045', '046', '047', '048', '049', '050', '051', '052', '053', '054', '055', '056', '057', '058', '059', '060', '061', '062', '063', '064', '065', '066', '067', '068', '069', '070', '071', '072', '073', '074', '075', '076', '077', '078', '079', '080', '081', '082', '083', '084', '085', '086', '087', '088', '089', '090', '091', '092', '093', '094', '095', '096', '097', '098', '099', 'AAA'], dtype=object)
This doesn't look too bad. Other than "AAA", everything else is a number. Let's convert the "AAA" to nan.
for column in ['LANG', 'ENG', 'MATH', 'SCI', 'SOC', 'MARKS']:
    tn[column] = tn[column].replace('AAA', numpy.nan).astype(float)
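In recent pandas, `pd.to_numeric` with `errors='coerce'` does the same job without having to enumerate the bad values first — a sketch on a toy series:

```python
import pandas as pd

s = pd.Series(['060', 'AAA', '047'])
# Anything that doesn't parse as a number ('AAA' here) becomes NaN
cleaned = pd.to_numeric(s, errors='coerce')
print(cleaned.tolist())  # [60.0, nan, 47.0]
```

This is handy when you can't be sure "AAA" is the only stray string in a column.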
tn.head(10)
DIST | REGNO | NAME | LANG | ENG | MATH | SCI | SOC | MARKS | PASS | WITHHELD | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 10001 | ANISH M | 60 | 36 | 47 | 39 | 36 | 218 | P | NaN |
1 | 1 | 10002 | AJITHKUMAR K | 70 | 48 | 76 | 50 | 61 | 305 | P | NaN |
2 | 1 | 10003 | ASHOK S | 72 | 45 | 70 | 52 | 48 | 287 | P | NaN |
3 | 1 | 10004 | AJITHSINGH J | 67 | 38 | 50 | 50 | 48 | 253 | P | NaN |
4 | 1 | 10005 | ARAVINTH A | 56 | 31 | 51 | 26 | 41 | 205 | NaN | NaN |
5 | 1 | 10006 | ANTO S | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A |
6 | 1 | 10007 | DAVIDRAJ T | 53 | 28 | 36 | 37 | 22 | 176 | NaN | NaN |
7 | 1 | 10008 | JAGATHEES S | 51 | 35 | 35 | 44 | 35 | 200 | P | NaN |
8 | 1 | 10009 | MANIKANDAN S | 51 | 38 | 45 | 35 | 36 | 205 | P | NaN |
9 | 1 | 10010 | MATHEES N | 77 | 45 | 80 | 83 | 57 | 342 | P | NaN |
Now, to get a quick sense of the failure rate in each subject (note that this counts a score of exactly 35 as a fail):
tn_marks = tn[['LANG', 'ENG', 'MATH', 'SCI', 'SOC']].dropna()
1 - (tn_marks > 35).sum().astype(float) / tn_marks.count()
LANG    0.079699
ENG     0.167786
MATH    0.152373
SCI     0.231818
SOC     0.120569
dtype: float64
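A small aside on the cut-off: `1 - (marks > 35)` treats a student scoring exactly 35 as failing, and is the same as asking for `marks <= 35` directly — whereas the `< 35` cut-off used further down excludes them. A toy check (the marks below are made up):

```python
import pandas as pd

# Made-up marks for illustration; 35 is the pass mark in the source data
marks = pd.Series([30.0, 35.0, 36.0, 90.0])

# The notebook's formulation: one minus the fraction scoring above 35...
frac_fail_a = 1 - (marks > 35).sum() / marks.count()
# ...is the same as the fraction scoring 35 or below
frac_fail_b = (marks <= 35).sum() / marks.count()
print(frac_fail_a, frac_fail_b)  # 0.5 0.5
```

Whether a 35 counts as a pass or a fail shifts these ratios slightly, so it's worth being deliberate about which comparison you use.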
Before we start looking at the distribution, let's set some defaults:
rcParams['figure.figsize'] = 25, 10
def distribution(series):
    # Drop the nans, get the frequency of each mark, and sort by mark
    marks = series.dropna().value_counts().sort_index()
    # Ensure that all marks from 0 to 100 are present, and draw a bar chart
    pd.Series(marks, index=range(101)).plot(kind='bar')
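A note on the reindexing trick: passing an existing Series plus a new index to the `pd.Series` constructor reindexes it, leaving NaN for marks nobody scored. An explicit `reindex` with `fill_value=0` makes those bars zero-height instead — a sketch on toy data:

```python
import pandas as pd

# Toy marks out of 3: nobody scored 0 or 2
marks = pd.Series([1.0, 1.0, 3.0, None])
counts = marks.dropna().value_counts().sort_index()
# Reindex over the full range so missing marks show up as explicit zeros
full = counts.reindex(range(4), fill_value=0)
print(full.tolist())  # [0, 2, 0, 1]
```

Either way, the x-axis always runs 0-100, so gaps in the distribution are visible rather than silently dropped.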
distribution(tn['ENG'])
The English marks distribution looks like a good normal distribution overall, except for the spike at 35 marks. It's easy to figure out what's happening here: 35 is the pass mark. Either the teachers or the Board are liberally scooping up students with "just-fail" marks of 34, 33, etc. and awarding them 35 marks.
(We're not saying that this is a good or bad thing. Just that it's happening.)
It's also important to note that not everyone who gets 34 is made to pass. Some do, some don't. So there does appear to be an element of luck in whether a student passes or fails. (Offering prayers for English might not be a bad idea.)
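If we wanted to quantify the spike rather than eyeball it, we could compare the count at the pass mark with its immediate neighbours. A sketch on made-up counts (the real frequencies would come from `value_counts()` as in `distribution` above):

```python
import pandas as pd

# Hypothetical counts around the pass mark, shaped like the English chart
counts = pd.Series({33: 120, 34: 15, 35: 900, 36: 300})

# The hole at 34 and the pile-up at 35 are the moderation signature
spike_ratio = counts[35] / ((counts[34] + counts[36]) / 2)
hole_ratio = counts[34] / counts[33]
print(spike_ratio > 5, hole_ratio < 0.2)  # True True
```

A spike several times its neighbours, next to a near-empty bin just below it, is hard to explain by anything other than marks being bumped up.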
distribution(tn['SOC'])
Now, social science marks tell a different story. Almost no one fails between 30 and 35. Further, students scoring between 25-30 are also promoted to the pass mark. This appears to be a fairly consistent policy, though it is unclear whether it is the teachers or the Board that have decided that the social sciences shall behave this way.
distribution(tn['LANG'])
In the languages (which is mostly Tamil in Tamil Nadu), there is not much moderation. But as you can see, there isn't much need for moderation in the first place. More students are getting 90% than nearly any other mark. Only a handful of students are being made to pass. As with English, this is based on luck. It may be worth praying for the languages as well.
distribution(tn['MATH'])
Unlike the earlier subjects, the extent of moderation in Maths is quite large. As with the social sciences, not many students score between 30 and 34. Further, those scoring between 26 and 30 are also given the benefit of the doubt.
distribution(tn['SCI'])
As with mathematics, the extent of moderation is quite high -- in fact, even higher than in mathematics. But does that actually lead to more students passing? Let's look at the percentage of students that fail in each subject.
rcParams['figure.figsize'] = 4, 3
marks = tn[['LANG', 'ENG', 'MATH', 'SCI', 'SOC']]
((marks < 35).sum().astype(float) / marks.count()).sort_values().plot(kind='bar')
rcParams['figure.figsize'] = 25, 10
So despite the extensive moderation in science, the fail ratio is the highest here.
Now let's move on to the next state: Karnataka.
ka = pd.read_csv('d:/site/gramener.com/viz/autolyse/data/karnataka-marks-2011-subfields.csv')
ka
<class 'pandas.core.frame.DataFrame'>
Int64Index: 855446 entries, 0 to 855445
Data columns (total 30 columns):
DIST_CODE                                    855446 non-null values
SCHOOL_CODE                                  855446 non-null values
DOB                                          855391 non-null values
NRC_CASTE_CODE                               855446 non-null values
NRC_GENDER_CODE                              855446 non-null values
NRC_MEDIUM                                   855446 non-null values
NRC_PHYSICAL_CONDITION                       855446 non-null values
TOTAL_MARKS                                  841629 non-null values
Hindi 3rd                                    757012 non-null values
Kannada 3rd                                   49814 non-null values
English 3rd                                   24183 non-null values
Arabic 3rd                                      134 non-null values
Urdu 3rd                                         58 non-null values
Sanskrit 3rd                                   5391 non-null values
Science                                      833730 non-null values
Mathematics                                  832720 non-null values
Social Science                               835894 non-null values
Kannada                                      706916 non-null values
Hindi                                          3693 non-null values
Telugu                                          701 non-null values
Marathi                                       16855 non-null values
Indian Social Studies                            81 non-null values
Indian Political Science and Urban Policy       195 non-null values
Indian Economics                                207 non-null values
Tamil                                          1899 non-null values
Urdu                                          33316 non-null values
English                                       56405 non-null values
Sanskrit                                      17661 non-null values
Kannada 2nd                                   80486 non-null values
English 2nd                                  753933 non-null values
dtypes: float64(23), object(7)
distribution(ka['English 2nd'])
distribution(ka['Kannada'])
distribution(ka['Mathematics'])
distribution(ka['Science'])
distribution(ka['Social Science'])
ka.groupby('NRC_GENDER_CODE')['TOTAL_MARKS'].mean()
NRC_GENDER_CODE
B    303.662583
G    335.868927
Name: TOTAL_MARKS, dtype: float64
OK, how sure are we that girls score more? How likely is it that a gap this large would show up by chance if the means were actually equal?
import scipy.stats
boys = ka[ka['NRC_GENDER_CODE'] == 'B']['TOTAL_MARKS']
girls = ka[ka['NRC_GENDER_CODE'] == 'G']['TOTAL_MARKS']
t, prob = scipy.stats.ttest_ind(boys.dropna(), girls.dropna())
prob
0.0
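A p-value printed as exactly 0.0 just means it underflowed a 64-bit float; with over 800,000 students, even a modest gap in means is overwhelmingly significant. A self-contained sketch with synthetic marks (using Welch's variant, which drops the equal-variance assumption; the group sizes and spread here are made up):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic groups whose means differ by roughly the observed gap
boys = rng.normal(303, 80, size=5000)
girls = rng.normal(335, 80, size=5000)

# equal_var=False selects Welch's t-test
t, p = scipy.stats.ttest_ind(boys, girls, equal_var=False)
print(t < 0, p < 1e-6)  # True True
```

The negative t-statistic matches the direction of the gap (the first group's mean is lower), and the p-value is vanishingly small well before the sample reaches the real data's size.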
Next up: the ICSE results. The data is from https://github.com/octonion/CISCEResults2013. It caused a media ruckus.
icse = pd.read_csv('d:/site/gramener.com/viz/edu/CISCEResults2013/icse_scores.csv', names=['Student', 'Subject', 'Mark'])
icse
<class 'pandas.core.frame.DataFrame'>
Int64Index: 844836 entries, 0 to 844835
Data columns (total 3 columns):
Student    844836 non-null values
Subject    844836 non-null values
Mark       844836 non-null values
dtypes: object(3)
We do have to do a bit of cleanup on the marks:
numpy.sort(icse['Mark'].unique())
array(['09', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '24', '25', '26', '27', '28', '29', '30', '35', '38', '40', '42', '44', '46', '48', '50', '52', '54', '58', '60', '62', '64', '66', '69', '72', '74', '76', '78', '80', '83', '86', '88', '90', '92', '94', '95', '96', '97', '98', '99', 'A', 'ABS *', 'B', 'C', 'D', 'PCA *', 'PCNA *', 'SPCA *', 'SPCNA*', 'X', 'XXX'], dtype=object)
non_numeric = ['A', 'ABS *', 'B', 'C', 'D', 'PCA *', 'PCNA *', 'SPCA *', 'SPCNA*', 'X', 'XXX']
icse['Mark'] = icse['Mark'].replace(non_numeric, numpy.nan).astype(float)
Now let's take a look at the overall mark distribution across subjects.
distribution(icse['Mark'])
That's definitely weird. Let's break it up by subject.
icse['Subject'].value_counts().head(20)
Result    105644
ENG       105612
HCG       105608
SUPW      105592
MAT       102399
SCI        93756
HIN        69364
CTA        62564
EAS        13006
BEN        11997
CAS        11418
CST         8290
PED         7481
MAL         6350
ECO         5486
KAN         4580
HSC         4090
TEL         3255
ORI         3189
ART         2816
dtype: int64
distribution(icse[icse['Subject'] == 'ENG']['Mark'])
distribution(icse[icse['Subject'] == 'HCG']['Mark'])
distribution(icse[icse['Subject'] == 'MAT']['Mark'])
distribution(icse[icse['Subject'] == 'SCI']['Mark'])
distribution(icse[icse['Subject'] == 'HIN']['Mark'])
Finally, the CBSE class XII results. This data is also from https://github.com/octonion/CISCEResults2013.
cbse = pd.read_csv('d:/site/gramener.com/viz/edu/CISCEResults2013/cbse_scores_revised.csv', names=['ROLLNO', 'SUB_ID', 'SUBJECT', 'MARK', 'GRADE', 'SOMETHING'])
cbse.head()
ROLLNO | SUB_ID | SUBJECT | MARK | GRADE | SOMETHING | |
---|---|---|---|---|---|---|
0 | 1699000 | 301 | ENGLISH CORE | 043 | D1 | False |
1 | 1699000 | 302 | HINDI CORE | 072 | B2 | False |
2 | 1699000 | 42 | PHYSICS | 047 | D2 | False |
3 | 1699000 | 43 | CHEMISTRY | 056 | D1 | False |
4 | 1699000 | 83 | COMPUTER SCIENCE | 048 | D2 | False |
There are a variety of subjects in class XII that one could choose from. It's quite a long tail, actually, as you can see below.
cbse['SUBJECT'].value_counts().head(50).plot(kind='bar')
The marks are messed up, as always. We'll need to clean them up, but this is worse than usual.
cbse['MARK'].unique()
array(['043', '072', '047', '056', '048', '---', '014 F', nan, '061', '065', '055', '053', '050', '016F', '022F', '002F', '011FT', '016FT', '023 FT', '073', '080', '052', '070', '008 F', '059', '035FT', '049', '013 F', '036', '076', '058', '057', '064', '046', '045', '031 FT', '067', '054', '060', '083', '077', '033', '043 FT', '041', '051', '040', '004F', '028FT', '028 FT', '069', '063', '075', '085', '081', '018FT', '044', '032 FT', '030 FT', '030FT', '042FT', '039FT', '002 F', '026FT', '078', '084', '005F', '033FT', '029FT', '066', '010F', '012F', '031FT', '043FT', '036FT', '068', '086', '062', '074', '017FT', '071', '034FT', '082', '079', '042', '034', '094', '037FT', '041FT', '092', '090', '088', '021F', '014F', '027FT', '095', '035', '020F', '098', '009F', '087', '007F', '025FT', '024FT', '089', '032FT', '099', '038FT', '006F', '003F', '001F', '023FT', '021FT', '000F', '009FT', '038', '014FT', '022FT', '091', '008F', '019FT', '011F', '013F', '040FT', '015FT', '013FT', '093', '096', '097', '037', '039', '019F', '020FT', '044FT', '015F', '018F', '007FT', '017F', '008FT', '005 F', '045FT', '100', '009 F', 'ABA', '010FT', '024F', '012FT', '018 FT', '004 F', '064FT', '010 F', 'AB A', '011 F', '012 F', '049FT', '023F', '001 F', '000 F', '015 FT', '033 FT', '044 FT', '035 FT', '025 FT', '014 FT', '015 F', '047FT', '046FT', '037 FT', '051FT', '042 FT', '036 FT', '007 F', '016 F', '006 F', '034 FT', '040 FT', '021 FT', '026 FT', '027 FT', '007 FT', '025F', '048FT', '003 F', '047 FT', '046 FT', '038 FT', '066FT', '062FT', '057FT', '060FT', '050FP', '045 FP', '022 FT', '026F', '029 FT', '052 FT', '061FT', '024 FT', '045FP', '026FP', '029F', '028F', '024FP', '048FP', '023FP', '055FP', '050FT', '039 FT', '018 F', '017 F', '034FP', '020 FT', '042FP', '020 F', '027 FP', '040FP', '035FP', '031FP', '053FT', '047FP', '029FP', '069FT', '030FP', '036FP', '033FP', '048 FT', '045 FT', '041 FT', '073FT', '019 FT', '019 F', '039FP', '017 FT', '008 FT', '016 FT', '012 FT', '056FT', 
'022 F', '054FP', '006FT', '011 FT', '013 FT', '027F', '043FP', '058FT', '068FT', '037FP', '084FP', '028FP', '068FP', '025 FP', '051FP', '057FP', '062FP', '058FP', '027 F', '046FP', '070FT', '072FT', '074FT', '074 FT', '059 FT', '065FT', '049FP', '024 F', '067FT', '063 FT', '069 FT', '067 FT', '062 FT', '068 FT', '064 FT', '077 FT', '070 FT', '071 FT', '072 FT', '015 FP', '076 FT', '014 FP', '025FP', '071FT', '075FT', '065 FT', '028 FP', '050 FT', '004FT', '051 FP', '023 FP', '026 F', '058 FP', '023 F', '060 FT', '061 FT', '021 F', '025 F', '077FT', '010 FT', '005FT', '059FP', '059FT', '027FP', '057 FT', '053 FT', '032FP', '066 FT', '058 FT', '075 FT', '054FT', '063FT', '056 FT', '021 FP', '049 FT', '078FT', '034 FP', '030 FP', '010 FP', '052FT', '073 FT', '054 FT', '055 FT', '044 FP', '055FT', '013FP', '054 FP', '012 FP', '061FP', '060 FP', '076FT', '011 FP', '024 FP', '029 FP', '041FP', '032 FP', '010FP', '009 FT', '013 FP', '016FP', '038FP', '044FP', '062 FP', '031 FP', '017FP', '020FP', '019FP', '038 FP', '036 FP', '043 FP', '033 FP', '039 FP', '037 FP', '049 FP', '040 FP', '035 FP', '046 FP', '052 FP', '047 FP', '051 FT', '026 FP', '048 FP', '042 FP', '059 FP', '052FP', '066FP', '028 F', '022FP', '053FP', '050 FP', '017 FP', '020 FP', '018FP', '014FP', '009FP', '055 FP', '067FP', '016 FP', '018 FP', '064FP', '030F', '029 F', '060FP', '064 FP', '056FP', '065FP', '072FP', '041 FP', '021FP', '083FP', '012FP', '056 FP', '053 FP', '006 FT', '003FT', '004 FT', '070FP', '015FP', '022 FP', '005 FT', '057 FP'], dtype=object)
Let's just take the first 3 characters and process them. With only the first three characters kept, it turns out we need to worry about just three other strings: '---', 'ABA', and 'AB '. Let's make them all NaNs.
cbse['SCORE'] = cbse['MARK'].dropna().str.slice(0,3).replace({'---':numpy.nan, 'ABA':numpy.nan, 'AB ':numpy.nan}).astype(float)
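An alternative that skips the hand-maintained list of stragglers is a regex extract: pull the leading digits and let everything else fall through to NaN. A sketch on a toy series (values drawn from the patterns above):

```python
import pandas as pd

s = pd.Series(['043', '014 F', '016FT', '---', 'ABA'])
# Capture leading digits if present; non-matching strings become NaN
score = s.str.extract(r'^(\d+)', expand=False).astype(float)
print(score.tolist())  # [43.0, 14.0, 16.0, nan, nan]
```

This also survives new suffix variants ('FP', 'FT', stray spaces) without the regex needing to change.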
Now it's time to look at the distribution of the top subjects.
distribution(cbse[cbse['SUBJECT'] == 'ENGLISH CORE']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'MATHEMATICS']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'PHYSICS']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'CHEMISTRY']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'ECONOMICS']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'PHYSICAL EDUCATION']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'BUSINESS STUDIES']['SCORE'])
distribution(cbse[cbse['SUBJECT'] == 'ACCOUNTANCY']['SCORE'])
I haven't seen anything as weird as these in a very long time.
Let's take a look at the pairwise correlation of subjects:
tn_subjects = tn[['ENG', 'LANG', 'MATH', 'SCI', 'SOC']]
tn_subjects.corr()
ENG | LANG | MATH | SCI | SOC | |
---|---|---|---|---|---|
ENG | 1.000000 | 0.813484 | 0.815920 | 0.819564 | 0.800539 |
LANG | 0.813484 | 1.000000 | 0.766003 | 0.824509 | 0.809650 |
MATH | 0.815920 | 0.766003 | 1.000000 | 0.797137 | 0.777218 |
SCI | 0.819564 | 0.824509 | 0.797137 | 1.000000 | 0.819928 |
SOC | 0.800539 | 0.809650 | 0.777218 | 0.819928 | 1.000000 |
from IPython.display import HTML

def colour(data):
    html = ['<table><thead><tr><th></th>']
    for column in data.columns:
        html.append('<th>' + column + '</th>')
    html.append('</tr></thead><tbody>')
    for index, row in data.iterrows():
        html.append('<tr><th>' + index + '</th>')
        for column, value in row.items():
            # Map 0.75 to red (hue 0) and 0.85 to green (hue 120); above 0.85 is cyan (hue 180)
            hue = (value - .75) / (.85 - .75) * 120 if value < .85 else 180
            html.append('<td style="text-align:right;background-color:hsl({:.0f},100%,50%)">{:.2f}</td>'.format(hue, value))
        html.append('</tr>')
    html.append('</tbody></table>')
    return HTML(''.join(html))
colour(tn_subjects.corr())
ENG | LANG | MATH | SCI | SOC | |
---|---|---|---|---|---|
ENG | 1.00 | 0.81 | 0.82 | 0.82 | 0.80 |
LANG | 0.81 | 1.00 | 0.77 | 0.82 | 0.81 |
MATH | 0.82 | 0.77 | 1.00 | 0.80 | 0.78 |
SCI | 0.82 | 0.82 | 0.80 | 1.00 | 0.82 |
SOC | 0.80 | 0.81 | 0.78 | 0.82 | 1.00 |
At Gramener, we built a visualisation library that makes this sort of thing easier. For example, the above correlation plot is built-in.
(I'm afraid you won't be able to try out this piece of code without a license for the server. Sorry!)
from vis import SVG
HTML(SVG('clusterplot.svg',
width=300,
height=300,
data=tn_subjects,
scatter=120,
gradient=((.75, 'red'), (.80, 'yellow'), (.85, 'green')),
regression=True))
import color
HTML(SVG('subtreemap.svg',
width = 500,
height = 300,
data = tn,
keys = ['DIST'],
values = {'MARKS': 'mean', 'REGNO': len},
size = lambda v: v['REGNO'],
sort = lambda v: v.sort('REGNO', ascending=False),
text = lambda v: v['DIST'],
wrap = False,
color = lambda v: color.gradient(v['MARKS'] / 500, ((0.40, 'red'), (0.60, 'yellow'), (0.80, 'green'))),
aspect = 2,
padding = 3,
parents = False,
))