The UK Higher Education Research Excellence Framework, 2014 results are out, so I gave myself an hour to explore the data, see what's there, and get an idea for some of the more obvious stories we might try to pull out.
The data is published as an Excel spreadsheet, so let's grab a copy.
#Grab the data file from a copied URL
#!curl http://results.ref.ac.uk/\(S\(jjedtxoydmmvwidxuktryu15\)\)/DownloadFile/AllResults/xlsx > ref2014.xlsx
import pandas as pd
xls=pd.ExcelFile('ref2014.xlsx')
#How many sheets are there?
xls.sheet_names
So there's just a single sheet. Let's have a look at it.
#Quick preview of the data
df=pd.read_excel('ref2014.xlsx')
df[:10]
Row 3 (counting form 0) has column codes, row 6 has full column names. Let's load in the data as a simple dataframe using the full column names.
#We have some metadata rows so let's try again...
df=pd.read_excel('ref2014.xlsx',header=6)
df[:5]
The dataset is completely new to me so let's get a feel for some of the things it contains.
#What are the columns?
for col in df.columns: print(col)
Visual inspection of the first few lines of the dataframe shown above suggests that each institution puts in for different units of assessment, and each unit of assessment receives scores on four profile elements:
df['Profile'].unique()
#What are the units of assessment?
df['Unit of assessment name'].unique()
#How many institutions?
df['Institution code (UKPRN)'].unique().size
#How many institutions by unit of assessment, ordered?
#Rather than mulitpl count, let's base this on the Overall profile results
df[df['Profile']=='Overall'].groupby(['Unit of assessment name']).size().order(ascending=False)
#Which institutions submitted to most Units of Assessment?
df[df['Profile']=='Overall'].groupby(['Institution name']).size().order(ascending=False)
The way I phrased the previous question compared to the way I ran the query assumes that an insitution can only appear once per Unit of Assessment. Is that true?
#Can an institution have more than one submission to the same Units of Assessment?
df[df['Profile']=='Overall'].groupby(['Institution name','Unit of assessment name']).size().order(ascending=False)[:10]
Hmmm... so what's going on there then?
dfo=df[df['Profile']=='Overall']
dfo[(dfo['Institution name']=='King\'s College London') & (dfo['Unit of assessment name']=='Allied Health Professions, Dentistry, Nursing and Pharmacy')]
Ah, so we can have multiple submissions... Maybe I should have read some guidance about how the data is presented?!;-)
In that table above, the 4*, 3*, 2* numbers are pretty consistent. Is that unusual? (Perhaps file that thought for later if we have time.)
Folk always want league tables. How about we rank institutions in a unit of assessment? But what does rank mean? Let's start by just doing it on the basis of the Overall scores on the doors, omitting the number of staff submitted.
dfo[dfo['Unit of assessment name']=='Civil and Construction Engineering'].sort(['4*','3*','2*','1*','unclassified'],
ascending=False)[:5]
How does this compare with a ranking on the basis of another Profile? eg Impact?
dfi=df[df['Profile']=='Impact']
dfi[dfi['Unit of assessment name']=='Civil and Construction Engineering'].sort(['4*','3*','2*','1*','unclassified'],
ascending=False)[:5]
This may be an interesting canned question... Take in the name of a unit of assessment and a profile, and return the sorted table.
def rankUoAbyProfile(uoa,profile):
tmp=df[df['Profile']==profile]
tmp=tmp[tmp['Unit of assessment name']==uoa].sort(['4*','3*','2*','1*','unclassified'],ascending=False)
return tmp
rankUoAbyProfile('Computer Science and Informatics','Environment')[:5]
I guess another way of looking at insitutions is by the number of people they're submitting? Can we assume that each profile has the same number of FTEs associated with it? Go with that for now, but really should check.
def rankUoAbySize(uoa,profile='Overall'):
tmp=df[df['Profile']==profile]
tmp=tmp[tmp['Unit of assessment name']==uoa].sort(['FTE Category A staff submitted','4*','3*','2*','1*','unclassified'],ascending=False)
return tmp
rankUoAbySize('History')[:5]
Which submission attracted most entries?
def internalSize(hei):
tmp=dfo[dfo['Institution name']==hei]
tmp=tmp.sort(['FTE Category A staff submitted','4*','3*','2*','1*','unclassified'],ascending=False)
return tmp
internalSize("Open University")
How about ranking on the rankings within an institution?
def internalRanking(hei):
tmp=dfo[dfo['Institution name']==hei]
tmp=tmp.sort(['4*','3*','2*','1*','unclassified'],ascending=False)
return tmp
internalRanking("Open University")
Funding from Gateway to Research?
Okay... that's it, time up....