This notebook records my analysis for the Crowdstorming Research: Many analysts, one dataset project. You can find the datasets on the Open Science Framework project page.
The research questions for this project are:
1: Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players?
2: Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark skin toned players?
As part of the project, other researchers read a summarized version of the analysis plan and gave feedback. That feedback has been incorporated into this document. To see the earlier version of the document, before feedback, go here: ???. Where I made changes in response to feedback, I have noted it with the phrase, "In response to feedback".
The dependent variable in this analysis is the number of red cards given. In order to choose an appropriate statistical test, we'll need to look at the data as a whole. For ease of (my) use, we'll load the data and create a list of dicts. To check that we've done so successfully, we'll print out the first two rows.
import csv
data_reader = csv.DictReader(open('./data/new_crowdstorming.csv', 'r')) # Open datafile
data = []
for row in data_reader: # Create list of dicts from the DictReader object
    data.append(row)
print data[:2]
[{'weight': '72', 'nExp': '750', 'height': '177', 'player': 'Lucas Wilchez', 'meanExp': '0.396', 'rater2': '0.5', 'yellowReds': '0', 'leagueCountry': 'Spain', 'rater1': '0.25', 'nIAT': '712', 'seIAT': '0.000564112354334542', 'seExp': '0.0026964901062936', 'Alpha_3': 'GRC', 'yellowCards': '0', 'photoID': '95212.jpg', 'club': 'Real Zaragoza', 'birthday': '31.08.1983', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.326391469021736', 'refCountry': '1', 'refNum': '1', 'victories': '0', 'games': '1', 'position': 'Attacking Midfielder', 'redCards': '0', 'playerShort': 'lucas-wilchez'}, {'weight': '82', 'nExp': '49', 'height': '179', 'player': 'John Utaka', 'meanExp': '-0.204081632653061', 'rater2': '0.75', 'yellowReds': '0', 'leagueCountry': 'France', 'rater1': '0.75', 'nIAT': '40', 'seIAT': '0.0108748941063986', 'seExp': '0.0615044043187379', 'Alpha_3': 'ZMB', 'yellowCards': '1', 'photoID': '1663.jpg', 'club': 'Montpellier HSC', 'birthday': '08.01.1982', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.203374724564378', 'refCountry': '2', 'refNum': '2', 'victories': '0', 'games': '1', 'position': 'Right Winger', 'redCards': '0', 'playerShort': 'john-utaka'}]
Success! The next step is to check the distribution of the data so we can choose an appropriate model. The key here is the dependent variable, which will be the number of redCards. We can use matplotlib to take a look at the distribution of redCards awarded:
import matplotlib.pyplot as p # Import for visualizations
redCards = [float(row["redCards"]) for row in data]
# Create a histogram
p.figure(1)
n, bins, patches = p.hist(redCards,bins=5,range=(0,5))
p.show()
The distribution is very clearly right-skewed, which makes the Poisson distribution a natural choice. A key property of the Poisson distribution is that its mean and variance are equal, so we should check whether that approximately holds for our data:
import numpy as n # Import numpy library
print n.var(redCards)
print n.mean(redCards)
0.0127439009136
0.0125592352152
We can see that they're pretty close to equal. So we know what distribution to use for the dependent variable. Given that we have a number of independent variables to include, Poisson regression is a natural model choice.
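As a minimal sketch (not part of the original pipeline), we can check what a Poisson distribution with the observed mean implies: nearly all dyads should have zero red cards, which matches the heavy right skew in the histogram.

```python
import math

# Poisson probabilities at the observed mean (~0.01256 red cards per dyad)
mu = 0.0125592352152

def poisson_pmf(k, mu):
    # P(K = k) = exp(-mu) * mu^k / k!
    return math.exp(-mu) * mu ** k / math.factorial(k)

for k in range(3):
    print(k, round(poisson_pmf(k, mu), 6))
```

About 98.75% of the probability mass sits at zero red cards, so the near-empty upper bins of the histogram are exactly what a Poisson model predicts.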
What about the independent variables? What do those look like?
The independent variable of greatest interest is the skin tone ratings. Let's make sure they're reliable. To begin with, we apply scipy's normality test, which is based on the D'Agostino-Pearson normality test. This will help us decide which inter-rater reliability measure to use.
import scipy.stats as s
# Check IRR of ratings
rater1 = [float(row["rater1"]) for row in data if "NA" not in [row["rater1"],row["rater2"]]]
rater2 = [float(row["rater2"]) for row in data if "NA" not in [row["rater1"],row["rater2"]]]
# Print results of scipy's normality test (based on the D'Agostino-Pearson test)
print s.stats.normaltest(rater1, axis=0)
print s.stats.normaltest(rater2, axis=0)
(17807.487876428768, 0.0)
(16465.879873231894, 0.0)
The second value is a p-value -- for this normality test, a value less than .05 means that normality cannot be assumed. So we'll test inter-rater reliability with Spearman's rank correlation coefficient (rho), which does not assume normality.
print "Spearman: ", s.spearmanr(rater1,rater2)
Spearman: (0.85764169209866958, 0.0)
A rho of .86 indicates reasonably strong agreement between the raters. We can apply a simple transformation to average the ratings (see below).
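To see why Spearman's rho suits non-normal data, here is a toy example (unrelated to our dataset): rho depends only on the ranks of the values, so any monotonic relationship scores a perfect 1 even when the raw values are far from normally distributed.

```python
from scipy.stats import spearmanr, pearsonr

# Toy data: y is a monotonic but non-linear function of x
x = [1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16]

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)
print(rho)  # ranks agree perfectly, so rho = 1.0
print(r)    # Pearson's r is below 1 because the relation is non-linear
```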
Before continuing on, I'm going to reformat this data into a pandas dataframe. I'll do this by importing the pandas library, reading in the data again, and selecting out some important columns. Again, we'll print out the first few rows to make sure it looks sensible.
import pandas
df = pandas.read_csv("./data/new_crowdstorming.csv")
keys = ['playerShort','refNum','games','goals','yellowCards','redCards','position','meanIAT','meanExp', 'rater1', 'rater2','club','leagueCountry']
df = df[keys]
print df[:3]
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
2    abdon-prats       3      1      0            1         0

               position   meanIAT   meanExp  rater1  rater2             club  \
0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza
1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC
2                   NaN  0.369894  0.588297     NaN     NaN     RCD Mallorca

  leagueCountry
0         Spain
1        France
2         Spain
Looks good!
As mentioned above, we will average the two ratings:
# Drop NA ratings and make an average
df = df.dropna(subset=['rater1','rater2'])
df['rating'] = (df['rater1'] + df['rater2']) / 2
print df[:3]
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
5   aaron-hughes       4      1      0            0         0

               position   meanIAT   meanExp  rater1  rater2             club  \
0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza
1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC
5           Center Back  0.325185  0.538462    0.25    0.00        Fulham FC

  leagueCountry  rating
0         Spain   0.375
1        France   0.750
5       England   0.125
In addition to the independent variable of interest, there are a number of potential moderating variables, including the player's position. To reduce position to a manageable number of categories, we first look at the goal-scoring rate (goals per game) for each position:
positions = df.groupby(['position'])
positions['goals'].mean() / positions['games'].mean()
position
Attacking Midfielder    0.162301
Center Back             0.051606
Center Forward          0.319852
Center Midfielder       0.122160
Defensive Midfielder    0.061550
Goalkeeper              0.000303
Left Fullback           0.038574
Left Midfielder         0.126883
Left Winger             0.231471
Right Fullback          0.034196
Right Midfielder        0.119648
Right Winger            0.245183
dtype: float64
From this, we can create a smaller number of positions by binning the goal-scoring rates into the bins: 0-.01; .01-.1; .1-.2; .2-.3; .3-.4. This creates 5 categories for this variable: goalkeeper (0.000307); defense, which includes defensive midfielder (0.060537), center back (0.051464), left fullback (0.037560), and right fullback (0.033997); midfield, which includes center midfielder (0.119374), left midfielder (0.125318), right midfielder (0.120007), and attacking midfielder (0.159464); offense, which includes left winger (0.228310) and right winger (0.239445); and center forward (0.316328). If there was no position data for a dyad, it was dropped. This new variable was called 'positionGroup':
before_drop = len(df)
df = df.dropna(subset=['position'])
print "rows dropped: ", before_drop - len(df)
def f(x):
    positionDict = {"Attacking Midfielder" : "Midfield", "Center Back" : "Defense",
                    "Center Forward" : "Offense", "Center Midfielder" : "Midfield",
                    "Defensive Midfielder" : "Defense", "Goalkeeper" : "Goalkeeper",
                    "Left Fullback" : "Defense", "Left Midfielder" : "Midfield",
                    "Left Winger" : "Offense", "Right Fullback" : "Defense",
                    "Right Midfielder" : "Midfield", "Right Winger" : "Offence"}
    return positionDict[x]
df['positionGroup'] = df['position'].apply(f)
print df[:10]['positionGroup']
rows dropped:  8461
0       Midfield
1        Offence
5        Defense
6        Defense
7        Defense
8     Goalkeeper
9        Defense
10       Defense
11       Offense
12       Defense
Name: positionGroup, dtype: object
We can see here that 8,461 rows were dropped, and that we successfully created the new column.
before_drop = len(df)
df = df.dropna(subset=['meanIAT', 'meanExp'])
print "rows dropped: ", before_drop - len(df)
# Rescale the bias measures so that a one-unit change in the model
# corresponds to a .01 change on the original scale
df['meanIAT'] = df['meanIAT'] * 100
df['meanExp'] = df['meanExp'] * 100
print df[:3]
rows dropped:  146
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
5   aaron-hughes       4      1      0            0         0

               position    meanIAT    meanExp  rater1  rater2  \
0  Attacking Midfielder  32.639147  39.600000    0.25    0.50
1          Right Winger  20.337472 -20.408163    0.75    0.75
5           Center Back  32.518515  53.846154    0.25    0.00

              club leagueCountry  rating positionGroup
0    Real Zaragoza         Spain   0.375      Midfield
1  Montpellier HSC        France   0.750       Offence
5        Fulham FC       England   0.125       Defense
exposure_array = df['games'].values # Number of games will serve as the exposure variable
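It is worth pausing on what an exposure variable does. A minimal sketch, assuming the standard log-link Poisson formulation (this is illustrative code, not part of the analysis pipeline): with an exposure term, the expected count is the exposure times the exponentiated linear predictor, so the coefficients describe the red-card rate per game rather than the raw count.

```python
import math

# Under the log link with exposure: E[redCards] = games * exp(X . beta)
def expected_red_cards(games, eta):
    # eta stands in for the linear predictor X . beta of a dyad
    return games * math.exp(eta)

# Two dyads with identical covariates but different numbers of games
# differ in expected count exactly by the ratio of games played
ratio = expected_red_cards(10, -4.0) / expected_red_cards(1, -4.0)
print(round(ratio, 6))  # -> 10.0
```

This is why dyads spanning many games can be compared fairly to single-game dyads within one model.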
To answer question 1, we'll look at the coefficient of regression for the variable ratings.
To answer question 2a, we'll look at the coefficient of regression of the interaction term for the variable ratings and the meanIAT.
To answer question 2b, we'll look at the coefficient of regression of the interaction term for the variable ratings and the meanExp.
Okay, let's begin!
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
# Create + fit poisson model
def test_question(y, X, exposure_array):
    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)
    poisson_res = poisson_mod.fit()
    print poisson_res.summary()
    # print np.exp(poisson_res.params)
This function takes patsy design matrices and fits a Poisson model with them, using exposure_array (defined above) as an exposure variable. We interact each of the other predictors with the variable of interest, rating, in order to estimate the unique effect of rating when the other variables are controlled for:
# Define x and y
y, X = dmatrices('redCards ~ rating + rating*goals + rating*positionGroup + rating*yellowCards + rating*meanIAT + rating*meanExp', data=df, return_type='dataframe')
test_question(y, X, exposure_array)
Optimization terminated successfully.
         Current function value: 0.062406
         Iterations 11
                          Poisson Regression Results
==============================================================================
Dep. Variable:               redCards   No. Observations:               116014
Model:                        Poisson   Df Residuals:                   115996
Method:                           MLE   Df Model:                           17
Date:                Thu, 24 Jul 2014   Pseudo R-squ.:                 0.08945
Time:                        12:18:53   Log-Likelihood:                -7240.0
converged:                       True   LL-Null:                       -7951.2
                                        LLR p-value:                2.312e-292
======================================================================================================
                                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------------------
Intercept                             -7.5088      0.790     -9.505      0.000        -9.057    -5.960
positionGroup[T.Goalkeeper]            0.0383      0.133      0.288      0.774        -0.223     0.300
positionGroup[T.Midfield]             -0.5870      0.106     -5.558      0.000        -0.794    -0.380
positionGroup[T.Offence]              -0.1680      0.225     -0.746      0.456        -0.610     0.274
positionGroup[T.Offense]              -0.4165      0.126     -3.314      0.001        -0.663    -0.170
rating                                 2.1084      1.525      1.382      0.167        -0.881     5.098
rating:positionGroup[T.Goalkeeper]    -0.9424      0.536     -1.759      0.079        -1.993     0.108
rating:positionGroup[T.Midfield]       0.4554      0.277      1.644      0.100        -0.088     0.998
rating:positionGroup[T.Offence]       -0.1720      0.427     -0.403      0.687        -1.009     0.664
rating:positionGroup[T.Offense]        0.0743      0.259      0.287      0.774        -0.434     0.583
goals                                 -0.0251      0.027     -0.926      0.355        -0.078     0.028
rating:goals                          -0.0098      0.065     -0.149      0.881        -0.138     0.118
yellowCards                            0.0588      0.026      2.249      0.025         0.008     0.110
rating:yellowCards                    -0.1284      0.072     -1.779      0.075        -0.270     0.013
meanIAT                                0.0607      0.026      2.309      0.021         0.009     0.112
rating:meanIAT                        -0.0547      0.051     -1.071      0.284        -0.155     0.045
meanExp                               -0.0001      0.004     -0.037      0.970        -0.007     0.007
rating:meanExp                         0.0039      0.008      0.514      0.607        -0.011     0.019
======================================================================================================
The key results here are in the first column: the regression coefficients. A regression coefficient can be interpreted as: "for a one unit change in the predictor variable, the difference in the logs of expected counts is expected to change by the respective regression coefficient, given the other predictor variables in the model are held constant." (source) In order to better interpret this result, as well as to provide a standardized result that can be compared directly to other researchers' analyses, we'll transform the coefficients into incidence rate ratios. We can do this by exponentiating them using numpy's exp function:
# The same function as above, but displaying different output.
def test_question(y, X, exposure_array):
    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)
    poisson_res = poisson_mod.fit()
    # print poisson_res.summary()
    print np.exp(poisson_res.params)
test_question(y, X, exposure_array)
Optimization terminated successfully.
         Current function value: 0.062406
         Iterations 11
Intercept                             0.000548
positionGroup[T.Goalkeeper]           1.039091
positionGroup[T.Midfield]             0.555981
positionGroup[T.Offence]              0.845336
positionGroup[T.Offense]              0.659383
rating                                8.235204
rating:positionGroup[T.Goalkeeper]    0.389701
rating:positionGroup[T.Midfield]      1.576794
rating:positionGroup[T.Offence]       0.841944
rating:positionGroup[T.Offense]       1.077177
goals                                 0.975218
rating:goals                          0.990290
yellowCards                           1.060577
rating:yellowCards                    0.879504
meanIAT                               1.062622
rating:meanIAT                        0.946804
meanExp                               0.999861
rating:meanExp                        1.003863
dtype: float64
We can interpret these exponentiated coefficients (incidence rate ratios, or IRRs) as the multiplicative change in the rate of the dependent variable (redCards) for a one-unit increase in the predictor variable. Because interaction terms have been included, each main-effect IRR applies when all interacting variables are zero.
This result, 8.235204, can be interpreted as: "for each increase of 1 in the rating variable, the expected rate of red cards is multiplied by about 8.24." Keep in mind, though, that ratings range from 0-1. An alternative interpretation is thus: "a player whose skin was rated darkest is roughly 8.24 times as likely as a player whose skin was rated lightest to receive a red card."
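As a quick arithmetic check (the coefficient is taken from the summary table above), the IRR is just the exponentiated regression coefficient:

```python
import math

# Exponentiating the rating coefficient from the regression summary
# reproduces the reported IRR, up to rounding of the coefficient
coef = 2.1084
print(round(math.exp(coef), 3))  # ~8.235
```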
It is also worth noting the standard error, confidence interval, etc in the original summary:
coef std err z P>|z| [95.0% Conf. Int.]
2.1084 1.525 1.382 0.167 -0.881 5.098
Given the non-significant p-value (0.167) and a confidence interval that spans zero, it is hard to have much confidence in this result.
For this question, we want to look not just at the influence of ratings on redCards but the influence of meanIAT on the influence of ratings on redCards. The interaction coefficient is therefore what we're interested in:
coef std err z P>|z| [95.0% Conf. Int.]
-0.0547 0.051 -1.071 0.284 -0.155 0.045
The IRR is:
rating:meanIAT 0.946804
This IRR is below 1, indicating a slight decrease in the influence of ratings on redCards for each one-unit increase in meanIAT, though the interaction is not significant (p = 0.284).
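The same arithmetic applies here (the interaction coefficient is taken from the summary table above):

```python
import math

# Exponentiating the rating:meanIAT interaction coefficient gives the
# multiplicative change in the rating IRR per one-unit increase in
# meanIAT (i.e. per .01 on the original scale, after the x100 rescale)
coef = -0.0547
print(round(math.exp(coef), 4))  # ~0.9468
```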
For this question, we want to look not just at the influence of ratings on redCards but the influence of meanExp on the influence of ratings on redCards. The interaction coefficient is therefore what we're interested in:
coef std err z P>|z| [95.0% Conf. Int.]
0.0039 0.008 0.514 0.607 -0.011 0.019
The IRR is:
rating:meanExp 1.003863
This IRR is just above 1, indicating a slight increase in the influence of ratings on redCards for each one-unit increase in meanExp, though the interaction is not significant (p = 0.607).
In this analysis, we asked two questions: whether soccer referees were more likely to give red cards to dark skin toned players than light skin toned players, and whether implicit or explicit skin-tone prejudice in the country of origin of the referee moderated this preference. To test these questions, we used a dataset consisting of dyads of players and referees, the number of red cards given by referees to the players, ratings of the players' skin tone, and several other variables of interest. Given the heavily right-skewed count distribution of the red cards, we chose Poisson regression as our model. Number of games was used as an exposure variable, with goals, position, yellow cards, and the country-of-origin bias measures meanIAT (implicit) and meanExp (explicit) included in the model. Although the data were hierarchical, with players embedded in teams embedded in leagues, the information provided was deemed inadequate for a hierarchical model. All dyads missing any of this data were excluded, leaving 116,014 total observations.

We found an incidence rate ratio of 8.235204 for ratings, suggesting that a player whose skin was rated as darkest was roughly 8.24 times as likely as a player whose skin was rated lightest to receive a red card. However, this difference was not significant, with a large standard error (1.525) and wide confidence interval (-0.881, 5.098) around the regression coefficient (2.1084). Our analysis found a slight negative impact of meanIAT on the influence of skin tone, in the form of an IRR of 0.946804; again, it is difficult to have confidence in the result (coeff: -0.0547; std error: 0.051; p: 0.284; CI: -0.155, 0.045). Finally, we found a slight positive impact of meanExp on the influence of skin tone, in the form of an IRR of 1.003863 (coeff: 0.0039; std error: 0.008; p: 0.607; CI: -0.011, 0.019). These results do not allow us to infer an influence of skin tone on the number of red cards given.
This does not allow us to say that an effect does not exist: a different methodology, or a different dataset (perhaps one with IAT or Exp measures for individual referees, or more comprehensive team and country data allowing hierarchical modeling) might find better evidence.