This notebook records my analysis for the Crowdstorming Research: Many analysts, one dataset project. You can find the datasets on the Open Science Framework project page.
The research questions for this project are:
1: Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players?
2: Are soccer referees from countries high in skin-tone prejudice more likely to award red cards to dark skin toned players?
As part of the project, other researchers read a summarized version of the analysis plan and gave feedback. That feedback has been incorporated into this document. To see the earlier version of the document, before feedback, go here: ???. Where I made changes in response to feedback, I have noted it with the phrase, "In response to feedback".
The dependent variable in this analysis is the number of red cards given. In order to choose an appropriate statistical test, we'll need to look at the data as a whole. For ease of (my) use, we'll load the data and create a list of dicts. To check that we've done so successfully, we'll print out the first two rows.
import csv
data_reader = csv.DictReader(open('./data/new_crowdstorming.csv', 'r')) # Open datafile
data = []
for row in data_reader: # Create list of dicts from the DictReader object
    data.append(row)
print data[:2]
[{'weight': '72', 'nExp': '750', 'height': '177', 'player': 'Lucas Wilchez', 'meanExp': '0.396', 'rater2': '0.5', 'yellowReds': '0', 'leagueCountry': 'Spain', 'rater1': '0.25', 'nIAT': '712', 'seIAT': '0.000564112354334542', 'seExp': '0.0026964901062936', 'Alpha_3': 'GRC', 'yellowCards': '0', 'photoID': '95212.jpg', 'club': 'Real Zaragoza', 'birthday': '31.08.1983', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.326391469021736', 'refCountry': '1', 'refNum': '1', 'victories': '0', 'games': '1', 'position': 'Attacking Midfielder', 'redCards': '0', 'playerShort': 'lucas-wilchez'}, {'weight': '82', 'nExp': '49', 'height': '179', 'player': 'John Utaka', 'meanExp': '-0.204081632653061', 'rater2': '0.75', 'yellowReds': '0', 'leagueCountry': 'France', 'rater1': '0.75', 'nIAT': '40', 'seIAT': '0.0108748941063986', 'seExp': '0.0615044043187379', 'Alpha_3': 'ZMB', 'yellowCards': '1', 'photoID': '1663.jpg', 'club': 'Montpellier HSC', 'birthday': '08.01.1982', 'goals': '0', 'ties': '0', 'defeats': '1', 'meanIAT': '0.203374724564378', 'refCountry': '2', 'refNum': '2', 'victories': '0', 'games': '1', 'position': 'Right Winger', 'redCards': '0', 'playerShort': 'john-utaka'}]
Success! The next step is to check the distribution of the data so we can choose an appropriate model. The key here is the dependent variable, which will be the number of redCards. We can use matplotlib to take a look at the distribution of redCards awarded:
import matplotlib.pyplot as p # Import for visualizations
redCards = [float(row["redCards"]) for row in data]
# Create a histogram
p.figure(1)
n, bins, patches = p.hist(redCards,bins=5,range=(0,5))
p.show()
The distribution is very clearly right-skewed, which makes the Poisson distribution a natural choice. A key property of the Poisson distribution is that its mean and variance are equal, so we should check whether that approximately holds for our data:
import numpy as n # Import numpy library
print n.var(redCards)
print n.mean(redCards)
0.0127439009136
0.0125592352152
We can see that they're pretty close to equal. So we know what distribution to use for the dependent variable. Given that we have a number of independent variables to include, Poisson regression is a natural model choice.
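As a minimal sketch (not part of the original pipeline), we can check what a Poisson distribution with the observed mean implies: nearly all dyads should have zero red cards, which matches the heavy right skew in the histogram.

```python
import math

# Poisson probabilities at the observed mean (~0.01256 red cards per dyad)
mu = 0.0125592352152

def poisson_pmf(k, mu):
    # P(K = k) = exp(-mu) * mu^k / k!
    return math.exp(-mu) * mu ** k / math.factorial(k)

for k in range(3):
    print(k, round(poisson_pmf(k, mu), 6))
```

About 98.75% of the probability mass sits at zero red cards, so the near-empty upper bins of the histogram are exactly what a Poisson model predicts.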
What about the independent variables? What do those look like?
The independent variable of greatest interest is the skin tone ratings. Let's make sure they're reliable. To begin with, we apply scipy's normality test, which is based on the D'Agostino-Pearson normality test. This will help us decide which inter-rater reliability measure to use.
import scipy.stats as s
# Check IRR of ratings
rater1 = [float(row["rater1"]) for row in data if "NA" not in [row["rater1"],row["rater2"]]]
rater2 = [float(row["rater2"]) for row in data if "NA" not in [row["rater1"],row["rater2"]]]
# Print results of scipy's normality test (based on the D'Agostino-Pearson test)
print s.stats.normaltest(rater1, axis=0)
print s.stats.normaltest(rater2, axis=0)
(17807.487876428768, 0.0)
(16465.879873231894, 0.0)
The second value is a p-value -- for this normality test, a value less than .05 means that normality cannot be assumed. So we'll test inter-rater reliability with Spearman's rank correlation coefficient (rho), which does not assume normality.
print "Spearman: ", s.spearmanr(rater1,rater2)
Spearman: (0.85764169209866958, 0.0)
A rho of .86 indicates reasonably strong agreement between the raters. We can apply a simple transformation to average the ratings (see below).
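To see why Spearman's rho suits non-normal data, here is a toy example (unrelated to our dataset): rho depends only on the ranks of the values, so any monotonic relationship scores a perfect 1 even when the raw values are far from normally distributed.

```python
from scipy.stats import spearmanr, pearsonr

# Toy data: y is a monotonic but non-linear function of x
x = [1, 2, 3, 4, 5]
y = [1, 2, 4, 8, 16]

rho, _ = spearmanr(x, y)
r, _ = pearsonr(x, y)
print(rho)  # ranks agree perfectly, so rho = 1.0
print(r)    # Pearson's r is below 1 because the relation is non-linear
```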
Before continuing on, I'm going to reformat this data into a pandas dataframe. I'll do this by importing the pandas library, reading in the data again, and selecting out some important columns. Again, we'll print out the first few rows to make sure it looks sensible.
import pandas
df = pandas.read_csv("./data/new_crowdstorming.csv")
keys = ['playerShort','refNum','games','goals','yellowCards','redCards','position','meanIAT','meanExp', 'rater1', 'rater2','club','leagueCountry']
df = df[keys]
print df[:3]
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
2    abdon-prats       3      1      0            1         0

               position   meanIAT   meanExp  rater1  rater2             club  \
0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza
1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC
2                   NaN  0.369894  0.588297     NaN     NaN     RCD Mallorca

  leagueCountry
0         Spain
1        France
2         Spain
Looks good!
As mentioned above, we will average the two ratings:
# Drop NA ratings and make an average
df = df.dropna(subset=['rater1','rater2'])
df['rating'] = (df['rater1'] + df['rater2']) / 2
print df[:3]
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
5   aaron-hughes       4      1      0            0         0

               position   meanIAT   meanExp  rater1  rater2             club  \
0  Attacking Midfielder  0.326391  0.396000    0.25    0.50    Real Zaragoza
1          Right Winger  0.203375 -0.204082    0.75    0.75  Montpellier HSC
5           Center Back  0.325185  0.538462    0.25    0.00        Fulham FC

  leagueCountry  rating
0         Spain   0.375
1        France   0.750
5       England   0.125
In addition to the independent variable of interest, there are a number of potential moderating variables, including the player's position. To reduce position to a manageable number of categories, we first look at the goal-scoring rate (goals per game) for each position:
positions = df.groupby(['position'])
positions['goals'].mean() / positions['games'].mean()
position
Attacking Midfielder    0.162301
Center Back             0.051606
Center Forward          0.319852
Center Midfielder       0.122160
Defensive Midfielder    0.061550
Goalkeeper              0.000303
Left Fullback           0.038574
Left Midfielder         0.126883
Left Winger             0.231471
Right Fullback          0.034196
Right Midfielder        0.119648
Right Winger            0.245183
dtype: float64
From this, we can create a smaller number of positions by binning the goal-scoring rates into the bins: 0-.01; .01-.1; .1-.2; .2-.3; .3-.4. This creates 5 categories for this variable: goalkeeper (0.000307); defense, which includes defensive midfielder (0.060537), center back (0.051464), left fullback (0.037560), and right fullback (0.033997); midfield, which includes center midfielder (0.119374), left midfielder (0.125318), right midfielder (0.120007), and attacking midfielder (0.159464); offense, which includes left winger (0.228310) and right winger (0.239445); and center forward (0.316328). If there was no position data for a dyad, it was dropped. This new variable was called 'positionGroup':
before_drop = len(df)
df = df.dropna(subset=['position'])
print "rows dropped: ", before_drop - len(df)
def f(x):
    positionDict = {"Attacking Midfielder" : "Midfield", "Center Back" : "Defense",
                    "Center Forward" : "Offense", "Center Midfielder" : "Midfield",
                    "Defensive Midfielder" : "Defense", "Goalkeeper" : "Goalkeeper",
                    "Left Fullback" : "Defense", "Left Midfielder" : "Midfield",
                    "Left Winger" : "Offense", "Right Fullback" : "Defense",
                    "Right Midfielder" : "Midfield", "Right Winger" : "Offence"}
    return positionDict[x]
df['positionGroup'] = df['position'].apply(f)
print df[:10]['positionGroup']
rows dropped:  8461
0       Midfield
1        Offence
5        Defense
6        Defense
7        Defense
8     Goalkeeper
9        Defense
10       Defense
11       Offense
12       Defense
Name: positionGroup, dtype: object
We can see here that 8,461 rows were dropped, and that we successfully created the new column.
before_drop = len(df)
df = df.dropna(subset=['meanIAT', 'meanExp'])
print "rows dropped: ", before_drop - len(df)
# Rescale the bias measures so that a one-unit change in the model
# corresponds to a .01 change on the original scale
df['meanIAT'] = df['meanIAT'] * 100
df['meanExp'] = df['meanExp'] * 100
print df[:3]
rows dropped:  146
     playerShort  refNum  games  goals  yellowCards  redCards  \
0  lucas-wilchez       1      1      0            0         0
1     john-utaka       2      1      0            1         0
5   aaron-hughes       4      1      0            0         0

               position    meanIAT    meanExp  rater1  rater2  \
0  Attacking Midfielder  32.639147  39.600000    0.25    0.50
1          Right Winger  20.337472 -20.408163    0.75    0.75
5           Center Back  32.518515  53.846154    0.25    0.00

              club leagueCountry  rating positionGroup
0    Real Zaragoza         Spain   0.375      Midfield
1  Montpellier HSC        France   0.750       Offence
5        Fulham FC       England   0.125       Defense
exposure_array = df['games'].values # Number of games will serve as the exposure variable
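It is worth pausing on what an exposure variable does. A minimal sketch, assuming the standard log-link Poisson formulation (this is illustrative code, not part of the analysis pipeline): with an exposure term, the expected count is the exposure times the exponentiated linear predictor, so the coefficients describe the red-card rate per game rather than the raw count.

```python
import math

# Under the log link with exposure: E[redCards] = games * exp(X . beta)
def expected_red_cards(games, eta):
    # eta stands in for the linear predictor X . beta of a dyad
    return games * math.exp(eta)

# Two dyads with identical covariates but different numbers of games
# differ in expected count exactly by the ratio of games played
ratio = expected_red_cards(10, -4.0) / expected_red_cards(1, -4.0)
print(round(ratio, 6))  # -> 10.0
```

This is why dyads spanning many games can be compared fairly to single-game dyads within one model.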
To answer question 1, we'll look at the coefficient of regression for the variable ratings.
To answer question 2a, we'll look at the coefficient of regression of the interaction term for the variable ratings and the meanIAT.
To answer question 2b, we'll look at the coefficient of regression of the interaction term for the variable ratings and the meanExp.
Okay, let's begin!
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices
# Create + fit poisson model
def test_question(y, X, exposure_array):
    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)
    poisson_res = poisson_mod.fit()
    print poisson_res.summary()
    # print np.exp(poisson_res.params)
This function takes patsy design matrices and fits a Poisson model with them, using exposure_array (defined above) as an exposure variable. We interact each of the other predictors with the variable of interest, rating, in order to estimate the unique effect of rating when the other variables are controlled for:
# Define x and y
y, X = dmatrices('redCards ~ rating + rating*goals + rating*positionGroup + rating*yellowCards + rating*meanIAT + rating*meanExp', data=df, return_type='dataframe')
test_question(y, X, exposure_array)
Optimization terminated successfully.
         Current function value: 0.062406
         Iterations 11
                          Poisson Regression Results
==============================================================================
Dep. Variable:               redCards   No. Observations:               116014
Model:                        Poisson   Df Residuals:                   115996
Method:                           MLE   Df Model:                           17
Date:                Thu, 24 Jul 2014   Pseudo R-squ.:                 0.08945
Time:                        12:18:53   Log-Likelihood:                -7240.0
converged:                       True   LL-Null:                       -7951.2
                                        LLR p-value:                2.312e-292
======================================================================================================
                                         coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------------------------------
Intercept                             -7.5088      0.790     -9.505      0.000        -9.057    -5.960
positionGroup[T.Goalkeeper]            0.0383      0.133      0.288      0.774        -0.223     0.300
positionGroup[T.Midfield]             -0.5870      0.106     -5.558      0.000        -0.794    -0.380
positionGroup[T.Offence]              -0.1680      0.225     -0.746      0.456        -0.610     0.274
positionGroup[T.Offense]              -0.4165      0.126     -3.314      0.001        -0.663    -0.170
rating                                 2.1084      1.525      1.382      0.167        -0.881     5.098
rating:positionGroup[T.Goalkeeper]    -0.9424      0.536     -1.759      0.079        -1.993     0.108
rating:positionGroup[T.Midfield]       0.4554      0.277      1.644      0.100        -0.088     0.998
rating:positionGroup[T.Offence]       -0.1720      0.427     -0.403      0.687        -1.009     0.664
rating:positionGroup[T.Offense]        0.0743      0.259      0.287      0.774        -0.434     0.583
goals                                 -0.0251      0.027     -0.926      0.355        -0.078     0.028
rating:goals                          -0.0098      0.065     -0.149      0.881        -0.138     0.118
yellowCards                            0.0588      0.026      2.249      0.025         0.008     0.110
rating:yellowCards                    -0.1284      0.072     -1.779      0.075        -0.270     0.013
meanIAT                                0.0607      0.026      2.309      0.021         0.009     0.112
rating:meanIAT                        -0.0547      0.051     -1.071      0.284        -0.155     0.045
meanExp                               -0.0001      0.004     -0.037      0.970        -0.007     0.007
rating:meanExp                         0.0039      0.008      0.514      0.607        -0.011     0.019
======================================================================================================
The key results here are in the first column: the regression coefficients. A regression coefficient can be interpreted as: "for a one unit change in the predictor variable, the difference in the logs of expected counts is expected to change by the respective regression coefficient, given the other predictor variables in the model are held constant." (source) In order to better interpret this result, as well as to provide a standardized result that can be compared directly to other researchers' analyses, we'll transform the coefficients into incidence rate ratios. We can do this by exponentiating them using numpy's exp function:
# The same function as above, but displaying different output.
def test_question(y, X, exposure_array):
    poisson_mod = sm.Poisson(y, X, exposure=exposure_array)
    poisson_res = poisson_mod.fit()
    # print poisson_res.summary()
    print np.exp(poisson_res.params)
test_question(y, X, exposure_array)
Optimization terminated successfully.
         Current function value: 0.062406
         Iterations 11
Intercept                             0.000548
positionGroup[T.Goalkeeper]           1.039091
positionGroup[T.Midfield]             0.555981
positionGroup[T.Offence]              0.845336
positionGroup[T.Offense]              0.659383
rating                                8.235204
rating:positionGroup[T.Goalkeeper]    0.389701
rating:positionGroup[T.Midfield]      1.576794
rating:positionGroup[T.Offence]       0.841944
rating:positionGroup[T.Offense]       1.077177
goals                                 0.975218
rating:goals                          0.990290
yellowCards                           1.060577
rating:yellowCards                    0.879504
meanIAT                               1.062622
rating:meanIAT                        0.946804
meanExp                               0.999861
rating:meanExp                        1.003863
dtype: float64
We can interpret these exponentiated coefficients (incidence rate ratios, or IRRs) as the multiplicative change in the rate of the dependent variable (redCards) for a one-unit increase in the predictor variable. Because interaction terms have been included, each main-effect IRR applies when all interacting variables are zero.
This result, 8.235204, can be interpreted as: "for each increase of 1 in the rating variable, the expected rate of red cards is multiplied by about 8.24." Keep in mind, though, that ratings range from 0-1. An alternative interpretation is thus: "a player whose skin was rated darkest is roughly 8.24 times as likely as a player whose skin was rated lightest to receive a red card."
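As a quick arithmetic check (the coefficient is taken from the summary table above), the IRR is just the exponentiated regression coefficient:

```python
import math

# Exponentiating the rating coefficient from the regression summary
# reproduces the reported IRR, up to rounding of the coefficient
coef = 2.1084
print(round(math.exp(coef), 3))  # ~8.235
```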
It is also worth noting the standard error, confidence interval, etc in the original summary:
coef std err z P>|z| [95.0% Conf. Int.]
2.1084 1.525 1.382 0.167 -0.881 5.098
Given the non-significant p-value (0.167) and a confidence interval that spans zero, it is hard to have much confidence in this result.
For this question, we want to look not just at the influence of ratings on redCards but the influence of meanIAT on the influence of ratings on redCards. The interaction coefficient is therefore what we're interested in:
coef std err z P>|z| [95.0% Conf. Int.]
-0.0547 0.051 -1.071 0.284 -0.155 0.045
The IRR is:
rating:meanIAT 0.946804
This IRR is below 1, indicating a slight decrease in the influence of ratings on redCards for each one-unit increase in meanIAT, though the interaction is not significant (p = 0.284).
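The same arithmetic applies here (the interaction coefficient is taken from the summary table above):

```python
import math

# Exponentiating the rating:meanIAT interaction coefficient gives the
# multiplicative change in the rating IRR per one-unit increase in
# meanIAT (i.e. per .01 on the original scale, after the x100 rescale)
coef = -0.0547
print(round(math.exp(coef), 4))  # ~0.9468
```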
For this question, we want to look not just at the influence of ratings on redCards but the influence of meanExp on the influence of ratings on redCards. The interaction coefficient is therefore what we're interested in:
coef std err z P>|z| [95.0% Conf. Int.]
0.0039 0.008 0.514 0.607 -0.011 0.019
The IRR is:
rating:meanExp 1.003863
This IRR is just above 1, indicating a slight increase in the influence of ratings on redCards for each one-unit increase in meanExp, though the interaction is not significant (p = 0.607).
In this analysis, we asked two questions: whether soccer referees were more likely to give red cards to dark skin toned players than light skin toned players, and whether implicit or explicit skin-tone prejudice in the country of origin of the referee moderated this preference. To test these questions, we used a dataset consisting of dyads of players and referees, the number of red cards given by referees to the players, ratings of the players' skin tone, and several other variables of interest. Given the heavily right-skewed count distribution of the red cards, we chose Poisson regression as our model. Number of games was used as an exposure variable, with goals, position, yellow cards, and the country-of-origin bias measures meanIAT (implicit) and meanExp (explicit) included in the model. Although the data were hierarchical, with players embedded in teams embedded in leagues, the information provided was deemed inadequate for a hierarchical model. All dyads missing any of this data were excluded, leaving 116,014 total observations.

We found an incidence rate ratio of 8.235204 for ratings, suggesting that a player whose skin was rated as darkest was roughly 8.24 times as likely as a player whose skin was rated lightest to receive a red card. However, this difference was not significant, with a large standard error (1.525) and wide confidence interval (-0.881, 5.098) around the regression coefficient (2.1084). Our analysis found a slight negative impact of meanIAT on the influence of skin tone, in the form of an IRR of 0.946804; again, it is difficult to have confidence in the result (coeff: -0.0547; std error: 0.051; p: 0.284; CI: -0.155, 0.045). Finally, we found a slight positive impact of meanExp on the influence of skin tone, in the form of an IRR of 1.003863 (coeff: 0.0039; std error: 0.008; p: 0.607; CI: -0.011, 0.019). These results do not allow us to infer an influence of skin tone on the number of red cards given.
This does not allow us to say that an effect does not exist: a different methodology, or a different dataset (perhaps one with IAT or Exp measures for individual referees, or more comprehensive team and country data allowing hierarchical modeling) might find better evidence.