
About author

My name is Dmitry Smirnov; I'm currently a PhD student in the Brain and Mind Laboratory, NBE Department, Aalto University, Finland. I mostly do cognitive neuroscience with fMRI, using various classification and regression models. You can find more about me on my LinkedIn and my personal page.

About project

I've been reading about and playing around with Bayesian methods for quite a while now, and in one conversation I was asked whether these methods could help answer a research question that had been addressed with frequentist statistics. I was eager to take the challenge, partly to check how much I had really learned about this methodology.

The data come from a previously published study of the influence of culture on the incidence of extreme judgments in moral dilemmas. Among other results, the original study showed that, in comparison with English-speaking subjects, Russian subjects avoided extreme judgments (Arutyunova et al., 2013). I was provided with tables for the Russian- and English-speaking samples, and given permission to run some Bayesian analysis on them.

Reference: Arutyunova KR, Alexandrov YuI, Znakov VV, Hauser MD (2013) Moral Judgments in Russian Culture: Universality and Cultural Specificity. Journal of Cognition and Culture 13: 255–285.

Importing packages, loading data and preprocessing

In this notebook, I will use a hierarchical model. To give a quick overview: in a hierarchical model, I take into consideration not only each moral dilemma separately, but also allow the data from the various dilemmas to influence the group-level parameter estimates, and the group-level estimates in turn influence the parameters of the individual dilemmas. In other words, each dilemma is an instance of some common group process of decision making. So I will model each dilemma, but tie the parameters of the dilemma-wise models to group-level parameters.

I start by importing the packages that I am going to use later in the code, and load the data for the Russian and English samples. Technically, dataRU and dataENG are just matrices of shape Participant by Dilemma. I also set up matplotlib so that it renders plots inline in the notebook, and set seaborn's default grey grid.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pymc3
from sklearn.cross_validation import KFold
import scipy.stats as sp

get_ipython().run_line_magic('matplotlib', 'inline')
sns.set_style('darkgrid')

dataRU = np.array(pd.read_csv('http://www.becs.aalto.fi/~smirnod1/RusKarina.csv', header=None))
dataENG = np.array(pd.read_csv('http://www.becs.aalto.fi/~smirnod1/EngKarina.csv', header=None))
print(dataRU.shape)
print(dataENG.shape)
```

Now I transform the data so that extreme answers (7 and 1) are coded as 1, and all other answers as 0. The final matrices will be data1, which contains the encoded (1 = extreme, 0 = moderate) responses of 327 individuals over 30 moral dilemmas from the Russian-speaking sample, and data2, which contains the encoded responses of 330 individuals over 30 moral dilemmas from the English-speaking sample.

```python
data1 = np.zeros(np.shape(dataRU))
data2 = np.zeros(np.shape(dataENG))
data1[dataRU == 7] = 1
data1[dataRU == 1] = 1
data2[dataENG == 7] = 1
data2[dataENG == 1] = 1
```

For some of the later analyses I will need a few parameters, so let's start with them. First, for this model I need to know the number of stories used; S is just an auxiliary variable for that. I also need to keep the indices of the different stories; s_idx is for that. For the Gamma priors later, we need shape and rate parameters; I initialize them here, but see more explanation below.

```python
S = data1.shape[1]             # number of stories (dilemmas)
s_idx = [i for i in range(S)]  # story indices
s = 0.01                       # shape for Gamma
r = 0.01                       # rate for Gamma
nsamp = 100000                 # how many samples to get from MCMC
njobs = 4                      # how many jobs to run in parallel
```
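As a quick sanity check of the recoding above, here is a hedged sketch (with a toy matrix standing in for the real data) showing that the two masked assignments are equivalent to a single np.isin call, plus the per-story proportion of extreme answers:

```python
import numpy as np

toy = np.array([[1, 4, 7],
                [2, 7, 5]])                 # toy stand-in for dataRU
coded = np.isin(toy, [1, 7]).astype(float)  # same 0/1 coding as the masked assignments
print(coded)                                # [[1. 0. 1.] [0. 1. 0.]]
print(coded.mean(axis=0))                   # per-story proportion of extreme answers
```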

Model specification

In Bayesian analysis, we need to state prior beliefs about the way our model is parametrized. Here I used the approach taken by John Kruschke (Kruschke, 2014: http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/).
$$\kappa_g \sim \mathcal{Gamma}(\alpha=0.01,\ \beta=0.01)$$
$$\omega_g \sim \mathcal{Beta}(1,1)$$
$$\theta_{g,s} \sim \mathcal{Beta}\big(\omega_g(\kappa_g-2)+1,\ (1-\omega_g)(\kappa_g-2)+1\big)$$
$$y_{g,s} \sim \mathcal{Ber}(\theta_{g,s})$$
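To see why this parameterization pins the mode of the story-level Beta at $\omega_g$, recall that the mode of a $\mathcal{Beta}(\alpha,\beta)$ with $\alpha,\beta>1$ is $(\alpha-1)/(\alpha+\beta-2)$; substituting the expressions above gives

$$\frac{\alpha-1}{\alpha+\beta-2} = \frac{\omega_g(\kappa_g-2)}{\omega_g(\kappa_g-2)+(1-\omega_g)(\kappa_g-2)} = \omega_g.$$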
In this case, for each group I have a Bernoulli likelihood function with parameter $\theta_{g,s}$. $\theta_{g,s}$ comes from a Beta distribution which is parametrized with mode $\omega$ and concentration $\kappa$ hyperparameters. $\kappa$ gets a Gamma prior with shape = 0.01 and rate = 0.01, which is a vague prior. $\omega$ gets a uniform Beta(1,1) prior.

So, to give another overview: for each group I now have a group mode $\omega$ and a group concentration $\kappa$, which inform the story-specific $\theta_{g,s}$ and are in turn informed by them. In an applied sense, that means I will get a shrinkage effect: if a $\theta_{g,s}$ is far away from the group $\omega$, it will be pulled towards the group estimate.

For a more intuitive picture, we can think of our data as coming from two factories, each producing 30 coins. Each coin was flipped about 300 times (more precisely, 327 times in the case of the first factory and 330 in the case of the second). We assume that each factory produces coins with bias centered around some value, and that the coins coming from a single factory have biases close to that value.

If we had some metric covariates, like age or size of social circle, it would be better to use logistic regression, but here I have none of those, so I keep the model simple.

In the code below, I use PyMC3 to run a Markov chain Monte Carlo analysis with an adaptive Metropolis-Hastings sampler. I package the model into a function, because later on I will have to run it repeatedly during posterior predictive checking.

```python
def TrainHierarchicalModel(data1, data2, s_idx, njobs=2, s=0.01, r=0.01, nsamp=10000):
    with pymc3.Model() as model:
        # define the group-level hyperparameters
        k_minus_two1 = pymc3.Gamma('k_minus_two1', alpha=s, beta=r)
        k1 = pymc3.Deterministic('k1', k_minus_two1 + 2)
        w1 = pymc3.Beta('w1', alpha=1, beta=1)
        k_minus_two2 = pymc3.Gamma('k_minus_two2', alpha=s, beta=r)
        k2 = pymc3.Deterministic('k2', k_minus_two2 + 2)
        w2 = pymc3.Beta('w2', alpha=1, beta=1)
        # story-specific priors, tied to the group-level mode and concentration
        theta1 = pymc3.Beta('theta1', alpha=(w1 * k_minus_two1) + 1,
                            beta=((1 - w1) * k_minus_two1) + 1, shape=S)
        theta2 = pymc3.Beta('theta2', alpha=(w2 * k_minus_two2) + 1,
                            beta=((1 - w2) * k_minus_two2) + 1, shape=S)
        # define the likelihoods
        y1 = pymc3.Bernoulli('y1', p=theta1[s_idx], observed=data1)
        y2 = pymc3.Bernoulli('y2', p=theta2[s_idx], observed=data2)
        # derived quantities of interest
        wdiff = pymc3.Deterministic('wdiff', w1 - w2)
        thetadiff = pymc3.Deterministic('thetadiff', theta1 - theta2)
        oddsratio = pymc3.Deterministic('oddsratio',
                                        (theta2 / (1 - theta2)) / (theta1 / (1 - theta1)))
        step = pymc3.Metropolis()  # instantiate the MCMC sampling algorithm
        trace = pymc3.sample(int(nsamp / njobs), step, njobs=njobs, progressbar=False)
    return model, trace
```
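How vague is Gamma(0.01, 0.01)? As a hedged aside (not part of the original analysis), a few prior draws via scipy show how widely this prior spreads. Note that scipy's gamma uses shape and scale = 1/rate, matching pymc3.Gamma(alpha=s, beta=r):

```python
import numpy as np
import scipy.stats as sp

draws = sp.gamma.rvs(a=0.01, scale=1 / 0.01, size=100000, random_state=0)
print(np.percentile(draws, [50, 97.5]))  # median essentially 0, tail reaching large values
print(draws.mean())                      # mean = shape/rate = 1
```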

Running the model, inspecting the traces

First, let's run the model on the whole dataset. I will plot the posterior (the distribution of parameter values given the data) and get some idea about the difference between our groups.

```python
_, trace = TrainHierarchicalModel(data1, data2, s_idx=s_idx, nsamp=nsamp, njobs=njobs)
```

The resulting posterior distribution represents what we should believe, given the data, regarding the value of a parameter. This is not a distribution of data, but a distribution of parameters given the data. In the frequentist approach one treats the data as random and the parameters as fixed; in the Bayesian approach it goes the other way (quite logically): we have the data in our hands, so we treat them as fixed, and it's the underlying parameters of the process that generated the data that we are unsure about.

We can inspect the trace visually to make sure that the chains converge and there are no strange artifacts. I have seen a nice description of a good traceplot: a "fat, hairy caterpillar" that doesn't bend (Lunn et al., 2012; Sorensen and Vasishth, 2015).

There are multiple statistics that can be used for diagnostics; see Patrick Lam's slides for examples. I used the Gelman-Rubin statistic to check that we achieved good mixing and convergence (it is around 1 for all parameters).

```python
# Remove the semicolon to see the numbers
pymc3.diagnostics.gelman_rubin(trace[5000:]);
pymc3.traceplot(trace[5000:]);
```

To get a better idea about the actual results, I take out the traces and drop the first 5000 samples as warm-up. I use the combine=True parameter, which concatenates the multiple chains. I also compute highest posterior density intervals (HPDs), which contain the 95% most credible values of a parameter. These can be used to guide decisions: for example, if zero falls into a parameter's HPD, then I have to conclude that a zero value of this parameter is plausible.

I extract the parameter values for $\theta_{g,s}$, $\omega_g$, thetadiff and wdiff. I take the mean over all stories here, to make the results easier to display. I like my histograms to share the same x-axis scale, so that I can compare them immediately, so I get the range for the histograms from the traces.

```python
# Group RU
theta1_sample = np.mean(trace[5000:].get_values('theta1', combine=True), axis=1)
w1_sample = trace[5000:].get_values('w1', combine=True)
# Group ENG
theta2_sample = np.mean(trace[5000:].get_values('theta2', combine=True), axis=1)
w2_sample = trace[5000:].get_values('w2', combine=True)
# Difference of groups
thetadiff_sample = np.mean(trace[5000:].get_values('thetadiff', combine=True), axis=1)
wdiff_sample = trace[5000:].get_values('wdiff', combine=True)
# HPDs
hpd_theta1_sample = pymc3.stats.hpd(np.array(theta1_sample))
hpd_w1_sample = pymc3.stats.hpd(np.array(w1_sample))
hpd_theta2_sample = pymc3.stats.hpd(np.array(theta2_sample))
hpd_w2_sample = pymc3.stats.hpd(np.array(w2_sample))
hpd_thetadiff = pymc3.stats.hpd(np.array(thetadiff_sample))
hpd_wdiff = pymc3.stats.hpd(np.array(wdiff_sample))
# Histogram ranges
histrange = [np.min(np.concatenate((theta1_sample, theta2_sample))),
             np.max(np.concatenate((theta1_sample, theta2_sample)))]
histrange_w = [np.min(np.concatenate((w1_sample, w2_sample))),
               np.max(np.concatenate((w1_sample, w2_sample)))]
```
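As a hedged aside: for a roughly symmetric, unimodal posterior, the 95% HPD should be close to the central 95% percentile interval, so comparing the two is a cheap sanity check (this reuses wdiff_sample from above):

```python
print(pymc3.stats.hpd(np.array(wdiff_sample)))   # highest-density interval
print(np.percentile(wdiff_sample, [2.5, 97.5]))  # central percentile interval
```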

Plot the histograms, inspect group differences

Here I plot histograms of the posterior distributions: $\theta_1$ averaged over stories (the proportion of extreme judgments of Russian individuals), $\theta_2$ averaged over stories (the proportion of extreme judgments of English-speaking individuals), and thetadiff, the difference in these proportions. I also plot histograms of $\omega_g$, which represents the group-level proportion of extreme judgments.

```python
fig = plt.figure(figsize=(16, 8))

# Theta1
ax1 = plt.subplot2grid((3, 2), (0, 0))
ax1.set_autoscaley_on(True)
bins = plt.hist(theta1_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $\theta_1$", color="#A60628", normed=True)
plt.vlines(hpd_theta1_sample[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='red',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_theta1_sample[0], hpd_theta1_sample[1], np.mean(theta1_sample)))
plt.vlines(hpd_theta1_sample[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='red')
plt.title(r"""Posterior distributions of the variables $\theta_1 (RU),\;\theta_2 (ENG),\;Difference\ of\ thetas$""")
plt.legend(loc="upper right")
plt.xlim(histrange)

# Theta2
ax2 = plt.subplot2grid((3, 2), (1, 0))
ax2.set_autoscaley_on(True)
bins = plt.hist(theta2_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $\theta_2$", color="#7A68A6", normed=True)
plt.vlines(hpd_theta2_sample[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='purple',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_theta2_sample[0], hpd_theta2_sample[1], np.mean(theta2_sample)))
plt.vlines(hpd_theta2_sample[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='purple')
plt.legend(loc="upper left")
plt.xlim(histrange)

# Thetadiff
ax3 = plt.subplot2grid((3, 2), (2, 0))
bins = plt.hist(thetadiff_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $thetadiff$", color="#467821")
plt.vlines(hpd_thetadiff[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='green',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_thetadiff[0], hpd_thetadiff[1], np.mean(thetadiff_sample)))
plt.vlines(hpd_thetadiff[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='green')
plt.vlines(0, 0, max(bins[0]) + 1, linestyle="-", linewidth=2, color='green')
plt.legend(loc="upper right")
plt.xlim([min(thetadiff_sample), 0.1])

# Omega1
ax4 = plt.subplot2grid((3, 2), (0, 1))
ax4.set_autoscaley_on(True)
bins = plt.hist(w1_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $\omega_1$", color="#A60628", normed=True)
plt.vlines(hpd_w1_sample[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='red',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_w1_sample[0], hpd_w1_sample[1], np.mean(w1_sample)))
plt.vlines(hpd_w1_sample[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='red')
plt.title(r"""Posterior distributions of the variables $\omega_1 (RU),\;\omega_2 (ENG),\;Difference\ of\ omegas$""")
plt.legend(loc="upper left")
plt.xlim(histrange_w)

# Omega2
ax5 = plt.subplot2grid((3, 2), (1, 1))
ax5.set_autoscaley_on(True)
bins = plt.hist(w2_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $\omega_2$", color="#7A68A6", normed=True)
plt.vlines(hpd_w2_sample[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='purple',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_w2_sample[0], hpd_w2_sample[1], np.mean(w2_sample)))
plt.vlines(hpd_w2_sample[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='purple')
plt.legend(loc="upper left")
plt.xlim(histrange_w)

# Omegadiff
ax6 = plt.subplot2grid((3, 2), (2, 1))
bins = plt.hist(wdiff_sample, histtype='stepfilled', bins=30, alpha=0.85,
                label=r"posterior of $wdiff$", color="#467821")
plt.vlines(hpd_wdiff[0], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='green',
           label='HDI ({0:.3f} – {1:.3f}), mean={2:.2f}'.format(
               hpd_wdiff[0], hpd_wdiff[1], np.mean(wdiff_sample)))
plt.vlines(hpd_wdiff[1], 0, max(bins[0]) + 1, linestyle="--", linewidth=1, color='green')
plt.vlines(0, 0, max(bins[0]) + 1, linestyle="-", linewidth=2, color='green')
plt.legend(loc="upper left")
plt.xlim([min(wdiff_sample), 0.01])
```

Here we are: I can already see that the thetas, averaged over stories, are clearly different in the two groups. If we look at the left side of the plot, the green histogram shows the difference, and the bold green vertical line marks zero, which is far away from the HDI. If we look at the omega hyperparameters and the difference of omegas, we also see the difference, and the HDI doesn't overlap with zero. Let's see how many of the samples we took are below zero.

```python
np.mean(wdiff_sample < 0)
```

Around 99 percent; that looks quite different to me. Let's also look at the odds ratio, which is loosely an equivalent of effect size and tells us how likely one group is to give an extreme answer compared to the other.

```python
oddsratio = np.mean(trace[5000:].get_values('oddsratio', combine=True), axis=1)

fig = plt.figure(figsize=(8, 4))
plt.hist(oddsratio, label="posterior of odds ratio,\n mean={0:.2f}".format(np.mean(oddsratio)),
         normed=True)
plt.title(r"""Posterior distribution of the story-averaged odds ratio between RU and ENG""")
plt.legend(loc='upper right');
```

Let's do the same thing with the omegas.

```python
w_oddsratio = (w2_sample / (1 - w2_sample)) / (w1_sample / (1 - w1_sample))

fig = plt.figure(figsize=(8, 4))
plt.hist(w_oddsratio, label="posterior of odds ratio,\n mean={0:.2f}".format(np.mean(w_oddsratio)),
         normed=True)
plt.title(r"""Posterior distribution of the group-wise odds ratio between RU and ENG""")
plt.legend(loc='upper right');
```

From this exercise, I can conclude that there is a difference. We see that the odds that the English-speaking group will give an extreme judgment in a moral dilemma are about 1.3 times those of the Russian group. But let's take another step and run a few checks of our model.
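To make the odds-ratio interpretation concrete, here is a small worked example with hypothetical round numbers (not the actual posterior means):

```python
theta_ru, theta_eng = 0.28, 0.33        # hypothetical proportions of extreme answers
odds_ru = theta_ru / (1 - theta_ru)     # odds of an extreme answer, RU
odds_eng = theta_eng / (1 - theta_eng)  # odds of an extreme answer, ENG
print(odds_eng / odds_ru)               # ~1.27: ENG odds of an extreme answer ~1.3x the RU odds
```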

Posterior predictive check

In principle, posterior predictive checking goes along the following steps (one small reference: http://www.people.fas.harvard.edu/~plam/teaching/methods/modelcheck/modelcheck_print.pdf); a minimal sketch of these steps follows the list:

1. Draw multiple candidate parameter values from the posterior.
2. Simulate datasets using these values.
3. Choose a statistic that reflects some discrepancy between the model and the real data.
4. Calculate this statistic over the simulated and the real data.
5. Compare, and decide whether the model requires adjustment.
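Here is a minimal, self-contained sketch of these steps for a single Bernoulli parameter, with made-up stand-ins for the posterior draws and the observed data (all values here are hypothetical):

```python
import numpy as np

rng = np.random.RandomState(0)
post_theta = rng.beta(30, 70, size=500)  # stand-in for 500 posterior draws of theta
y_obs = rng.binomial(1, 0.3, size=300)   # stand-in for observed 0/1 data

def stat(y):
    return y.mean()                      # discrepancy statistic: proportion of 1s

# simulate one dataset per posterior draw and compute the statistic on each
sim_stats = np.array([stat(rng.binomial(1, t, size=y_obs.size)) for t in post_theta])
print(stat(y_obs), np.percentile(sim_stats, [2.5, 97.5]))  # real value vs simulated range
```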
Here I add cross-validation to this, so that all of these steps are embedded in the following:
1. Split the dataset into training and validation sets.
2. Simulate datasets from candidate values drawn from the model, trained on the training set.
3. Select a statistic and calculate it over the simulated data and the real data from the validation set.
4. Check how this statistic differs between the real and simulated datasets.
First, I need to prepare the partitioning, and also decide how many simulations to run. I decided on 5 folds for cross-validation (I prepare the indices here) and on simulating 500 datasets for the posterior predictive checks.

```python
nsim = 500   # how many simulated datasets to generate
n_folds = 5  # how many folds for cross-validation

# Use the KFold partitioner from sklearn to create cross-validation indices
cv1 = list(KFold(data1.shape[0], n_folds=n_folds))
cv2 = list(KFold(data2.shape[0], n_folds=n_folds))
```

I also initialize the containers for the outcomes.

```python
prop_1 = np.empty((S, n_folds), dtype=float)      # proportion of extreme answers per story, Russian group
prop_2 = np.empty((S, n_folds), dtype=float)      # proportion of extreme answers per story, English group
prop_yrep1 = np.empty((S, n_folds), dtype=float)  # proportions for the Russian group in simulated datasets
prop_yrep2 = np.empty((S, n_folds), dtype=float)  # proportions for the English group in simulated datasets
```

Now, there is going to be a lot of code, but let's go through the basic idea (a quick illustration of the KFold indexing follows the list):
1. For each fold, I split the data into training and testing samples.
2. I train the model using the training set.
3. I simulate 500 datasets using candidate values from the posterior.
4. I calculate the proportion of extreme answers for each story.
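Before the loop, a quick hedged illustration (with a toy size of 10 rows) of what the KFold objects created above contain: each element of the list is a (train_indices, test_indices) pair, which is why the loop below indexes cv1[fold][0] and cv1[fold][1]:

```python
from sklearn.cross_validation import KFold  # old (pre-0.18) sklearn API, as used above

cv_toy = list(KFold(10, n_folds=5))
train_idx, test_idx = cv_toy[0]
print(train_idx)  # [2 3 4 5 6 7 8 9] - rows used for training in fold 0
print(test_idx)   # [0 1]             - held-out rows in fold 0
```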
Here I borrowed the run_ppc function from pymc3, which generates the simulated datasets.

```python
from collections import defaultdict

def run_ppc(trace, samples=100, model=None):
    """Generate posterior predictive samples from a model given a trace."""
    if model is None:
        model = pymc3.modelcontext(model)
    ppc = defaultdict(list)
    for idx in np.random.randint(0, len(trace), samples):
        param = trace[idx]
        for obs in model.observed_RVs:
            ppc[obs.name].append(obs.distribution.random(point=param))
    return ppc
```

```python
for fold in range(n_folds):
    # The datasets are of unequal size, so we can't use a single cv object
    data1_train = data1[cv1[fold][0], :]
    data1_test = data1[cv1[fold][1], :]
    data2_train = data2[cv2[fold][0], :]
    data2_test = data2[cv2[fold][1], :]
    # Keep track of the held-out sample sizes
    n1 = data1_test.shape[0]
    n2 = data2_test.shape[0]
    # Run the chain
    model, trace = TrainHierarchicalModel(data1_train, data2_train, s_idx=s_idx,
                                          nsamp=10000, njobs=1)
    # Get simulated datasets
    ppc = run_ppc(trace, samples=nsim, model=model)
    yrep1 = np.asarray(ppc['y1'])  # shape: (nsim, participants, stories)
    yrep2 = np.asarray(ppc['y2'])
    # Calculate the statistics (note: the story axis is the last one)
    prop_yrep1[:, fold] = np.array([(yrep1[:, :, s] == 1).mean() for s in range(S)])
    prop_yrep2[:, fold] = np.array([(yrep2[:, :, s] == 1).mean() for s in range(S)])
    prop_1[:, fold] = np.array([(data1_test[:, s] == 1).mean() for s in range(S)])
    prop_2[:, fold] = np.array([(data2_test[:, s] == 1).mean() for s in range(S)])
```

Now that we have all the necessary pieces calculated, let's look at the difference between the real and simulated data. First, let's average over the cross-validation folds.

```python
prop_1 = np.mean(prop_1, 1)
prop_2 = np.mean(prop_2, 1)
prop_yrep1 = np.mean(prop_yrep1, 1)
prop_yrep2 = np.mean(prop_yrep2, 1)
```

I plot the story-wise proportions from the real data for the English and Russian groups, and on top of that the proportions from the simulated datasets.

```python
fig = plt.figure(figsize=(16, 8))
plt.plot(prop_yrep1, '#ff6969', alpha=1, lw=3., linestyle='--',
         label='Simulated proportion of extreme answers, RU')
plt.plot(prop_1, '#9a0000', alpha=1, lw=3.,
         label='Proportion of extreme answers, RU')
plt.plot(prop_yrep2, '#94b2fe', alpha=1, lw=3., linestyle='--',
         label='Simulated proportion of extreme answers, ENG')
plt.plot(prop_2, '#4455dd', alpha=1, lw=3.,
         label='Proportion of extreme answers, ENG')
plt.legend(loc="upper right");
```

On the x-axis we have the different stories (there were 30). The bold blue line represents the proportion of extreme answers for the English sample, and the bold red line the proportion of extreme answers for the Russian sample. The dashed lines correspond to the proportions from the simulated data. The simulated lines follow the real ones quite well, so no large discrepancy is apparent.

All in all, I don't think there is any worrisome discrepancy between our model and the data. It would be good to have some alternative model as well, to show some model comparison, but for now this is it.
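As a small, hedged follow-up to the visual check (not part of the original analysis): the same comparison can be summarized numerically, for example as the mean absolute deviation between the real and simulated story-wise proportions computed above:

```python
print(np.abs(prop_1 - prop_yrep1).mean())  # average per-story discrepancy, RU
print(np.abs(prop_2 - prop_yrep2).mean())  # average per-story discrepancy, ENG
```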