#!/usr/bin/env python
# coding: utf-8

# ## Linear Mixed Models
# 
# 
# 
# Consider a study that compares the effect of a new drug to a placebo.

# |        |   | Drug      |   |   |   | Placebo   |   |    |
# |--------|---|-----------|---|---|---|-----------|---|----|
# |        |   | Age Group |   |   |   | Age Group |   |    |
# |--------|---|-----------|---|---|---|-----------|---|----|
# |        | A | B         | C |   | A | B         | C |    |
# | Gender |   |           |   |   |   |           |   |    |
# | Male   |   |           |   |   |   |           |   |    |
# | Female |   |           |   |   |   |           |   |    |
# |        |   |           |   |   |   |           |   |    |
# 
# Table 1: Study of effect if a drug over placebo for different age groups for males and females
# 
# 
# |        |   | School X  |   |   |   | School Y  |   |    |
# |--------|---|-----------|---|---|---|-----------|---|----|
# |        |   | Section   |   |   |   | Section   |   |    |
# |--------|---|-----------|---|---|---|-----------|---|----|
# |        | A | B         | C |   | A | B         | C |    |
# | Gender |   |           |   |   |   |           |   |    |
# | Male   |   |           |   |   |   |           |   |    |
# | Female |   |           |   |   |   |           |   |    |
# |        |   |           |   |   |   |           |   |    |
# 
# Table 2: Study of marks scored for different schools over 

# The definition for fixed effects and random-effects is a bit [hazy](http://andrewgelman.com/2005/01/25/why_i_dont_use/), but here we refer to the fixed effects arising due to a factor 
# 
# ### Terminology
# -----------------------------
# **Factor**: Classifications identifying the source of datum. So 'Drug', 'Placebo' and 'Gender' are the three factors in the above study.
# 
# **Levels**: Individual classes of the factors. So 'Male' and 'Female' are levels of 'Gender' factor and 'A', 'B', 'C' are the levels of 'gender' factor. 'Drug' and 'Placebo' are levels of 'Type of drug administered'
# 
# **Crossed Factors**: All levels of 'Age' factor appear for both levels of the 'Type of drug' factorand hence is a 'crossed' factor.
# 
# **Nested Factors**: In the table summarizing school wise grades, the section A under school X is comnpletely independent of section A under school Y. In fact, it does not matter what particular section we refer to, as long as we have 3 sections for both. Since the levels of the 'section' factor appear only once, 'section' is a nested factor within the 'type of school'.
# 
# 
# **Effects**: Extent to which different levels of a factor affect the variable of interest
# 
# In order to define our meaning of fixed and random effects,, we refer to fixed effects as the effects that can be attributed to a finite set of levels of a factor. The random effects arise due to the levels of factor being infinite. So if the performance of two schools is being considered and we also record the student ids along with their gender and scores, the gender in itself will give rise to a fixed effect for there are just two levels: male and female. But, the student ids being levels of the student identifier factor can have infinite levels, because the student can be sampled randomly.
# 
# **Fixed Effects** are simply the effects arising due to the levels of a fixed covariate, a factor whose all levels are of interest such that they represent specific conditions that are used to define specific contrasts
# 
# **Random Effects** arise out of **random factors**, a factor whose levels were randomly sampled from a population of levels. Thus, all possible levels of **random factors** cannot be part of the study and hence only an inference can be made about the entire population of levels given a random sample of the levels
# 

# ### Matrix Specification
# 
# A linear mixed model is described by the following model:
# 
# $$Y= \underbrace{X\beta}_\text{Fixed} + \underbrace{Zu + \epsilon}_\text{Random}$$
# 
# where 
# 
# $n$ = Number of observations
# 
# $p$ = Number of fixed effect covariates
# 
# $q$ = Number of random effect covariates
# 
# $Y$ = response column vector $(n \times 1)$
# 
# $X$ = $n \times p$ model matrix representing the known values of the $p$ fixed-effect covariates 
#       
# $\beta$ = $p \times 1$ fixed-effect parameter column vector
# 
# $Z$ = $n \times q$ model matrix for representing the known values of the $q$ random-effect covariates
# 
# $u$ = $q \times 1$ fixed-effect parameter column vector     
# 
# $\epsilon$ = Residual column vector
# 
# 
# By definition the random-effects are random variables and the $q$ random-effects parameter inthe $u$ column vector follow a multivarite normal distribution:
# 
# $$u \sim N(0,D)$$
# 
# where $D$ is a $q \times q$ variance-covariance matrix
# 
# 
# $\epsilon$ column vector constitutes the residuals associated with the $n$ observations. 
# 
# Residuals and random-effects parameter vector $u$ are assumed to be independent and $\epsilon$ is assumed to be normally distributed:
# 
# $$\epsilon \sim N(0,R)$$ 
# 
# where $R$ is a $n\times n$ variance-covariance matrix
# 
# 
# Our aim is to estimate the column vectors $\beta$ , $u$ (by estimating for the $D$ matrix) and $\epsilon$(by estimating the $R$ matrix)
# 
# It is possible to represent the above model in a marginailised form:
# 
# $$Y= \underbrace{X\beta}_\text{Fixed} + \underbrace{\epsilon^*}_\text{Random}$$
#  
# where $\epsilon^* \sim N(0, ZDZ'+R)$ which helps to relax the constraints on both $D$ and $R$ being positive definite to $ZDZ'+R$ being positive definite.
# 
# and hence $Y \sim N(X\beta, ZDZ'+R)$
# 
# 
# 

# In[13]:


import numpy as np
import statsmodels.api as sm
import pandas
import statsmodels.formula.api as smf


# In[24]:


data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Penicillin.csv', index_col=0)


# In[14]:


print data.describe(include='all')


# In[16]:


print (data.columns.values)


# In[21]:


print data['sample'].describe()
print data['plate'].describe()


# We take the Penicillin growth example from the 'lme4' manual. Diameter refers to the diameter of growth inhibition in a culturing experiment. There are 24 plates and 6 types of samples, each plate e
# The diameter varies with the plate and sample type.
# Clearly plate and sample are categorical variables with 24 and 6 levels each. The type of plate here is assumed to be sampled randomly(we are not interested in studying what effect a particular plate say A has, but are more interested in studying the effect of plate ). This is an example of crossed random effect. Crossed because each sample is used in each plate. 
# 
# 

# In[39]:


table = data.pivot_table(['plate'],  rows='sample', columns='plate', aggfunc=len)
table


# In[1]:


get_ipython().run_line_magic('load_ext', 'rpy2.ipython')
get_ipython().run_line_magic('R', 'library(lme4)')


# In[7]:


get_ipython().run_line_magic('R', 'str(Penicillin)')


# 

# In[104]:


## Crossed Random Effects
data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Penicillin.csv', index_col=0)

vcf = {"plate" : "1 + C(plate)", "sample" : "0 + C(sample)"}
model = sm.MixedLM.from_formula('diameter ~ 1', groups="plate", vc_formula=vcf, data=data)
result = model.fit()
result.summary()


# The same example in R:

# In[31]:


get_ipython().run_line_magic('R', 'print (summary(fm <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)))')


# ### Dyestuff example
# 
# The dyestuff dataset is a self-sufficient dataset to understand fixed and random effects. The dataset involves batch to batch variability in the strength of dye. We have 6 batches and 5 samples for each batch, totalling to 30 observations. It is easy ti imagine two sources of variability here, the one within-batch and the other one across-batch. We use mixed models to quantify the batch-batch variability:
# 

# In[106]:


data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Dyestuff.csv', index_col=0)
model = smf.mixedlm('Yield ~ 1', groups="Batch", data=data)
result = model.fit()
result.summary()


# Same example in R:

# In[39]:


get_ipython().run_line_magic('R', 'print( summary(fm <- lmer(Yield ~ 1 + (1|Batch), Dyestuff)))')


# ###Dyestuff2 '
# 
# Same as the dyestuff example, but with batch-batch variability missing:
# 
# 

# In[42]:


get_ipython().run_line_magic('R', 'print(summary(Dyestuff2))')


# In[44]:


get_ipython().run_line_magic('R', 'print( summary(fm <- lmer(Yield ~ 1 + (1|Batch), Dyestuff2)))')


# The variance for batch random effect is 0, thus there is no random effect. It is as good as a linear model:
# 

# In[108]:


get_ipython().run_line_magic('R', 'print (summary(fm <- lm(Yield ~1, Dyestuff2)))')


# In[109]:


data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Dyestuff2.csv', index_col=0)
vcf = {"batch" : "0 + C(Batch)"}
model = sm.MixedLM.from_formula('Yield ~ 1', groups="Batch",  data=data)
result = model.fit()
result.summary()


# ### Pastes
# The experiment consists of strength of chemical pastes packaged in different casks, at two time points. There are 10 batches, with 3 casks sampled randomly and two tests performed. In total 10x3x2 observations.
# 
# So there are 30 samples with levels 'A:a', 'A:b', 'A:c' and so on, 3 samples for the 10 batches(This sort of naming saves us from the trouble of using different variable names for these 30 levels) The 'a', 'b', 'c' refer to the levels of the cask. But notice that cask 'a' of batch 'A' has no relation to cask 'a' of batch 'B' and so on. 'a', 'b', 'c' differentiate the casks within a batch but have no relation to other batches otherwise.
# 
# 
# This is an example of nested random effects because each level of sample occurs with one and only one level of the batch(compare it with the pivot table we had earlier where all the entries were one) 
# 

# In[86]:


data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Pastes.csv', index_col=0)

table = data.pivot_table(['batch', 'sample'],  rows='batch', columns='sample', aggfunc=len)
table


# In[110]:


data = pandas.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/lme4/Pastes.csv', index_col=0)
vcf = {"batch" : "0 + C(batch)", "sample" : "0 + C(sample)"}
model = sm.MixedLM.from_formula('strength ~ 1', groups="batch", vc_formula=vcf, data=data)
result = model.fit()
result.summary()


# In R: 

# In[101]:


get_ipython().run_line_magic('R', 'print( summary(fm <- lmer(strength ~ 1 + (1|batch) + (1|sample), Pastes)))')


# In[ ]: