Key ideas:
- Categorical variables (factors in pandas)

# Example: movie ratings
import pandas as pd
movies = pd.read_csv('http://www.rossmanchance.com/iscam2/data/movies03RT.txt', sep='\t',
names=['X', 'score', 'rating', 'genre', 'box_office', 'running_time'], skiprows=1)
movies.head()
 | X | score | rating | genre | box_office | running_time
---|---|---|---|---|---|---
0 | 2 Fast 2 Furious | 48.9 | PG-13 | action/adventure | 127.146 | 107
1 | 28 Days Later | 78.2 | R | horror | 45.065 | 113
2 | A Guy Thing | 39.5 | PG-13 | rom comedy | 15.545 | 101
3 | A Man Apart | 42.9 | R | action/adventure | 26.248 | 110
4 | A Mighty Wind | 79.9 | PG-13 | comedy | 17.781 | 91
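Before fitting anything, it helps to see how pandas represents the rating column. A quick sketch, using a tiny hand-made stand-in for the movies data so it runs without the download:

```python
import pandas as pd

# Hypothetical stand-in rows, not the real dataset
movies_demo = pd.DataFrame({
    'rating': ['PG-13', 'R', 'PG-13', 'R', 'PG-13', 'G', 'PG'],
    'score':  [48.9, 78.2, 39.5, 42.9, 79.9, 67.7, 55.1],
})

# How many movies fall in each rating category?
print(movies_demo['rating'].value_counts())

# Stored as a categorical, pandas tracks the distinct levels and integer codes
r_demo = pd.Categorical(movies_demo['rating'])
print(list(r_demo.categories))  # ['G', 'PG', 'PG-13', 'R']
print(list(r_demo.codes))       # one integer code per row
```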
import numpy as np

def jitter(series, factor):
    # Spread the points horizontally by a small uniform random amount
    # so overlapping observations become visible
    z = float(series.max()) - float(series.min())
    a = float(factor) * z / 50.
    return series + np.random.uniform(-a, a, len(series))
Rotten Tomatoes score vs. rating (average scores per rating are marked with red squares):
r = pd.Categorical(movies['rating'])
plot(jitter(r.codes, 2.), movies['score'], 'ob')
xticks(np.unique(r.codes), r.categories)
xlabel('rating')
ylabel('score')
meanRatings = movies.groupby('rating')['score'].mean()
plot(range(4), meanRatings, 'sr', markersize=10);
This model is written:

$$S_i = b_0 + b_1 1(Ra_i = \text{"PG"}) + b_2 1(Ra_i = \text{"PG-13"}) + b_3 1(Ra_i = \text{"R"}) + e_i$$

where $1(Ra_i = \text{"PG"})$ is an indicator that is one if the movie rating is "PG" and zero otherwise.
Average values:
b0 = average of G movies
b0+b1 = average of PG movies
b0+b2 = average of PG-13 movies
b0+b3 = average of R movies
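Under the hood, this is just ordinary regression on 0/1 dummy columns. A sketch with made-up scores (not the real data), using `pd.get_dummies` and a plain least-squares solve to show that the coefficients recover exactly these group averages:

```python
import numpy as np
import pandas as pd

# Made-up scores, for illustration only
df = pd.DataFrame({
    'rating': ['G', 'G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 65., 55., 57., 54., 58., 53., 59.],
})

# Treatment coding: drop the first level ('G'); it becomes the intercept b0
X = pd.get_dummies(df['rating'], drop_first=True).astype(float)
X.insert(0, 'Intercept', 1.0)

# Least squares: columns are [Intercept, PG, PG-13, R] -> [b0, b1, b2, b3]
b, *_ = np.linalg.lstsq(X.values, df['score'].values, rcond=None)

means = df.groupby('rating')['score'].mean()
print(b[0], means['G'])          # b0 is the G average
print(b[0] + b[3], means['R'])   # b0 + b3 is the R average
```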
To do regression with factor variables in Python we use statsmodels:
from statsmodels.formula.api import ols
lm1 = ols('score ~ rating', movies).fit()
lm1.summary()
Dep. Variable: | score | R-squared: | 0.020 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.002 |
Method: | Least Squares | F-statistic: | 0.9182 |
Date: | Sat, 06 Apr 2013 | Prob (F-statistic): | 0.434 |
Time: | 17:57:22 | Log-Likelihood: | -569.90 |
No. Observations: | 140 | AIC: | 1148. |
Df Residuals: | 136 | BIC: | 1160. |
Df Model: | 3 |
 | coef | std err | t | P>|t| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 67.6500 | 7.193 | 9.405 | 0.000 | 53.425 81.875 |
rating[T.PG] | -12.5929 | 7.849 | -1.604 | 0.111 | -28.114 2.928 |
rating[T.PG-13] | -11.8146 | 7.411 | -1.594 | 0.113 | -26.471 2.842 |
rating[T.R] | -12.0200 | 7.476 | -1.608 | 0.110 | -26.803 2.763 |
Omnibus: | 3.313 | Durbin-Watson: | 2.100 |
---|---|---|---|
Prob(Omnibus): | 0.191 | Jarque-Bera (JB): | 3.327 |
Skew: | 0.367 | Prob(JB): | 0.189 |
Kurtosis: | 2.821 | Cond. No. | 14.0 |
If we plot the fitted values:
plot(jitter(r.codes, 2.), movies['score'], 'ob')
xticks(np.unique(r.codes), r.categories)
xlabel('rating')
ylabel('score')
plot(range(4), lm1.params.iloc[0] + np.append(0, lm1.params.iloc[1:]), 'sr', markersize=10);
Question 1: What is the difference in average rating between G and R movies?
Answer: $(b_0 + b_3) - b_0 = b_3$
lm1.conf_int()
 | 0 | 1
---|---|---
Intercept | 53.424778 | 81.875222 |
rating[T.PG] | -28.113847 | 2.928133 |
rating[T.PG-13] | -26.471002 | 2.841772 |
rating[T.R] | -26.803285 | 2.763285 |
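A quick arithmetic check of this answer, using the fitted numbers from the summary above (the coefficient names are statsmodels' treatment-coded labels):

```python
# Numbers copied from the lm1 summary table above
g_mean = 67.6500   # Intercept: average score of G movies
b3 = -12.0200      # rating[T.R] coefficient

# The implied average score of R movies
r_mean = g_mean + b3
print(r_mean)  # ~55.63, which reappears below as the intercept once R is the reference
```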
Question 2: What is the difference in average rating between PG-13 and R movies?
Answer: $(b_0 + b_2) - (b_0 + b_3) = b_2 - b_3$
We could rewrite our model so that R is the reference level:

$$S_i = b_0 + b_1 1(Ra_i = \text{"G"}) + b_2 1(Ra_i = \text{"PG"}) + b_3 1(Ra_i = \text{"PG-13"}) + e_i$$

Average values:
b0 = average of R movies
b0+b1 = average of G movies
b0+b2 = average of PG movies
b0+b3 = average of PG-13 movies
What is the difference in average rating between PG-13 and R movies?

$(b_0 + b_3) - b_0 = b_3$
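Why this works: with R as the reference level, each coefficient is just that group's mean minus the R mean. A small numeric sketch (made-up scores again, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    'rating': ['G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 55., 57., 54., 58., 53., 59.],
})

means = df.groupby('rating')['score'].mean()

# Under R-reference treatment coding: intercept = R mean, and the
# coefficient for each other level = that level's mean minus the R mean
coefs = means - means['R']
print(means['R'])  # the intercept
print(coefs)       # the G/PG/PG-13 entries play the roles of b1, b2, b3
```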
from patsy.contrasts import Treatment
lm2 = ols('score ~ C(rating, Treatment(reference="R"))', movies).fit()
lm2.summary()
Dep. Variable: | score | R-squared: | 0.020 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.002 |
Method: | Least Squares | F-statistic: | 0.9182 |
Date: | Sat, 06 Apr 2013 | Prob (F-statistic): | 0.434 |
Time: | 19:23:37 | Log-Likelihood: | -569.90 |
No. Observations: | 140 | AIC: | 1148. |
Df Residuals: | 136 | BIC: | 1160. |
Df Model: | 3 |
 | coef | std err | t | P>|t| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 55.6300 | 2.035 | 27.342 | 0.000 | 51.606 59.654 |
C(rating, Treatment(reference="R"))[T.G] | 12.0200 | 7.476 | 1.608 | 0.110 | -2.763 26.803 |
C(rating, Treatment(reference="R"))[T.PG] | -0.5729 | 3.741 | -0.153 | 0.879 | -7.971 6.825 |
C(rating, Treatment(reference="R"))[T.PG-13] | 0.2054 | 2.706 | 0.076 | 0.940 | -5.146 5.557 |
Omnibus: | 3.313 | Durbin-Watson: | 2.100 |
---|---|---|---|
Prob(Omnibus): | 0.191 | Jarque-Bera (JB): | 3.327 |
Skew: | 0.367 | Prob(JB): | 0.189 |
Kurtosis: | 2.821 | Cond. No. | 7.05 |
lm2.conf_int()
 | 0 | 1
---|---|---
Intercept | 51.606500 | 59.653500 |
C(rating, Treatment(reference="R"))[T.G] | -2.763285 | 26.803285 |
C(rating, Treatment(reference="R"))[T.PG] | -7.971015 | 6.825300 |
C(rating, Treatment(reference="R"))[T.PG-13] | -5.146371 | 5.557140 |
For the model:

$$S_i = b_0 + b_1 1(Ra_i = \text{"PG"}) + b_2 1(Ra_i = \text{"PG-13"}) + b_3 1(Ra_i = \text{"R"}) + e_i$$

Question 3: Is there any difference in score between any of the movie ratings?
To answer this we'll use ANOVA:
from statsmodels.stats.anova import anova_lm
anova_lm(lm1)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
rating | 3 | 570.123813 | 190.041271 | 0.918184 | 0.433975 |
Residual | 136 | 28148.635044 | 206.975258 | NaN | NaN |
Our p-value is the probability PR(>F). As before, in hypothesis-testing parlance, we take the null hypothesis H0 to be that there is no difference in score between any of the movie ratings, and the alternative hypothesis Ha to be that at least one rating's average score differs. Here our p-value (0.434) is greater than the usual 0.05 threshold, so we cannot reject the null hypothesis (i.e. we do not have significant evidence against it at the 0.05 significance level).
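The F statistic in that table can be recomputed directly from the sums of squares and degrees of freedom it reports:

```python
# Entries copied from the anova_lm table above
sum_sq_rating, df_rating = 570.123813, 3
sum_sq_resid, df_resid = 28148.635044, 136

# F = (between-group mean square) / (residual mean square)
F = (sum_sq_rating / df_rating) / (sum_sq_resid / df_resid)
print(F)  # ~0.918184, matching the table
```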
Another way is to use Tukey's HSD test:
import statsmodels.sandbox.stats.multicomp as multi
mc = multi.MultiComparison(movies['score'], r.codes)
res = mc.tukeyhsd()
print(res)
print(list(zip(r.categories, range(4))))
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================
group1 group2 meandiff   lower    upper  reject
----------------------------------------------
   0      1   -12.5929 -33.0089   7.8232 False
   0      2   -11.8146 -31.0934   7.4642 False
   0      3   -12.02   -31.4657   7.4257 False
   1      2     0.7782  -8.6152  10.1717 False
   1      3     0.5729  -9.1586  10.3043 False
   2      3    -0.2054  -7.245    6.8342 False
----------------------------------------------
[('G', 0), ('PG', 1), ('PG-13', 2), ('R', 3)]
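The `meandiff` column in the Tukey output is just the pairwise difference of group means; the test then asks whether each difference is large relative to its standard error. A sketch of the differences themselves, again with made-up scores rather than the real data:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'rating': ['G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 55., 57., 54., 58., 53., 59.],
})
means = df.groupby('rating')['score'].mean()

# One entry per pair of rating groups, like the Tukey table's group1/group2 rows
diffs = {(a, b): means[b] - means[a] for a, b in combinations(means.index, 2)}
for pair, d in diffs.items():
    print(pair, d)
```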