Key ideas:
- Categorical variables (factors in pandas)

# Example: movie ratings
import pandas as pd
movies = pd.read_csv('http://www.rossmanchance.com/iscam2/data/movies03RT.txt', sep='\t',
names=['X', 'score', 'rating', 'genre', 'box_office', 'running_time'], skiprows=1)
movies.head()
 | X | score | rating | genre | box_office | running_time
---|---|---|---|---|---|---
0 | 2 Fast 2 Furious | 48.9 | PG-13 | action/adventure | 127.146 | 107
1 | 28 Days Later | 78.2 | R | horror | 45.065 | 113
2 | A Guy Thing | 39.5 | PG-13 | rom comedy | 15.545 | 101
3 | A Man Apart | 42.9 | R | action/adventure | 26.248 | 110
4 | A Mighty Wind | 79.9 | PG-13 | comedy | 17.781 | 91
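Before fitting anything, it helps to see how pandas represents the rating column. A quick sketch, using a tiny hand-made stand-in for the movies data so it runs without the download:

```python
import pandas as pd

# Hypothetical stand-in rows, not the real dataset
movies_demo = pd.DataFrame({
    'rating': ['PG-13', 'R', 'PG-13', 'R', 'PG-13', 'G', 'PG'],
    'score':  [48.9, 78.2, 39.5, 42.9, 79.9, 67.7, 55.1],
})

# How many movies fall in each rating category?
print(movies_demo['rating'].value_counts())

# Stored as a categorical, pandas tracks the distinct levels and integer codes
r_demo = pd.Categorical(movies_demo['rating'])
print(list(r_demo.categories))  # ['G', 'PG', 'PG-13', 'R']
print(list(r_demo.codes))       # one integer code per row
```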
import numpy as np

def jitter(series, factor):
    # Spread the points horizontally by a small uniform random amount
    # so overlapping observations become visible
    z = float(series.max()) - float(series.min())
    a = float(factor) * z / 50.
    return series + np.random.uniform(-a, a, len(series))
Rotten Tomatoes score vs. rating (average scores per rating are marked with red squares):
r = pd.Categorical(movies['rating'])
plot(jitter(r.codes, 2.), movies['score'], 'ob')
xticks(np.unique(r.codes), r.categories)
xlabel('rating')
ylabel('score')
meanRatings = movies.groupby('rating')['score'].mean()
plot(range(4), meanRatings, 'sr', markersize=10);
This model is written:

$$S_i = b_0 + b_1 1(Ra_i = \text{"PG"}) + b_2 1(Ra_i = \text{"PG-13"}) + b_3 1(Ra_i = \text{"R"}) + e_i$$

where $1(Ra_i = \text{"PG"})$ is an indicator that is one if the movie rating is "PG" and zero otherwise.
Average values:
b0 = average of G movies
b0+b1 = average of PG movies
b0+b2 = average of PG-13 movies
b0+b3 = average of R movies
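Under the hood, this is just ordinary regression on 0/1 dummy columns. A sketch with made-up scores (not the real data), using `pd.get_dummies` and a plain least-squares solve to show that the coefficients recover exactly these group averages:

```python
import numpy as np
import pandas as pd

# Made-up scores, for illustration only
df = pd.DataFrame({
    'rating': ['G', 'G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 65., 55., 57., 54., 58., 53., 59.],
})

# Treatment coding: drop the first level ('G'); it becomes the intercept b0
X = pd.get_dummies(df['rating'], drop_first=True).astype(float)
X.insert(0, 'Intercept', 1.0)

# Least squares: columns are [Intercept, PG, PG-13, R] -> [b0, b1, b2, b3]
b, *_ = np.linalg.lstsq(X.values, df['score'].values, rcond=None)

means = df.groupby('rating')['score'].mean()
print(b[0], means['G'])          # b0 is the G average
print(b[0] + b[3], means['R'])   # b0 + b3 is the R average
```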
To do regression with factor variables in Python we use statsmodels:
from statsmodels.formula.api import ols
lm1 = ols('score ~ rating', movies).fit()
lm1.summary()
Dep. Variable: | score | R-squared: | 0.020 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.002 |
Method: | Least Squares | F-statistic: | 0.9182 |
Date: | Sat, 06 Apr 2013 | Prob (F-statistic): | 0.434 |
Time: | 17:57:22 | Log-Likelihood: | -569.90 |
No. Observations: | 140 | AIC: | 1148. |
Df Residuals: | 136 | BIC: | 1160. |
Df Model: | 3 |
 | coef | std err | t | P>|t| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 67.6500 | 7.193 | 9.405 | 0.000 | 53.425 81.875 |
rating[T.PG] | -12.5929 | 7.849 | -1.604 | 0.111 | -28.114 2.928 |
rating[T.PG-13] | -11.8146 | 7.411 | -1.594 | 0.113 | -26.471 2.842 |
rating[T.R] | -12.0200 | 7.476 | -1.608 | 0.110 | -26.803 2.763 |
Omnibus: | 3.313 | Durbin-Watson: | 2.100 |
---|---|---|---|
Prob(Omnibus): | 0.191 | Jarque-Bera (JB): | 3.327 |
Skew: | 0.367 | Prob(JB): | 0.189 |
Kurtosis: | 2.821 | Cond. No. | 14.0 |
If we plot the fitted values:
plot(jitter(r.codes, 2.), movies['score'], 'ob')
xticks(np.unique(r.codes), r.categories)
xlabel('rating')
ylabel('score')
plot(range(4), lm1.params.iloc[0] + np.append(0, lm1.params.iloc[1:]), 'sr', markersize=10);
Question 1: What is the difference in average rating between G and R movies?
Answer: $(b_0 + b_3) - b_0 = b_3$
lm1.conf_int()
 | 0 | 1
---|---|---
Intercept | 53.424778 | 81.875222 |
rating[T.PG] | -28.113847 | 2.928133 |
rating[T.PG-13] | -26.471002 | 2.841772 |
rating[T.R] | -26.803285 | 2.763285 |
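A quick arithmetic check of this answer, using the fitted numbers from the summary above (the coefficient names are statsmodels' treatment-coded labels):

```python
# Numbers copied from the lm1 summary table above
g_mean = 67.6500   # Intercept: average score of G movies
b3 = -12.0200      # rating[T.R] coefficient

# The implied average score of R movies
r_mean = g_mean + b3
print(r_mean)  # ~55.63, which reappears below as the intercept once R is the reference
```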
Question 2: What is the difference in average rating between PG-13 and R movies?
Answer: $(b_0 + b_2) - (b_0 + b_3) = b_2 - b_3$
We could rewrite our model so that R is the reference level:

$$S_i = b_0 + b_1 1(Ra_i = \text{"G"}) + b_2 1(Ra_i = \text{"PG"}) + b_3 1(Ra_i = \text{"PG-13"}) + e_i$$

Average values:
b0 = average of R movies
b0+b1 = average of G movies
b0+b2 = average of PG movies
b0+b3 = average of PG-13 movies
What is the difference in average rating between PG-13 and R movies?

$(b_0 + b_3) - b_0 = b_3$
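Why this works: with R as the reference level, each coefficient is just that group's mean minus the R mean. A small numeric sketch (made-up scores again, not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    'rating': ['G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 55., 57., 54., 58., 53., 59.],
})

means = df.groupby('rating')['score'].mean()

# Under R-reference treatment coding: intercept = R mean, and the
# coefficient for each other level = that level's mean minus the R mean
coefs = means - means['R']
print(means['R'])  # the intercept
print(coefs)       # the G/PG/PG-13 entries play the roles of b1, b2, b3
```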
from patsy.contrasts import Treatment
lm2 = ols('score ~ C(rating, Treatment(reference="R"))', movies).fit()
lm2.summary()
Dep. Variable: | score | R-squared: | 0.020 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | -0.002 |
Method: | Least Squares | F-statistic: | 0.9182 |
Date: | Sat, 06 Apr 2013 | Prob (F-statistic): | 0.434 |
Time: | 19:23:37 | Log-Likelihood: | -569.90 |
No. Observations: | 140 | AIC: | 1148. |
Df Residuals: | 136 | BIC: | 1160. |
Df Model: | 3 |
 | coef | std err | t | P>|t| | [95.0% Conf. Int.]
---|---|---|---|---|---
Intercept | 55.6300 | 2.035 | 27.342 | 0.000 | 51.606 59.654 |
C(rating, Treatment(reference="R"))[T.G] | 12.0200 | 7.476 | 1.608 | 0.110 | -2.763 26.803 |
C(rating, Treatment(reference="R"))[T.PG] | -0.5729 | 3.741 | -0.153 | 0.879 | -7.971 6.825 |
C(rating, Treatment(reference="R"))[T.PG-13] | 0.2054 | 2.706 | 0.076 | 0.940 | -5.146 5.557 |
Omnibus: | 3.313 | Durbin-Watson: | 2.100 |
---|---|---|---|
Prob(Omnibus): | 0.191 | Jarque-Bera (JB): | 3.327 |
Skew: | 0.367 | Prob(JB): | 0.189 |
Kurtosis: | 2.821 | Cond. No. | 7.05 |
lm2.conf_int()
 | 0 | 1
---|---|---
Intercept | 51.606500 | 59.653500 |
C(rating, Treatment(reference="R"))[T.G] | -2.763285 | 26.803285 |
C(rating, Treatment(reference="R"))[T.PG] | -7.971015 | 6.825300 |
C(rating, Treatment(reference="R"))[T.PG-13] | -5.146371 | 5.557140 |
For the model:

$$S_i = b_0 + b_1 1(Ra_i = \text{"PG"}) + b_2 1(Ra_i = \text{"PG-13"}) + b_3 1(Ra_i = \text{"R"}) + e_i$$

Question 3: Is there any difference in score between any of the movie ratings?
To answer this we'll use ANOVA:
from statsmodels.stats.anova import anova_lm
anova_lm(lm1)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
rating | 3 | 570.123813 | 190.041271 | 0.918184 | 0.433975 |
Residual | 136 | 28148.635044 | 206.975258 | NaN | NaN |
Our p-value is the probability PR(>F). As before, in hypothesis-testing parlance, we take the null hypothesis H0 to be that there is no difference in score between any of the movie ratings, and the alternative hypothesis Ha to be that at least one rating's average score differs. Here our p-value (0.434) is greater than the usual 0.05 threshold, so we cannot reject the null hypothesis (i.e. we do not have significant evidence against it at the 0.05 significance level).
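The F statistic in that table can be recomputed directly from the sums of squares and degrees of freedom it reports:

```python
# Entries copied from the anova_lm table above
sum_sq_rating, df_rating = 570.123813, 3
sum_sq_resid, df_resid = 28148.635044, 136

# F = (between-group mean square) / (residual mean square)
F = (sum_sq_rating / df_rating) / (sum_sq_resid / df_resid)
print(F)  # ~0.918184, matching the table
```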
Another way is to use Tukey's HSD test:
import statsmodels.sandbox.stats.multicomp as multi
mc = multi.MultiComparison(movies['score'], r.codes)
res = mc.tukeyhsd()
print(res)
print(list(zip(r.categories, range(4))))
Multiple Comparison of Means - Tukey HSD, FWER=0.05
==============================================
group1 group2 meandiff   lower    upper  reject
----------------------------------------------
   0      1   -12.5929 -33.0089   7.8232 False
   0      2   -11.8146 -31.0934   7.4642 False
   0      3   -12.02   -31.4657   7.4257 False
   1      2     0.7782  -8.6152  10.1717 False
   1      3     0.5729  -9.1586  10.3043 False
   2      3    -0.2054  -7.245    6.8342 False
----------------------------------------------
[('G', 0), ('PG', 1), ('PG-13', 2), ('R', 3)]
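The `meandiff` column in the Tukey output is just the pairwise difference of group means; the test then asks whether each difference is large relative to its standard error. A sketch of the differences themselves, again with made-up scores rather than the real data:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'rating': ['G', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'],
    'score':  [70., 55., 57., 54., 58., 53., 59.],
})
means = df.groupby('rating')['score'].mean()

# One entry per pair of rating groups, like the Tukey table's group1/group2 rows
diffs = {(a, b): means[b] - means[a] for a, b in combinations(means.index, 2)}
for pair, d in diffs.items():
    print(pair, d)
```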