import pandas as pd
import numpy as np
import rpy2.robjects as robjects
pi = robjects.r('pi')
pi[0]
3.141592653589793
%load_ext rpy2.ipython
(In older IPython versions this extension was loaded with %load_ext rmagic.)
Run a linear regression in R, print a summary, and pass the result variable error back to Python:
%%R -o error
set.seed(10)
# Build a synthetic data set with three predictors
y <- c(1:1000)
x1 <- c(1:1000) * runif(1000, min = 0, max = 2)
x2 <- (c(1:1000) * runif(1000, min = 0, max = 2))^2
x3 <- log(c(1:1000) * runif(1000, min = 0, max = 2))
all_data <- data.frame(y, x1, x2, x3)
# 75/25 train/test split
positions <- sample(nrow(all_data), size = floor((nrow(all_data) / 4) * 3))
training <- all_data[positions, ]
testing <- all_data[-positions, ]
# Fit, summarize, predict, and compute RMSE on the test set
lm_fit <- lm(y ~ x1 + x2 + x3, data = training)
print(summary(lm_fit))
predictions <- predict(lm_fit, newdata = testing)
error <- sqrt(sum((testing$y - predictions)^2) / nrow(testing))
Call:
lm(formula = y ~ x1 + x2 + x3, data = training)

Residuals:
    Min      1Q  Median      3Q     Max
-379.34 -125.71  -29.88   87.58  732.59

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.234e+01  2.495e+01  -2.098   0.0363 *
x1           2.414e-01  1.589e-02  15.188   <2e-16 ***
x2           1.553e-04  9.767e-06  15.900   <2e-16 ***
x3           6.404e+01  4.827e+00  13.267   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 166.4 on 746 degrees of freedom
Multiple R-squared: 0.6613,	Adjusted R-squared: 0.6599
F-statistic: 485.5 on 3 and 746 DF,  p-value: < 2.2e-16
print(error)
[ 169.85333821]
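The error computed in R is just the root-mean-square error (RMSE) on the test set. As a sanity check, the same formula can be written in plain numpy; the helper name rmse here is my own, not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy example: residuals are (0, 0, 1), so RMSE = sqrt(1/3) ≈ 0.577
print(rmse([1, 2, 3], [1, 2, 4]))
```

Applied to testing$y and the R predictions, this reproduces the 169.85 figure above.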
First we create the data in R:
%%R -o training,testing
set.seed(10)
y <- c(1:1000)
x1 <- c(1:1000) * runif(1000, min = 0, max = 2)
x2 <- (c(1:1000) * runif(1000, min = 0, max = 2))^2
x3 <- log(c(1:1000) * runif(1000, min = 0, max = 2))
all_data <- data.frame(y, x1, x2, x3)
positions <- sample(nrow(all_data), size = floor((nrow(all_data) / 4) * 3))
training <- all_data[positions, ]
testing <- all_data[-positions, ]
The variables training and testing are now available as numpy arrays in the Python namespace thanks to the -o flag in the cell above. We'll create pandas DataFrames from them:
tr = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], training)))
te = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], testing)))
tr.head()
| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 724.861370 | 19728.318211 | 6.430894 | 614 |
| 1 | 103.074180 | 928.821687 | 5.132348 | 108 |
| 2 | 606.561051 | 1050676.686068 | 6.564257 | 518 |
| 3 | 862.674044 | 91504.275820 | 4.670171 | 879 |
| 4 | 393.014599 | 1134.679888 | 5.721699 | 379 |
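As an aside, the 75/25 split that R's sample() performed above can also be done directly on the pandas side. A minimal sketch on stand-in data (the column values and random_state are arbitrary, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_data: 1000 rows
df = pd.DataFrame({
    'y': np.arange(1000),
    'x1': np.random.rand(1000),
})

# Sample 75% of the rows for training; the remainder is the test set
train = df.sample(frac=0.75, random_state=10)
test = df.drop(train.index)

print(len(train), len(test))  # 750 250
```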
Create a linear regression model and print a summary:
from statsmodels.formula.api import ols
lm = ols('y ~ x1 + x2 + x3', tr).fit()
lm.summary()
| Dep. Variable: | y | R-squared: | 0.661 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.660 |
| Method: | Least Squares | F-statistic: | 485.5 |
| Date: | Sun, 05 May 2013 | Prob (F-statistic): | 7.53e-175 |
| Time: | 12:06:08 | Log-Likelihood: | -4898.0 |
| No. Observations: | 750 | AIC: | 9804. |
| Df Residuals: | 746 | BIC: | 9823. |
| Df Model: | 3 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -52.3400 | 24.950 | -2.098 | 0.036 | -101.321 -3.359 |
| x1 | 0.2414 | 0.016 | 15.188 | 0.000 | 0.210 0.273 |
| x2 | 0.0002 | 9.77e-06 | 15.900 | 0.000 | 0.000 0.000 |
| x3 | 64.0431 | 4.827 | 13.267 | 0.000 | 54.567 73.520 |

| Omnibus: | 85.222 | Durbin-Watson: | 1.999 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 112.468 |
| Skew: | 0.898 | Prob(JB): | 3.78e-25 |
| Kurtosis: | 3.609 | Cond. No. | 3.41e+06 |
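Beyond the summary table, the fitted statsmodels result object exposes everything programmatically (params, rsquared, and so on). A small sketch on toy data where the true relationship is y = 2x + 1, assuming statsmodels is installed:

```python
import pandas as pd
from statsmodels.formula.api import ols

# Exact linear data: y = 2*x + 1
d = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0], 'y': [1.0, 3.0, 5.0, 7.0]})
fit = ols('y ~ x', d).fit()

# Coefficients and fit statistics as plain numbers
print(fit.params['Intercept'], fit.params['x'])  # ≈ 1.0 and 2.0
print(fit.rsquared)                              # ≈ 1.0 (perfect fit)
```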
Predict and compute RMSE:
pred = lm.predict(te)
error = np.sqrt(((te.y - pred) ** 2).sum() / len(te))
error
169.85333821453432
First we create the data (numpy arrays) in Python:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
We pass them into R with the -i flag, run the linear regression in R, print a summary and diagnostic plots, and send the result back to Python with -o:
%%R -i X,Y -o XYcoef
XYlm <- lm(Y ~ X)
XYcoef <- coef(XYlm)
print(summary(XYlm))
par(mfrow = c(2, 2))
plot(XYlm)
Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5
-0.2  0.9 -1.0  0.1  0.2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared: 0.81,	Adjusted R-squared: 0.7467
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739
The model coefficients were also passed back from R as the variable XYcoef:
XYcoef
array([ 3.2, 0.9])
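As a cross-check, this simple regression can be fit on the Python side without R at all, for example with numpy's polyfit (this cell is my addition, not part of the original notebook):

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])
Y = np.array([3, 5, 4, 6, 7])

# polyfit returns coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(X, Y, 1)
print(intercept, slope)  # ≈ 3.2 and 0.9, matching XYcoef from R
```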