import pandas as pd
import numpy as np
import rpy2.robjects as robjects
pi = robjects.r('pi')
pi[0]
3.141592653589793
%load_ext rpy2.ipython
(In older IPython versions this extension was loaded with %load_ext rmagic.)
Run a linear regression in R, print a summary, and pass the result variable error back to Python:
%%R -o error
set.seed(10)
# Build a synthetic data set with three predictors
y <- c(1:1000)
x1 <- c(1:1000) * runif(1000, min = 0, max = 2)
x2 <- (c(1:1000) * runif(1000, min = 0, max = 2))^2
x3 <- log(c(1:1000) * runif(1000, min = 0, max = 2))
all_data <- data.frame(y, x1, x2, x3)
# 75/25 train/test split
positions <- sample(nrow(all_data), size = floor((nrow(all_data) / 4) * 3))
training <- all_data[positions, ]
testing <- all_data[-positions, ]
# Fit, summarize, predict, and compute RMSE on the test set
lm_fit <- lm(y ~ x1 + x2 + x3, data = training)
print(summary(lm_fit))
predictions <- predict(lm_fit, newdata = testing)
error <- sqrt(sum((testing$y - predictions)^2) / nrow(testing))
Call:
lm(formula = y ~ x1 + x2 + x3, data = training)

Residuals:
    Min      1Q  Median      3Q     Max
-379.34 -125.71  -29.88   87.58  732.59

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.234e+01  2.495e+01  -2.098   0.0363 *
x1           2.414e-01  1.589e-02  15.188   <2e-16 ***
x2           1.553e-04  9.767e-06  15.900   <2e-16 ***
x3           6.404e+01  4.827e+00  13.267   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 166.4 on 746 degrees of freedom
Multiple R-squared: 0.6613,	Adjusted R-squared: 0.6599
F-statistic: 485.5 on 3 and 746 DF,  p-value: < 2.2e-16
print(error)
[ 169.85333821]
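The error computed in R is just the root-mean-square error (RMSE) on the test set. As a sanity check, the same formula can be written in plain numpy; the helper name rmse here is my own, not part of the notebook:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy example: residuals are (0, 0, 1), so RMSE = sqrt(1/3) ≈ 0.577
print(rmse([1, 2, 3], [1, 2, 4]))
```

Applied to testing$y and the R predictions, this reproduces the 169.85 figure above.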
First we create the data in R:
%%R -o training,testing
set.seed(10)
y <- c(1:1000)
x1 <- c(1:1000) * runif(1000, min = 0, max = 2)
x2 <- (c(1:1000) * runif(1000, min = 0, max = 2))^2
x3 <- log(c(1:1000) * runif(1000, min = 0, max = 2))
all_data <- data.frame(y, x1, x2, x3)
positions <- sample(nrow(all_data), size = floor((nrow(all_data) / 4) * 3))
training <- all_data[positions, ]
testing <- all_data[-positions, ]
The variables training and testing are now available as numpy arrays in the Python namespace thanks to the -o flag in the cell above. We'll create pandas DataFrames from them:
tr = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], training)))
te = pd.DataFrame(dict(zip(['y', 'x1', 'x2', 'x3'], testing)))
tr.head()
| | x1 | x2 | x3 | y |
|---|---|---|---|---|
| 0 | 724.861370 | 19728.318211 | 6.430894 | 614 |
| 1 | 103.074180 | 928.821687 | 5.132348 | 108 |
| 2 | 606.561051 | 1050676.686068 | 6.564257 | 518 |
| 3 | 862.674044 | 91504.275820 | 4.670171 | 879 |
| 4 | 393.014599 | 1134.679888 | 5.721699 | 379 |
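As an aside, the 75/25 split that R's sample() performed above can also be done directly on the pandas side. A minimal sketch on stand-in data (the column values and random_state are arbitrary, not from the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_data: 1000 rows
df = pd.DataFrame({
    'y': np.arange(1000),
    'x1': np.random.rand(1000),
})

# Sample 75% of the rows for training; the remainder is the test set
train = df.sample(frac=0.75, random_state=10)
test = df.drop(train.index)

print(len(train), len(test))  # 750 250
```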
Create a linear regression model and print a summary:
from statsmodels.formula.api import ols
lm = ols('y ~ x1 + x2 + x3', tr).fit()
lm.summary()
| Dep. Variable: | y | R-squared: | 0.661 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.660 |
| Method: | Least Squares | F-statistic: | 485.5 |
| Date: | Sun, 05 May 2013 | Prob (F-statistic): | 7.53e-175 |
| Time: | 12:06:08 | Log-Likelihood: | -4898.0 |
| No. Observations: | 750 | AIC: | 9804. |
| Df Residuals: | 746 | BIC: | 9823. |
| Df Model: | 3 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -52.3400 | 24.950 | -2.098 | 0.036 | -101.321 -3.359 |
| x1 | 0.2414 | 0.016 | 15.188 | 0.000 | 0.210 0.273 |
| x2 | 0.0002 | 9.77e-06 | 15.900 | 0.000 | 0.000 0.000 |
| x3 | 64.0431 | 4.827 | 13.267 | 0.000 | 54.567 73.520 |

| Omnibus: | 85.222 | Durbin-Watson: | 1.999 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 112.468 |
| Skew: | 0.898 | Prob(JB): | 3.78e-25 |
| Kurtosis: | 3.609 | Cond. No. | 3.41e+06 |
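Beyond the summary table, the fitted statsmodels result object exposes everything programmatically (params, rsquared, and so on). A small sketch on toy data where the true relationship is y = 2x + 1, assuming statsmodels is installed:

```python
import pandas as pd
from statsmodels.formula.api import ols

# Exact linear data: y = 2*x + 1
d = pd.DataFrame({'x': [0.0, 1.0, 2.0, 3.0], 'y': [1.0, 3.0, 5.0, 7.0]})
fit = ols('y ~ x', d).fit()

# Coefficients and fit statistics as plain numbers
print(fit.params['Intercept'], fit.params['x'])  # ≈ 1.0 and 2.0
print(fit.rsquared)                              # ≈ 1.0 (perfect fit)
```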
Predict and compute RMSE:
pred = lm.predict(te)
error = np.sqrt(((te.y - pred) ** 2).sum() / len(te))
error
169.85333821453432
First we create the data (numpy arrays) in Python:
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
We pass them into R with the -i flag, run the linear regression in R, print a summary and diagnostic plots, and send the result back to Python with -o:
%%R -i X,Y -o XYcoef
XYlm <- lm(Y ~ X)
XYcoef <- coef(XYlm)
print(summary(XYlm))
par(mfrow = c(2, 2))
plot(XYlm)
Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5
-0.2  0.9 -1.0  0.1  0.2

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared: 0.81,	Adjusted R-squared: 0.7467
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739
The model coefficients were also passed back from R as the variable XYcoef:
XYcoef
array([ 3.2, 0.9])
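As a cross-check, this simple regression can be fit on the Python side without R at all, for example with numpy's polyfit (this cell is my addition, not part of the original notebook):

```python
import numpy as np

X = np.array([0, 1, 2, 3, 4])
Y = np.array([3, 5, 4, 6, 7])

# polyfit returns coefficients highest degree first: [slope, intercept]
slope, intercept = np.polyfit(X, Y, 1)
print(intercept, slope)  # ≈ 3.2 and 0.9, matching XYcoef from R
```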