(Defining latex commands: not to be shown...) $$ \newcommand{\norm}[1]{\left \| #1 \right \|} \DeclareMathOperator{\minimize}{minimize} \DeclareMathOperator{\maximize}{maximize} \newcommand{\real}{\mathbb{R}} \newcommand{\blasso}{\beta^{\mathrm{LASSO}}} \newcommand{\bzero}{\beta^0} \newcommand{\bLS}{\hat{\beta}^{\mathrm{LS}}} $$
Let $X \in \real^{n \times p}$ be the training input matrix and $Y \in \real^n$ the training output. In a linear model we predict $\hat{y}(x) = x^T \hat{\beta}$ for an input $x \in \real^p$ and a fixed coefficient vector $\hat{\beta} \in \real^p$. The so-called $\ell_0$-penalized estimator $\bzero(\lambda)$ tries to find a good $\beta$ with relatively few non-zero entries. It is defined as: \begin{equation} \bzero(\lambda) := \mathrm{arg \, min}_{\beta} \frac{1}{n} \norm{Y - X \beta}_2^2 + \lambda \norm{\beta}_0 , \end{equation} where $\norm{\beta}_0 := \# \{j : \beta_j \neq 0\}$, $\norm{\beta}_2^2 := \sum_{i=1}^p \beta_i^2$ and $\lambda > 0$. Alternatively, the LASSO estimator $\blasso(\lambda)$ is defined as: \begin{equation} \blasso(\lambda) := \mathrm{arg \, min}_{\beta} \frac{1}{n} \norm{Y - X \beta}_2^2 + \lambda \norm{\beta}_1 , \end{equation} where $\norm{\beta}_1 := \sum_{i=1}^p |\beta_i|$ and $\lambda > 0$.
There are in general no analytic formulae for $\bzero(\lambda)$ and $\blasso(\lambda)$; in practice, they are computed by numerical optimization. However, we can derive analytic expressions in the special case of orthonormal design, where \begin{equation} p=n \quad \mathrm{and} \quad \frac{1}{n} X^T X = I_{p \times p} . \end{equation} The goal of this exercise is to compute these formulae.
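As a hint for the derivation (this intermediate step is ours, not part of the original statement): in the orthonormal design, the penalized criterion decouples across coordinates. Writing $z := \frac{1}{n} X^T Y$ and using $\frac{1}{n} X^T X = I_{p \times p}$, \begin{equation} \frac{1}{n} \norm{Y - X \beta}_2^2 + \lambda \, \mathrm{pen}(\beta) = \frac{1}{n} \norm{Y}_2^2 - \norm{z}_2^2 + \sum_{j=1}^p \left[ (\beta_j - z_j)^2 + \lambda \, \mathrm{pen}_j(\beta_j) \right] , \end{equation} where $\mathrm{pen}(\beta)$ is $\norm{\beta}_0$ (with $\mathrm{pen}_j(\beta_j) = 1_{\{\beta_j \neq 0\}}$) or $\norm{\beta}_1$ (with $\mathrm{pen}_j(\beta_j) = |\beta_j|$). Each coordinate $\beta_j$ can therefore be minimized separately.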
Let $g_{\mathrm{hard}, \lambda}$ be the hard-thresholding function: \begin{equation} g_{\mathrm{hard}, \lambda} : \real \rightarrow \real , \quad z \mapsto z \, 1_{ \{|z| > \lambda\} } , \end{equation} where $1_{ \{|z| > \lambda\} } = 1$ if $|z| > \lambda$ and $0$ otherwise. Suppose we are in the orthonormal design setting.
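For intuition, here is a minimal numpy sketch of $g_{\mathrm{hard}, \lambda}$ (the name g_hard is our own choice, not part of the exercise):
import numpy as np

def g_hard(z, lam):
    # hard thresholding: keep z where |z| > lam, set it to 0 elsewhere
    return np.where(np.abs(z) > lam, z, 0.0)

g_hard(np.array([-2.0, 0.5, 3.0]), 1.0) # -> array([-2., 0., 3.])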
Let $g_{\mathrm{soft}, \lambda}$ be the soft-thresholding function: \begin{equation} g_{\mathrm{soft}, \lambda} : \real \rightarrow \real , \quad z \mapsto \mathrm{sign}(z) \, (|z| - \lambda)_+ , \end{equation} where $(a)_+ := \max(a, 0)$. Suppose we are in the orthonormal design setting.
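Analogously, a minimal numpy sketch of $g_{\mathrm{soft}, \lambda}$ (again, the name g_soft is our own choice):
import numpy as np

def g_soft(z, lam):
    # soft thresholding: shrink |z| towards 0 by lam (clipping at 0), keeping the sign of z
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

g_soft(np.array([-2.0, 0.5, 3.0]), 1.0) # -> array([-1., 0., 2.])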
(This exercise is independent of the first one.)
The goal of this exercise is to apply both ridge regression and LASSO to real data. The data can be found on the homepage of the course, in the files xtrain_xxxx.csv and xtest_xxxx.csv, and comes from the paper by Kemmeren et al. It consists of the logarithm of the gene expression levels of yeast cells.
More specifically, we consider two datasets corresponding to two different targets: say gene 4710 for the first and gene 3290 for the second. In both cases, we would like to predict the gene expression levels of the target gene, given the gene expression levels of the other 6170 genes. For each dataset, we are given a training set $X_\mathrm{train} \in \real^{140 \, \times \, 6170}$, $y_\mathrm{train} \in \real^{140}$ consisting of 140 yeast cells, and a test set $X_\mathrm{test} \in \real^{20 \, \times \, 6170}$ with 20 cells. To evaluate the performance of our model, we also provide the target values $y_\mathrm{test} \in \real^{20}$ of the test set.
You are asked to hand in:
Hint: For all the regression tasks, we recommend using Python functions of the type linear_model.xxx(xxx) from the package sklearn. The code you are asked to hand in should look quite similar to the code of the section Linear regression with no regularizer. Cross-validation can easily be done using functions of the type linear_model.xxxCV(xxx). Do not hesitate to consult the documentation of these functions. Finally, note that what we call $\lambda$ corresponds to the parameter alpha of these Python functions.
%matplotlib inline
from numpy import *
from scipy import *
from matplotlib import pyplot as plt
from sklearn import linear_model
from pandas import * # for easy import of data
### CHANGE THE PATHS ###
Xtrain = read_csv("./datasets/Xtrain_4710.csv", header = None) # header = None: we assume the CSV files contain no header row
Xtest = read_csv("./datasets/Xtest_4710.csv", header = None)
ytrain = read_csv("./datasets/ytrain_4710.csv", header = None, names = ["gene","y"])
ytest = read_csv("./datasets/ytest_4710.csv", header = None, names = ["gene","y"])
Xtrain = Xtrain.drop(Xtrain.columns[[0]],axis = 1)
ytrain = squeeze(ytrain.drop(ytrain.columns[[0]],axis = 1))
Xtest = Xtest.drop(Xtest.columns[[0]],axis = 1)
ytest = squeeze(ytest.drop(ytest.columns[[0]],axis = 1))
The purpose of this function is to plot the regression weights, plot the predicted values yhat against ytest, and print the index of the gene with the strongest coefficient (largest in absolute value) together with the so-called coefficient of determination $R^2$, defined as: \begin{equation} R^2 := 1 - \frac{\sum_{(x_i,y_i) \, \in \mathrm{\, test \ set}} (\hat{y}(x_i) - y_i)^2}{\sum_{(x_i,y_i) \, \in \mathrm{\, test \ set}} (y_i - \overline{y})^2}, \end{equation} where $\overline{y}$ is the empirical mean of the $y_i$'s in the test set.
def printOutput(lm_, Xtest_, ytest_): # lm_ = fitted instance of linear_model.xxx(xxx)
    yhat = lm_.predict() ### CHANGE THIS LINE ###
    plt.figure(1)
    plt.title("Regression Weights")
    plt.plot(lm_.coef_.T)
    plt.figure(2)
    plt.title('yhat vs ytest')
    plt.plot(ytest_, yhat, 'ro')
    plt.show()
    # print 'R2 :', lm_.score() ### CHANGE THIS LINE ###
    # print 'Gene with Strongest Coefficient :', lm_.coef_.xxx ### CHANGE THIS LINE ###
    if hasattr(lm_, 'alpha'): print 'Used Lambda :', lm_.alpha # if using linear_model.xxx
    if hasattr(lm_, 'alpha_'): print 'Lambda_ResultOfCV :', lm_.alpha_ # if using linear_model.xxxCV
    # print 'Regression Coefs :', lm_.coef_ # can be printed out, if wanted
    # print 'Regression Intercept :', lm_.intercept_ # can be printed out, if wanted
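For reference, one possible way to fill in the placeholders above (a sketch, not the official solution: predict and score are standard sklearn methods, with score returning $R^2$ for regressors, and taking the argmax of the absolute coefficients is one reasonable reading of "strongest coefficient"):
# possible completions of the placeholders inside printOutput:
# yhat = lm_.predict(Xtest_) # predictions on the test inputs
# print 'R2 :', lm_.score(Xtest_, ytest_) # sklearn's score() returns R^2 for regressors
# print 'Gene with Strongest Coefficient :', argmax(abs(lm_.coef_)) # index of the largest |coefficient|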
a. What would you expect to happen if you applied linear regression without any regularizer to this data set? (Recall that $p = 6170 \gg n = 140$.)
b. Does this happen with the function linear_model.LinearRegression()?
lm = linear_model.LinearRegression()
lm.fit(Xtrain,ytrain)
# printOutput(lm,Xtest,ytest) ### PROVIDE THE OUTPUT ###
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
#lm = linear_model.Ridge(xxx) ### CHANGE THIS LINE ###
lm.fit(Xtrain, ytrain);
#printOutput(lm, Xtest, ytest) ### PROVIDE THE OUTPUT ###
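If you want the cross-validated choice of $\lambda$ mentioned in the hint, here is a possible sketch using linear_model.RidgeCV (the alpha grid is an arbitrary choice of ours):
# ridge with cross-validated lambda; the alpha grid is an arbitrary choice
lm = linear_model.RidgeCV(alphas = logspace(-2, 4, 20))
lm.fit(Xtrain, ytrain)
# printOutput(lm, Xtest, ytest)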
# lm = linear_model.Lasso(xxx) ### CHANGE THIS LINE ###
lm.fit(Xtrain, ytrain)
# printOutput(lm, Xtest, ytest) ### PROVIDE THE OUTPUT ###
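Similarly, a possible sketch using linear_model.LassoCV (cv = 10 is an arbitrary choice of ours):
# LASSO with cross-validated lambda; cv = 10 is an arbitrary choice
lm = linear_model.LassoCV(cv = 10)
lm.fit(Xtrain, ytrain)
# printOutput(lm, Xtest, ytest)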
print 'Regression Coefficient of Gene 5954 : ' ### CHANGE THIS LINE ### ### PROVIDE THE OUTPUT ###
Regression Coefficient of Gene 5954 :