Demo of feature selection in scikit-learn and how things can go terribly wrong.
# first a few imports...
import numpy as np
from sklearn import feature_selection
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVC, LinearSVC
import warnings
warnings.filterwarnings('ignore')
# let's read in the yeast gene expression data that you used in an earlier assignment:
data = np.genfromtxt("data/yeast2.csv", delimiter = ",")
X = data[:,1:]
y = data[:,0]
print(X.shape)
(524, 79)
Let's establish a baseline of how well we are doing with an SVM with a linear kernel:
cv_generator = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results_orig = cross_validate(LinearSVC(), X, y, cv=cv_generator, scoring='roc_auc', return_train_score=False)
print(np.mean(results_orig['test_score']))
0.998674897119
Now let's add many noisy features and see how that affects classifier performance:
X = np.hstack((X, np.random.randn(len(y), 1000)))
results_noisy = cross_validate(LinearSVC(), X, y, cv=cv_generator, scoring='roc_auc', return_train_score=False)
print(np.mean(results_noisy['test_score']))
0.570410751029
Let's create an instance of RFE that uses an SVM to define weights for the features (any linear classifier will work):
selector = RFE(LinearSVC(), step=0.1, n_features_to_select=20)
# run feature selection:
selector = selector.fit(X, y)
# check which features got chosen:
print(sum(selector.support_[:79]), sum(selector.support_[79:]))
18 2
The fit method does not change the data itself. To obtain the reduced feature matrix, use fit_transform:
Xt = selector.fit_transform(X, y)
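As a minimal self-contained sketch of what fit_transform does (on random data with hypothetical shapes, not the yeast set), the returned matrix contains only the selected columns:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_demo = rng.randn(60, 100)            # 60 samples, 100 features
y_demo = rng.randint(0, 2, 60)         # binary labels

# select 20 features, discarding 10% of the remaining features per iteration
selector_demo = RFE(LinearSVC(), step=0.1, n_features_to_select=20)
Xt_demo = selector_demo.fit_transform(X_demo, y_demo)
print(Xt_demo.shape)                   # (60, 20): only the selected columns remain
```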
Evaluating the classification performance of feature selection is not trivial. To demonstrate a potential issue, let's use a gene expression data set available from the LIBSVM repository.
from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file("data/colon-cancer.data")
X.shape
results = cross_validate(LinearSVC(), X, y, cv=cv_generator, scoring='roc_auc', return_train_score=False)
print(np.mean(results['test_score']))
0.87375
Now let's perform feature selection using RFE and evaluate the resulting performance:
selector = RFE(LinearSVC(), step=0.1, n_features_to_select=30)
Xt = selector.fit_transform(X, y)
results_wrong = cross_validate(LinearSVC(), Xt, y, cv=cv_generator, scoring='roc_auc', return_train_score=False)
print(np.mean(results_wrong['test_score']))
1.0
Whenever we get such a fabulous result we need to be concerned. Where did we go wrong? The feature selection step saw the labels of all the samples, including those later used as test folds, so information about the test labels leaked into the selected features.
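To see the leak in isolation, we can reproduce it on pure noise, where the true AUC is 0.5. This is a self-contained sketch on synthetic data (using univariate SelectKBest instead of RFE for speed; all names here are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_noise = rng.randn(100, 2000)          # pure noise: no real signal
y_noise = rng.randint(0, 2, 100)

# WRONG: the selection step sees the labels of every sample,
# including those that will later serve as test folds
X_sel = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(LinearSVC(), X_sel, y_noise, cv=cv, scoring='roc_auc')
print(np.mean(res['test_score']))       # far above the true AUC of 0.5
```

The selected features look predictive only because they were chosen using the test labels.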
Here's the correct way to do this: nest the feature selection inside the cross-validation using a pipeline, so that features are re-selected on each training fold:
selector = RFE(LinearSVC(), step=0.1, n_features_to_select=30)
rfe_svm = make_pipeline(selector, LinearSVC())
results_nested = cross_validate(rfe_svm, X, y, cv=cv_generator, scoring='roc_auc', return_train_score=False)
print(np.mean(results_nested['test_score']))
0.88625
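RFECV, imported above but not used so far, goes one step further: it runs its own internal cross-validation to choose the number of features automatically, so you don't have to guess n_features_to_select. A self-contained sketch on synthetic data with a few informative features (the data and names here are hypothetical):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
y_syn = rng.randint(0, 2, 120)
# 5 informative features (shifted by the label) plus 45 noise features
X_syn = np.hstack([y_syn[:, None] + 0.5 * rng.randn(120, 5),
                   rng.randn(120, 45)])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rfecv = RFECV(LinearSVC(), step=0.1, cv=cv, scoring='roc_auc')
rfecv.fit(X_syn, y_syn)
print(rfecv.n_features_)    # number of features RFECV decided to keep
```

Note that to get an unbiased estimate of the final model's performance, RFECV itself should still be placed inside a pipeline and evaluated with an outer cross-validation, exactly as above.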
This issue was described in the literature by Ambroise and McLachlan in a 2002 PNAS paper on selection bias in gene extraction from microarray gene-expression data.