We'll primarily be concerned with two sklearn modules today: feature_selection and decomposition.
In the interest of simplicity and generality, good, strong features tend to have the following attributes:
We should aim to keep our models as simple as possible, so that any gains we see can be attributed to the features that matter. Simple models are also much easier to understand.
There are a number of techniques available in sklearn that automate these processes for us:
sklearn helper | technique
---|---
VarianceThreshold | Removes features whose variance falls below a tolerance level you set
SelectKBest | Selects the K features that score best against the target using feature_selection scoring functions. K (as usual) is something you search for and define.
L1 and Trees | Calling fit_transform on any supervised learning algorithm that supports it can drop features with low coefficients or importances.
While sklearn also has a pipeline module to further automate this process for you, it is usually better to explore the data first and get a sense of what you are working with. There's no magic button that says "solve my problem," but if you are interested in automating a model fit (say, a nightly procedure on a deployed model with constantly updated data), then it might be worth exploring.
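As a rough sketch of what that can look like (the selector, classifier, and k below are arbitrary choices for illustration, not a recommendation):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# chain a univariate selector and a classifier so the same feature selection
# is re-applied automatically every time the pipeline is refit
pipe = Pipeline([
    ('select', SelectKBest(score_func=f_classif, k=2)),
    ('clf', LogisticRegression()),
])
# pipe.fit(X, y) would run the selector's fit_transform and then fit the
# classifier on the reduced feature set; pipe.predict(X) reuses the same mask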
For each technique below we'll work through the iris dataset and see how it picks out the best features for us. We'll use iris because the data is well scaled (unscaled data would otherwise require fine-tuning) and relatively predictive (we know some features are more predictive than others).
For each code sample below, check the .shape of the new array returned and compare it to the original dataset. Which columns did it end up keeping, and which were removed?
import pandas as pd
def make_irisdf():
from sklearn.datasets import load_iris
from pandas import DataFrame
iris = load_iris()
df = DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
return df
iris = make_irisdf()
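# quick baseline check: the shape of the full frame and of the four feature
# columns, so the transformed arrays below can be compared against them
print iris.shape
print iris.ix[:, :4].shape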
from sklearn import feature_selection
VarianceThreshold
Goals:
print iris.ix[:,:4].apply(lambda x: x.var())
print iris.ix[:,:4].head()
print feature_selection.VarianceThreshold(threshold=.6).fit_transform(iris.ix[:,:4])[:5]
sepal length (cm)    0.685694
sepal width (cm)     0.188004
petal length (cm)    3.113179
petal width (cm)     0.582414
dtype: float64
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1              3.5               1.4              0.2
1                4.9              3.0               1.4              0.2
2                4.7              3.2               1.3              0.2
3                4.6              3.1               1.5              0.2
4                5.0              3.6               1.4              0.2
[[ 5.1  1.4]
 [ 4.9  1.4]
 [ 4.7  1.3]
 [ 4.6  1.5]
 [ 5.   1.4]]
SelectKBest
Goals:
math sidebar:
$\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$
$O_i$ = observed frequencies
$E_i$ = expected frequencies
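To make the sidebar concrete, here's a minimal hand computation of the statistic (the observed and expected counts are made up for illustration; this is the quantity the chi2 scorer below is built on):
import numpy as np

# hypothetical observed and expected frequencies for one feature
observed = np.array([10., 20., 30.])
expected = np.array([20., 20., 20.])

# chi-squared statistic: sum over categories of (O - E)^2 / E
chi_sq = ((observed - expected) ** 2 / expected).sum()
print chi_sq  # 5.0 + 0.0 + 5.0 = 10.0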
print iris.ix[:,:4].head()
ftest = feature_selection.SelectKBest(score_func=feature_selection.f_classif, k=3)
print pd.Series(ftest.fit(iris.ix[:,:4], iris['target']).scores_, index=iris.ix[:,:4].columns)
print ftest.fit_transform(iris.ix[:,:4], iris['target'])[:5]
chi = feature_selection.SelectKBest(score_func=feature_selection.chi2, k=3)
print pd.Series(chi.fit(iris.ix[:,:4], iris['target']).scores_, index=iris.ix[:,:4].columns)
print chi.fit_transform(iris.ix[:,:4], iris['target'])[:5]
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1              3.5               1.4              0.2
1                4.9              3.0               1.4              0.2
2                4.7              3.2               1.3              0.2
3                4.6              3.1               1.5              0.2
4                5.0              3.6               1.4              0.2
sepal length (cm)     119.264502
sepal width (cm)       47.364461
petal length (cm)    1179.034328
petal width (cm)      959.324406
dtype: float64
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]
sepal length (cm)     10.817821
sepal width (cm)       3.594499
petal length (cm)    116.169847
petal width (cm)      67.244828
dtype: float64
[[ 5.1  1.4  0.2]
 [ 4.9  1.4  0.2]
 [ 4.7  1.3  0.2]
 [ 4.6  1.5  0.2]
 [ 5.   1.4  0.2]]
LogisticRegression
Goals:
from sklearn import linear_model as lm
clf = lm.LogisticRegression(penalty='l1', C=0.1)
print iris.ix[:,:4].head()
print pd.DataFrame(clf.fit(iris.ix[:,:4], iris['target']).coef_, columns=iris.ix[:,:4].columns)
print clf.fit_transform(iris.ix[:,:4], iris['target'])[:5]
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1              3.5               1.4              0.2
1                4.9              3.0               1.4              0.2
2                4.7              3.2               1.3              0.2
3                4.6              3.1               1.5              0.2
4                5.0              3.6               1.4              0.2
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0           0.000000         1.124342         -1.344433                0
1           0.000000        -0.386422          0.122768                0
2          -0.987901         0.000000          1.277067                0
[[ 3.5  1.4]
 [ 3.   1.4]
 [ 3.2  1.3]
 [ 3.1  1.5]
 [ 3.6  1.4]]
DecisionTreeClassifier
Goals:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=4)
print iris.ix[:,:4].head()
print pd.Series(clf.fit(iris.ix[:,:4], iris['target']).feature_importances_, index=iris.ix[:,:4].columns)
print clf.fit_transform(iris.ix[:,:4], iris['target'])[:5]
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1              3.5               1.4              0.2
1                4.9              3.0               1.4              0.2
2                4.7              3.2               1.3              0.2
3                4.6              3.1               1.5              0.2
4                5.0              3.6               1.4              0.2
sepal length (cm)    0.013514
sepal width (cm)     0.000000
petal length (cm)    0.558165
petal width (cm)     0.428322
dtype: float64
[[ 0.2]
 [ 0.2]
 [ 0.2]
 [ 0.2]
 [ 0.2]]
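A version note: in newer sklearn releases, calling fit_transform directly on an estimator (as in the two examples above) is deprecated or removed. A rough equivalent, assuming the SelectFromModel wrapper is available in your version, looks like:
from sklearn.feature_selection import SelectFromModel
from sklearn import tree

# wrap the estimator; features whose importance falls below the wrapper's
# threshold are dropped when transform is called
selector = SelectFromModel(tree.DecisionTreeClassifier(max_depth=4))
print selector.fit_transform(iris.ix[:, :4], iris['target'])[:5]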
I don't want to get rid of them!
Then Principal Component Analysis to the rescue!
Manhattan is built on a grid system, with the exception of a couple key points:
If we needed to get from Herald Square to Eataly, what is easier to explain?
Why is that one easier to explain?
PCA is a common technique already used in your day to day:
<img src='img/pca_shakira.png' width='840' />
Recall that variance is a 1-dimensional metric describing the average squared distance from the mean. Covariance extends this idea, describing how a feature varies with respect to another feature.
If variance summarizes a single feature, and a correlation matrix is square (each feature's relationship with every other feature), what shape do we expect the covariance matrix to be?
We can interpret the covariance matrix as:
Principal Component Analysis is, essentially, a decomposition of the covariance matrix. We are interested in finding the eigenvalues of that square matrix, which, for PCA, represent the amount of variance explained by each principal component.
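As a quick sketch of that idea, using the iris frame loaded above (the exact values matter less than the shapes):
import numpy as np

# covariance matrix of the four iris features: square, 4 x 4
cov = np.cov(iris.ix[:, :4].T)
print cov.shape

# its eigenvalues are the variances explained by each principal component
eigenvalues, eigenvectors = np.linalg.eig(cov)
print eigenvalues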
here be demons
Eigenvalues and eigenvectors are defined such that:
$Av = \lambda v$
where $A$ is the original square matrix, $\lambda$ is an eigenvalue, and $v$ is the corresponding eigenvector.
We can rewrite this as:
$(A - \lambda I)v = 0$
where I is the identity matrix of shape A.
Since we are looking for a nonzero vector $v$, this means the determinant of $A - \lambda I$ must be 0. Solving that equation gives us the eigenvalues.
We can then find the eigenvectors by substituting each eigenvalue back in. For a $2 \times 2$ matrix $A$:
$\begin{bmatrix} a - \lambda & b \\ c & d - \lambda \end{bmatrix} v = 0$
Applying each eigenvalue and solving for $v$ gives the corresponding eigenvector.
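As a concrete, made-up example (not the iris covariance matrix), take:
$A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$
$\det(A - \lambda I) = (2 - \lambda)^2 - 1 = 0 \implies \lambda = 3, 1$
Substituting $\lambda = 3$ into $(A - \lambda I)v = 0$ gives an eigenvector proportional to $(1, 1)$; substituting $\lambda = 1$ gives one proportional to $(1, -1)$.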
So if the eigenvalues of the covariance matrix represent how much variance each component explains (ordered from most to least)... how do we decide how many components to keep?
What does this remind us of?
Let's walk through a few examples of decomposition with random data. We'll start with data where we expect the covariance values to be the same as the variance of each feature.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
random_data = pd.DataFrame({
'x': range(1, 10),
'y': range(1, 10),
'z': range(1, 10),
})
print np.cov(random_data.T, bias=1)
print np.var(random_data.x.T)
[[ 6.66666667  6.66666667  6.66666667]
 [ 6.66666667  6.66666667  6.66666667]
 [ 6.66666667  6.66666667  6.66666667]]
6.66666666667
Next, we want to pull out the eigenvalues and eigenvectors:
eig, Q = np.linalg.eig(np.cov(random_data.T, bias=1))
# sort for largest eigenvalue
print eig
print Q
[  1.77635684e-15   2.00000000e+01   0.00000000e+00]
[[-0.81649658  0.57735027  0.        ]
 [ 0.40824829  0.57735027 -0.70710678]
 [ 0.40824829  0.57735027  0.70710678]]
import seaborn as sns
from __future__ import division
eigsort = np.sort(eig)[::-1]
sns.set_style('white')
plt.figure()
plt.plot(range(1, len(eigsort) + 1), eigsort)
plt.xlabel('principal component')
plt.ylabel('explained variances')
plt.show()
plt.figure()
plt.plot(range(1, len(eigsort) + 1), eigsort / sum(eigsort))
plt.xlabel('principal component')
plt.ylabel('% explained variance')
plt.show()
As we'd expect with identical columns, PC1 explains all of the variance in this data set, since the three features are literally the same. But what does this new feature look like?
# ordering eigenvalues and vectors together
ordered = sorted(zip(eig, Q.T), reverse=True)
eig = np.array([_[0] for _ in ordered])
Q = np.column_stack([_[1] for _ in ordered])
# transforming data: we take the dot product of the (transposed) eigenvector matrix and the data
X_transformed = np.dot(Q.T, random_data.T)
print X_transformed[0]
[  1.73205081   3.46410162   5.19615242   6.92820323   8.66025404
  10.39230485  12.12435565  13.85640646  15.58845727]
plt.figure()
plt.plot(random_data.y, random_data.x, '.')
plt.figure()
plt.plot(X_transformed[0], '.')
And for sanity, let's compare our PC1 vs what sklearn would spit out.
from sklearn import decomposition
plt.plot(decomposition.PCA().fit_transform(random_data).T[0], '.')
What happens when we start introducing noise into our data? Run PCA on our new dataset below and evaluate what changes:
random_data_scattered = pd.DataFrame({
'x': range(1, 100),
'y': range(1, 100),
'z': range(1, 100),
})
random_data_scattered['y'] = random_data_scattered.y.apply(lambda y: y + np.random.normal(scale=10))
random_data_scattered['z'] = random_data_scattered.z.apply(lambda z: z + np.random.normal(scale=20))
print random_data_scattered.head()
plt.plot(random_data_scattered.x, random_data_scattered.y, '.')
   x          y          z
0  1  -1.878842   6.794887
1  2  -8.625472  44.035247
2  3  26.777612  12.021532
3  4   6.347470  -4.309537
4  5  11.734814   5.665158
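One way to work the exercise (a sketch using sklearn's PCA, as in the comparison above) is to look at how the explained variance is now spread across components instead of sitting entirely in PC1:
from sklearn import decomposition

# with noise added to y and z, PC1 no longer captures everything;
# explained_variance_ratio_ shows how the variance splits across components
pca = decomposition.PCA()
pca.fit(random_data_scattered)
print pca.explained_variance_ratio_

plt.figure()
plt.plot(range(1, 4), pca.explained_variance_ratio_)
plt.xlabel('principal component')
plt.ylabel('% explained variance')
plt.show()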
PCA also allows for kernels, much like SVMs, because sometimes we are not looking for a linear solution. The strategy and technique are the same. There's a sample script saved as scripts/kernel_pca.py; feel free to run the code and experiment.
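If you want a starting point before opening that script, a minimal sketch using sklearn's KernelPCA (the kernel and gamma values here are arbitrary choices) looks like:
from sklearn import decomposition

# kernel PCA with an RBF kernel: the fit/transform pattern is the same as
# linear PCA, but the components can capture nonlinear structure
kpca = decomposition.KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
transformed = kpca.fit_transform(random_data_scattered)
print transformed[:5]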