Nearest Neighbors is a seemingly simple algorithm, though it is quite different from the classifiers we've learned so far: rather than fitting a model, it memorizes the training observations and labels new points by the observations nearest to them.
This makes KNN fairly intuitive to understand, though not very explanatory as a learner. Because of this, KNN fares well as an imputer (what is an imputer?). KNN essentially boils down to:
1. Store the training observations.
2. For a new observation, find the K most similar (nearest) stored observations.
3. Predict by majority vote (classification) or by averaging (regression) over those K neighbors.
A minimal sketch of these steps follows below.
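To make those steps concrete, here's a from-scratch toy sketch (for intuition only, not sklearn's implementation; knn_predict and its toy arrays are made up here), using Euclidean distance, one of the metrics discussed next:
import numpy as np
def knn_predict(X_train, y_train, x_new, k=3):
    # step 2: Euclidean distance from x_new to every stored observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest stored observations
    nearest = np.argsort(distances)[:k]
    # step 3: majority vote across those neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
X = np.array([[1., 1.], [1., 2.], [8., 8.], [9., 8.]])
y = np.array([0, 0, 1, 1])
knn_predict(X, y, np.array([1.5, 1.5]), k=3)  # two of the three nearest are class 0, so it predicts 0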
There are a variety of ways to calculate similarity and distance. Below are a few common examples and when you'd expect to use them.
Jaccard Coefficient / Jaccard Distance
The Jaccard Coefficient follows logic of probability:
$J(A,B) = \dfrac{|A \cap B|}{|A \cup B|}$
The above reads, "Jaccard's similarity coefficient between sets A and B is the size of the intersection of A and B divided by the size of their union."
A union is all unique possible values from each set combined, while an intersection is what they have in common. Consider these two sets defining two houses:
import pandas as pd
houses = pd.DataFrame({
'rooms': [2, 1],
'bathrooms': [1, 1],
'stove': ['gas', 'electric'],
'heat': ['radiator', 'radiator'],
})
houses
 | bathrooms | heat | rooms | stove
---|---|---|---|---
0 | 1 | radiator | 2 | gas
1 | 1 | radiator | 1 | electric
The union of the two houses would contain the set:
bathrooms_1, heat_radiator, rooms_2, rooms_1, stove_gas, stove_electric
And the intersection would contain:
bathrooms_1, heat_radiator
We'd get a Jaccard coefficient of 0.333:
$\dfrac{len(intersection)}{len(union)} = \dfrac{2}{6}$
Jaccard's distance is just 1 minus the similarity coefficient, so the distance between these houses would be 0.667.
Jaccard's distance is great to use as a distance metric for binary and categorical data.
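Here's a quick sketch of that computation in Python, encoding each house's features as a set of feature_value tokens (the variable names are just for illustration):
house_0 = {'bathrooms_1', 'heat_radiator', 'rooms_2', 'stove_gas'}
house_1 = {'bathrooms_1', 'heat_radiator', 'rooms_1', 'stove_electric'}
intersection = house_0 & house_1   # what the two houses share
union = house_0 | house_1          # all unique tokens combined
jaccard_similarity = len(intersection) / float(len(union))  # 2 / 6 = 0.333
jaccard_distance = 1 - jaccard_similarity                   # 0.667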
Manhattan Distance
Think of Manhattan distance (or taxicab distance) as the distance traveled between two points on a grid-like system (like Manhattan). So, if avenue and street blocks were the same length, the Manhattan distance from Google (14th Street and 8th Ave) to General Assembly (21st St and 5th Ave) would be |21 - 14| + |8 - 5| = 7 + 3 = 10 blocks. Manhattan distance is solved as:
$distance = \sum{|n_1 - n_2|}$
where $n$ is each feature. (Sidebar: This is the same concept as L1 regularization)
Euclidean Distance
$distance = \sqrt{\sum{(n_1 - n_2)^2}}$
Euclidean distance is more like, say, helicopter distance: if we had to travel between two points, what would be the straight line taking us from point A to point B? (Sidebar: This is the same concept as L2 regularization)
What to use?
These two metrics generalize to Minkowski distances: Manhattan is Minkowski with p of 1, and Euclidean is Minkowski with p of 2.
Both of these distance metrics work well with continuous data, though it is recommended to use Manhattan distance when per-feature differences of 1 and 3 should be treated the same as differences of 2 and 2: both sum to 4 under Manhattan, while Euclidean penalizes the lopsided pair more, as the sketch below shows.
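To see that difference concretely, here's a small numpy sketch (toy points chosen to match the example in the last sentence):
import numpy as np
origin = np.array([0., 0.])
p1 = np.array([1., 3.])  # per-feature differences of 1 and 3
p2 = np.array([2., 2.])  # per-feature differences of 2 and 2
# Manhattan (p=1): both come out to 4.0
print np.abs(origin - p1).sum(), np.abs(origin - p2).sum()
# Euclidean (p=2): ~3.16 vs ~2.83, so the lopsided pair is penalized more
print np.sqrt(((origin - p1) ** 2).sum()), np.sqrt(((origin - p2) ** 2).sum())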
Imputation is fairly intuitive: for the missing data in our dataset, we replace it with values that come from similar records in the dataset that aren't null. To do this we'll use the nearest-neighbors algorithm. Even when imputing, we'll hold out a test set.
from sklearn import cross_validation
df = pd.read_csv("../data/credit-data-post-import.csv")
train, test = cross_validation.train_test_split(df)
train = pd.DataFrame(train, columns=df.columns)
test = pd.DataFrame(test, columns=df.columns)
from sklearn.neighbors import KNeighborsRegressor
income_imputer = KNeighborsRegressor(n_neighbors=1)
# split our data into 2 groups; data containing nulls, and data not containing nulls.
# we'll train on the latter and make 'predictions' on the null data to impute monthly_income
train_w_monthly_income = train[train.monthly_income.notnull()].copy()
train_w_null_monthly_income = train[train.monthly_income.isnull()].copy()
correlations = train_w_monthly_income.corr()
# which columns are most correlated with monthly income?
# This will be one quick way to help us find our KNN features
monthly_income_correlations = correlations['monthly_income']
monthly_income_correlations.order(ascending=False)
monthly_income                                  1.000000
number_real_estate_loans_or_lines               0.131796
number_of_open_credit_lines_and_loans           0.094435
number_of_dependents                            0.066789
age                                             0.037536
revolving_utilization_of_unsecured_lines        0.007417
number_of_time30-59_days_past_due_not_worse    -0.010298
number_of_time60-89_days_past_due_not_worse    -0.011455
number_of_times90_days_late                    -0.013151
serious_dlqin2yrs                              -0.019518
debt_ratio                                     -0.028590
Name: monthly_income, dtype: float64
cols = ['number_real_estate_loans_or_lines', 'number_of_open_credit_lines_and_loans']
income_imputer = KNeighborsRegressor(n_neighbors=1, p=2)
income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
# fill in the missing data with the imputed values!
train_w_null_monthly_income['monthly_income'] = income_imputer.predict(train_w_null_monthly_income[cols])
test['monthly_income_imputed'] = income_imputer.predict(test[cols])
test[['monthly_income', 'monthly_income_imputed']].head()
 | monthly_income | monthly_income_imputed
---|---|---
0 | 1500 | 1903
1 | NaN | 5179
2 | NaN | 3000
3 | 15417 | 300
4 | 3000 | 2463
income_imputer = KNeighborsRegressor(n_neighbors=2, p=2)
income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
# fill in the missing data with the imputed values!
train_w_null_monthly_income['monthly_income'] = income_imputer.predict(train_w_null_monthly_income[cols])
test['monthly_income_imputed'] = income_imputer.predict(test[cols])
test[['monthly_income', 'monthly_income_imputed']].head()
 | monthly_income | monthly_income_imputed
---|---|---
0 | 1500 | 1551.5
1 | NaN | 3689.5
2 | NaN | 10041.5
3 | 15417 | 3066.5
4 | 3000 | 4091.5
Notice that with n_neighbors=2 each prediction is the average of the two nearest neighbors' incomes, which is why the imputed values above end in .5.
from sklearn import grid_search
imputer = KNeighborsRegressor()
parameters1 = {'p': (1, 2), 'n_neighbors': range(1, 23, 3)}
regr = grid_search.GridSearchCV(imputer, parameters1, scoring='mean_squared_error')
regr.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
GridSearchCV(cv=None, estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'p': (1, 2), 'n_neighbors': [1, 4, 7, 10, 13, 16, 19, 22]}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='mean_squared_error', verbose=0)
parameters2 = {'metric': ['jaccard'], 'n_neighbors': range(1, 23, 3)}
jac = grid_search.GridSearchCV(imputer, parameters2, scoring='mean_squared_error')
jac.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
GridSearchCV(cv=None, estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'n_neighbors': [1, 4, 7, 10, 13, 16, 19, 22], 'metric': ['jaccard']}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='mean_squared_error', verbose=0)
print "----------Minkowski Grid Search-------------"
for score in regr.grid_scores_:
print score
print regr.best_estimator_
print "----------Jaccard Grid Search-------------"
for score in jac.grid_scores_:
print score
print jac.best_estimator_
----------Minkowski Grid Search-------------
mean: -215341152.60469, std: 123877304.76588, params: {'n_neighbors': 1, 'p': 1}
mean: -216159497.73664, std: 123224883.07285, params: {'n_neighbors': 1, 'p': 2}
mean: -594223289.79780, std: 448790328.46063, params: {'n_neighbors': 4, 'p': 1}
mean: -594126885.52172, std: 448657740.99690, params: {'n_neighbors': 4, 'p': 2}
mean: -323496260.39586, std: 134902360.14763, params: {'n_neighbors': 7, 'p': 1}
mean: -323504849.66396, std: 134891185.93772, params: {'n_neighbors': 7, 'p': 2}
mean: -255258829.87394, std: 102245766.74486, params: {'n_neighbors': 10, 'p': 1}
mean: -255303109.44891, std: 102215572.73639, params: {'n_neighbors': 10, 'p': 2}
mean: -228941505.10741, std: 106738250.58592, params: {'n_neighbors': 13, 'p': 1}
mean: -228919030.40507, std: 106729247.86480, params: {'n_neighbors': 13, 'p': 2}
mean: -216559007.62356, std: 112749380.10789, params: {'n_neighbors': 16, 'p': 1}
mean: -216424317.38873, std: 112712944.94071, params: {'n_neighbors': 16, 'p': 2}
mean: -208462788.41352, std: 116373753.10405, params: {'n_neighbors': 19, 'p': 1}
mean: -208416965.21421, std: 116430228.74066, params: {'n_neighbors': 19, 'p': 2}
mean: -203677653.20021, std: 119755628.84419, params: {'n_neighbors': 22, 'p': 1}
mean: -203666544.54643, std: 119768973.85313, params: {'n_neighbors': 22, 'p': 2}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=22, p=2, weights='uniform')
----------Jaccard Grid Search-------------
mean: -196665505.67847, std: 127828994.82974, params: {'n_neighbors': 1, 'metric': 'jaccard'}
mean: -191471579.79710, std: 128582668.45339, params: {'n_neighbors': 4, 'metric': 'jaccard'}
mean: -192829528.05021, std: 129679645.79434, params: {'n_neighbors': 7, 'metric': 'jaccard'}
mean: -193372270.53983, std: 128655873.41015, params: {'n_neighbors': 10, 'metric': 'jaccard'}
mean: -194563923.90732, std: 129599563.31278, params: {'n_neighbors': 13, 'metric': 'jaccard'}
mean: -195589120.11682, std: 128214527.09193, params: {'n_neighbors': 16, 'metric': 'jaccard'}
mean: -195384714.28321, std: 128155966.24802, params: {'n_neighbors': 19, 'metric': 'jaccard'}
mean: -195604247.62062, std: 128267248.12180, params: {'n_neighbors': 22, 'metric': 'jaccard'}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='jaccard', metric_params=None, n_neighbors=4, p=2, weights='uniform')
Read through the results of the grid search above and take a few minutes to answer the following questions:
Let's walk through the same process, but using nearest neighbors in its more common form: as a classifier. In this case, we'll classify the iris flowers.
from sklearn import grid_search, datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
parameters = {'p': [2], 'n_neighbors':range(1, 100)}
knn = KNeighborsClassifier()
clf = grid_search.GridSearchCV(knn, parameters, scoring='accuracy')
clf.fit(iris.data, iris.target)
GridSearchCV(cv=None, estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'p': [2], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='accuracy', verbose=0)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
scores = [i.mean_validation_score for i in clf.grid_scores_]
plt.plot(scores)
[Figure: mean cross-validated accuracy for each n_neighbors value]
Take a few minutes to answer these questions:
import numpy as np
def plot_estimator(estimator, X, y, ax, pos):
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    # perform classification on every point of the mesh
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # put the result into a color plot of the decision regions
    Z = Z.reshape(xx.shape)
    ax[pos].pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
    # let's plot our sample points, colored by true class
    ax[pos].scatter(X[:, 0], X[:, 1], c=y)
fig, axes = plt.subplots(1, 5, figsize=[12.5, 2.5])
for e, k in enumerate([1, 5, 40, 60, 100]):
plot_estimator(KNeighborsClassifier(n_neighbors=k), iris.data[:,:2], iris.target, axes, e)
KNN is another example of supervised learning: a machine learning technique used when we already know what the target variable looks like, and we want to either classify or regress to solve for y. What happens if we don't know?
Primary unsupervised learning principles:
Dimensionality Reduction: a technique for reducing the number of features in the data, most often while maintaining the variance of the data.
Clustering: a technique for finding new structure in the data, often using distance or similarity as a guideline.
Keep in mind the distinction here: clustering is unsupervised, while classification is supervised (we already know the answer).
Say we need to divide our data into groups. We believe there is some "hidden" structure in the data that could be defined by clusters.
A simpler clustering example would be hierarchical clustering. Consider the sketch below:
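Here's a hedged illustration of the idea using scipy's hierarchy module (the toy points are made up; only the linkage and fcluster calls are the real API):
import numpy as np
from scipy.cluster import hierarchy
# six 2-D points in two loose groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.3, 4.8], [4.9, 5.2]])
# agglomerative clustering: start with every point as its own cluster,
# then repeatedly merge the two closest clusters until only one remains
Z = hierarchy.linkage(points, method='average', metric='euclidean')
# cutting the resulting tree at a chosen distance yields flat cluster labels
print hierarchy.fcluster(Z, t=2, criterion='distance')  # e.g. [1 1 1 2 2 2]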
Check out the beers.txt file in the data section of the repo (it is space delimited). Take a few minutes to think about the following:
Review your final project with your seat partner. Identify points where it may be interesting to find clusters in your dataset. Keep in mind, there could equally be no advantage to clustering, so don't go out of your way! Walk through the same process as above.
Hierarchical clustering does not predefine the number of clusters; it simply "walks" from clusters = number of observations down to clusters = 1. K Means, instead, solves for a given or expected K clusters. This means that, just like KNN, we need to solve for K.
K Means is an iterative algorithm, meaning it repeats a set of steps until the solution is found (or converges):
1. Place K centroids, e.g. at randomly chosen observations.
2. Assign each observation to its nearest centroid.
3. Move each centroid to the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
A minimal sketch of these steps follows below.
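This is a from-scratch toy version in numpy for intuition only (it skips edge cases like empty clusters; we'll use sklearn's cluster.KMeans further down):
import numpy as np
def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.RandomState(seed)
    # step 1: start the k centroids at randomly chosen observations
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each observation to its nearest centroid (Euclidean)
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned observations
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer move (converged)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids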
Since K Means in sklearn requires the Euclidean distance (Minkowski with p=2), K Means will work best with numerical data.
The rule of thumb: K = half the number of observations. This, if anything, is a good starting point, but not an end-all solution.
One common approach is using explained variance: how many clusters are required to explain the variance of the data set? We can use a simple rule to help define that: keep adding clusters as long as each new cluster explains a meaningful chunk of additional variance, and stop once the gains level off.
This technique is called the elbow rule and applies to many concepts of unsupervised learning! Let's apply it to the iris data set:
from __future__ import division
from sklearn import cluster
sepal_data = iris.data[:, :2]
krange = range(1, int(sepal_data.shape[0] / 2))
sum_squares = [cluster.KMeans(n_clusters=k).fit(sepal_data).inertia_ for k in krange]
variance_explained = [1.0 - (s / sum(sum_squares)) for s in sum_squares]
plt.figure()
plt.plot(range(1, 75), variance_explained)
plt.ylabel('% of explained variance')
plt.xlabel('clusters K')
plt.xlim([1, 75])
[Figure: % of explained variance by number of clusters K, for K from 1 to 74]
It looks like we bend much earlier than 75, so let's zoom in a bit...
plt.figure()
plt.plot(range(1, 12), variance_explained[:11])
plt.ylabel('% of explained variance')
plt.xlabel('clusters K')
plt.xlim([1, 11])
[Figure: the same elbow plot, zoomed in to K from 1 to 11]
It seems like the right number of clusters for us would be roughly 3 or 4, after which each added cluster explains less and less of the variance. We see 3 clusters define at least 90% of the variance, which is a great bend point to use.
## -- Actual --
plt.figure(figsize=[4, 4])
plt.scatter(sepal_data[:, 0], sepal_data[:, 1], c=iris.target)
plt.title('Iris Targets')
plt.ylabel(iris.feature_names[1])
plt.xlabel(iris.feature_names[0])
## -- 3 Clusters --
clusters_3 = cluster.KMeans(n_clusters=3).fit(sepal_data)
fig, ax = plt.subplots(1, 2, figsize=[9, 4])
fig.suptitle('K Cluster Targets')
ax[0].scatter(sepal_data[:, 0], sepal_data[:, 1], c=clusters_3.predict(sepal_data))
## -- 4 Clusters --
clusters_4 = cluster.KMeans(n_clusters=4).fit(sepal_data)
ax[1].scatter(sepal_data[:, 0], sepal_data[:, 1], c=clusters_4.predict(sepal_data))
[Figures: actual iris targets, and K-Means cluster assignments for K=3 and K=4, plotted over the two sepal features]
How could we describe the clusters we just found? The simplest way would be to apply a set of rules. What's a machine learning technique that we can use to define rules?
from sklearn import tree
# fit a classification tree with max_depth=2 on all data
treeclf = tree.DecisionTreeClassifier(max_depth=2, random_state=1)
treeclf.fit(sepal_data, clusters_3.predict(sepal_data))
# create a Graphviz file
with open("iris_clusters.dot", 'wb') as f:
f = tree.export_graphviz(treeclf, out_file=f, feature_names=iris.feature_names[:2])
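If you have Graphviz installed, you can render the file from the command line (e.g. dot -Tpng iris_clusters.dot -o iris_clusters.png) to see the rules the tree learned for separating the three clusters.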
Check out the beer.txt file in the data section of the repo. It is fairly small.
import bs4
import requests
# grab the first page of Google results for "data science"
url = 'https://www.google.com/search?q=data+science&start=%s'
l = requests.get(url % 0)
soup = bs4.BeautifulSoup(l.text)
# each result snippet lives in a <span class="st"> tag
spans = soup.findAll('span', { "class" : "st" })
descriptions = pd.DataFrame([span.text.encode('ascii', 'replace') for span in spans])
print len(descriptions)
9