Nearest Neighbors is a seemingly simple algorithm, though it is quite different from the classifiers we've learned so far: rather than fitting a model, it memorizes the training observations and labels new points by the observations nearest to them.
This makes KNN fairly intuitive to understand, though not very explanatory as a learner. Because of this, KNN fares well as an imputer (what is an imputer?). KNN essentially boils down to:
1. Store the training observations.
2. For a new observation, find the K most similar (nearest) stored observations.
3. Predict by majority vote (classification) or by averaging (regression) over those K neighbors.
A minimal sketch of these steps follows below.
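To make those steps concrete, here's a from-scratch toy sketch (for intuition only, not sklearn's implementation; knn_predict and its toy arrays are made up here), using Euclidean distance, one of the metrics discussed next:
import numpy as np
def knn_predict(X_train, y_train, x_new, k=3):
    # step 2: Euclidean distance from x_new to every stored observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k nearest stored observations
    nearest = np.argsort(distances)[:k]
    # step 3: majority vote across those neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
X = np.array([[1., 1.], [1., 2.], [8., 8.], [9., 8.]])
y = np.array([0, 0, 1, 1])
knn_predict(X, y, np.array([1.5, 1.5]), k=3)  # two of the three nearest are class 0, so it predicts 0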
There are a variety of ways to calculate similarity and distance. Below are a few common examples and when you'd expect to use them.
Jaccard Coefficient / Jaccard Distance
The Jaccard Coefficient follows logic of probability:
$J(A,B) = \dfrac{|A \cap B|}{|A \cup B|}$
The above reads, "Jaccard's similarity coefficient between sets A and B is the size of the intersection of A and B divided by the size of their union."
A union is all unique possible values from each set combined, while an intersection is what they have in common. Consider these two sets defining two houses:
import pandas as pd
houses = pd.DataFrame({
'rooms': [2, 1],
'bathrooms': [1, 1],
'stove': ['gas', 'electric'],
'heat': ['radiator', 'radiator'],
})
houses
 | bathrooms | heat | rooms | stove
---|---|---|---|---
0 | 1 | radiator | 2 | gas
1 | 1 | radiator | 1 | electric
The union of the two houses would contain the set:
bathrooms_1, heat_radiator, rooms_2, rooms_1, stove_gas, stove_electric
And the intersection would contain:
bathrooms_1, heat_radiator
We'd get a Jaccard coefficient of 0.333:
$\dfrac{len(intersection)}{len(union)} = \dfrac{2}{6}$
Jaccard's distance is just 1 minus the similarity coefficient, so the distance between these houses would be 0.667.
Jaccard's distance is great to use as a distance metric for binary and categorical data.
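Here's a quick sketch of that computation in Python, encoding each house's features as a set of feature_value tokens (the variable names are just for illustration):
house_0 = {'bathrooms_1', 'heat_radiator', 'rooms_2', 'stove_gas'}
house_1 = {'bathrooms_1', 'heat_radiator', 'rooms_1', 'stove_electric'}
intersection = house_0 & house_1   # what the two houses share
union = house_0 | house_1          # all unique tokens combined
jaccard_similarity = len(intersection) / float(len(union))  # 2 / 6 = 0.333
jaccard_distance = 1 - jaccard_similarity                   # 0.667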
Manhattan Distance
Think of Manhattan distance (or taxicab distance) as the distance traveled between two points on a grid-like system (like Manhattan). So, if avenue and street blocks were the same length, the Manhattan distance from Google (14th Street and 8th Ave) to General Assembly (21st St and 5th Ave) would be |21 - 14| + |8 - 5| = 7 + 3 = 10 blocks. Manhattan distance is solved as:
$distance = \sum{|n_1 - n_2|}$
where $n$ is each feature. (Sidebar: This is the same concept as L1 regularization)
Euclidean Distance
$distance = \sqrt{\sum{(n_1 - n_2)^2}}$
Euclidean distance is more like, say, helicopter distance: if we had to travel between two points, what would be the straight line taking us from point A to point B? (Sidebar: This is the same concept as L2 regularization)
What to use?
These two metrics generalize to Minkowski distances: Manhattan is Minkowski with p of 1, and Euclidean is Minkowski with p of 2.
Both of these distance metrics work well with continuous data, though it is recommended to use Manhattan distance when per-feature differences of 1 and 3 should be treated the same as differences of 2 and 2: both sum to 4 under Manhattan, while Euclidean penalizes the lopsided pair more, as the sketch below shows.
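To see that difference concretely, here's a small numpy sketch (toy points chosen to match the example in the last sentence):
import numpy as np
origin = np.array([0., 0.])
p1 = np.array([1., 3.])  # per-feature differences of 1 and 3
p2 = np.array([2., 2.])  # per-feature differences of 2 and 2
# Manhattan (p=1): both come out to 4.0
print np.abs(origin - p1).sum(), np.abs(origin - p2).sum()
# Euclidean (p=2): ~3.16 vs ~2.83, so the lopsided pair is penalized more
print np.sqrt(((origin - p1) ** 2).sum()), np.sqrt(((origin - p2) ** 2).sum())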
Imputation is fairly intuitive: for the missing data in our dataset, we replace it with values that come from similar records in the dataset that aren't null. To do this we'll use the nearest-neighbors algorithm. Even when imputing, we'll hold out a test set.
from sklearn import cross_validation
df = pd.read_csv("../data/credit-data-post-import.csv")
train, test = cross_validation.train_test_split(df)
train = pd.DataFrame(train, columns=df.columns)
test = pd.DataFrame(test, columns=df.columns)
from sklearn.neighbors import KNeighborsRegressor
income_imputer = KNeighborsRegressor(n_neighbors=1)
# split our data into 2 groups; data containing nulls, and data not containing nulls.
# we'll train on the latter and make 'predictions' on the null data to impute monthly_income
train_w_monthly_income = train[train.monthly_income.notnull()].copy()
train_w_null_monthly_income = train[train.monthly_income.isnull()].copy()
correlations = train_w_monthly_income.corr()
# which columns are most correlated with monthly income?
# This will be one quick way to help us find our KNN features
monthly_income_correlations = correlations['monthly_income']
monthly_income_correlations.order(ascending=False)
monthly_income                                  1.000000
number_real_estate_loans_or_lines               0.131796
number_of_open_credit_lines_and_loans           0.094435
number_of_dependents                            0.066789
age                                             0.037536
revolving_utilization_of_unsecured_lines        0.007417
number_of_time30-59_days_past_due_not_worse    -0.010298
number_of_time60-89_days_past_due_not_worse    -0.011455
number_of_times90_days_late                    -0.013151
serious_dlqin2yrs                              -0.019518
debt_ratio                                     -0.028590
Name: monthly_income, dtype: float64
cols = ['number_real_estate_loans_or_lines', 'number_of_open_credit_lines_and_loans']
income_imputer = KNeighborsRegressor(n_neighbors=1, p=2)
income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
# fill in the missing data with the imputed values!
train_w_null_monthly_income['monthly_income'] = income_imputer.predict(train_w_null_monthly_income[cols])
test['monthly_income_imputed'] = income_imputer.predict(test[cols])
test[['monthly_income', 'monthly_income_imputed']].head()
 | monthly_income | monthly_income_imputed
---|---|---
0 | 1500 | 1903
1 | NaN | 5179
2 | NaN | 3000
3 | 15417 | 300
4 | 3000 | 2463
income_imputer = KNeighborsRegressor(n_neighbors=2, p=2)
income_imputer.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
# fill in the missing data with the imputed values!
train_w_null_monthly_income['monthly_income'] = income_imputer.predict(train_w_null_monthly_income[cols])
test['monthly_income_imputed'] = income_imputer.predict(test[cols])
test[['monthly_income', 'monthly_income_imputed']].head()
 | monthly_income | monthly_income_imputed
---|---|---
0 | 1500 | 1551.5
1 | NaN | 3689.5
2 | NaN | 10041.5
3 | 15417 | 3066.5
4 | 3000 | 4091.5
Notice that with n_neighbors=2 each prediction is the average of the two nearest neighbors' incomes, which is why the imputed values above end in .5.
from sklearn import grid_search
imputer = KNeighborsRegressor()
parameters1 = {'p': (1, 2), 'n_neighbors': range(1, 23, 3)}
regr = grid_search.GridSearchCV(imputer, parameters1, scoring='mean_squared_error')
regr.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
GridSearchCV(cv=None, estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'p': (1, 2), 'n_neighbors': [1, 4, 7, 10, 13, 16, 19, 22]}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='mean_squared_error', verbose=0)
parameters2 = {'metric': ['jaccard'], 'n_neighbors': range(1, 23, 3)}
jac = grid_search.GridSearchCV(imputer, parameters2, scoring='mean_squared_error')
jac.fit(train_w_monthly_income[cols], train_w_monthly_income.monthly_income)
GridSearchCV(cv=None, estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'n_neighbors': [1, 4, 7, 10, 13, 16, 19, 22], 'metric': ['jaccard']}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='mean_squared_error', verbose=0)
print "----------Minkowski Grid Search-------------"
for score in regr.grid_scores_:
print score
print regr.best_estimator_
print "----------Jaccard Grid Search-------------"
for score in jac.grid_scores_:
print score
print jac.best_estimator_
----------Minkowski Grid Search-------------
mean: -215341152.60469, std: 123877304.76588, params: {'n_neighbors': 1, 'p': 1}
mean: -216159497.73664, std: 123224883.07285, params: {'n_neighbors': 1, 'p': 2}
mean: -594223289.79780, std: 448790328.46063, params: {'n_neighbors': 4, 'p': 1}
mean: -594126885.52172, std: 448657740.99690, params: {'n_neighbors': 4, 'p': 2}
mean: -323496260.39586, std: 134902360.14763, params: {'n_neighbors': 7, 'p': 1}
mean: -323504849.66396, std: 134891185.93772, params: {'n_neighbors': 7, 'p': 2}
mean: -255258829.87394, std: 102245766.74486, params: {'n_neighbors': 10, 'p': 1}
mean: -255303109.44891, std: 102215572.73639, params: {'n_neighbors': 10, 'p': 2}
mean: -228941505.10741, std: 106738250.58592, params: {'n_neighbors': 13, 'p': 1}
mean: -228919030.40507, std: 106729247.86480, params: {'n_neighbors': 13, 'p': 2}
mean: -216559007.62356, std: 112749380.10789, params: {'n_neighbors': 16, 'p': 1}
mean: -216424317.38873, std: 112712944.94071, params: {'n_neighbors': 16, 'p': 2}
mean: -208462788.41352, std: 116373753.10405, params: {'n_neighbors': 19, 'p': 1}
mean: -208416965.21421, std: 116430228.74066, params: {'n_neighbors': 19, 'p': 2}
mean: -203677653.20021, std: 119755628.84419, params: {'n_neighbors': 22, 'p': 1}
mean: -203666544.54643, std: 119768973.85313, params: {'n_neighbors': 22, 'p': 2}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=22, p=2, weights='uniform')
----------Jaccard Grid Search-------------
mean: -196665505.67847, std: 127828994.82974, params: {'n_neighbors': 1, 'metric': 'jaccard'}
mean: -191471579.79710, std: 128582668.45339, params: {'n_neighbors': 4, 'metric': 'jaccard'}
mean: -192829528.05021, std: 129679645.79434, params: {'n_neighbors': 7, 'metric': 'jaccard'}
mean: -193372270.53983, std: 128655873.41015, params: {'n_neighbors': 10, 'metric': 'jaccard'}
mean: -194563923.90732, std: 129599563.31278, params: {'n_neighbors': 13, 'metric': 'jaccard'}
mean: -195589120.11682, std: 128214527.09193, params: {'n_neighbors': 16, 'metric': 'jaccard'}
mean: -195384714.28321, std: 128155966.24802, params: {'n_neighbors': 19, 'metric': 'jaccard'}
mean: -195604247.62062, std: 128267248.12180, params: {'n_neighbors': 22, 'metric': 'jaccard'}
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='jaccard', metric_params=None, n_neighbors=4, p=2, weights='uniform')
Read through the results of the grid search above and take a few minutes to answer the following questions:
Let's walk through the same process, but using nearest neighbors in its more common form: as a classifier. In this case, we'll classify the iris flowers.
from sklearn import grid_search, datasets
from sklearn.neighbors import KNeighborsClassifier
iris = datasets.load_iris()
parameters = {'p': [2], 'n_neighbors':range(1, 100)}
knn = KNeighborsClassifier()
clf = grid_search.GridSearchCV(knn, parameters, scoring='accuracy')
clf.fit(iris.data, iris.target)
GridSearchCV(cv=None, estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=5, p=2, weights='uniform'), fit_params={}, iid=True, loss_func=None, n_jobs=1, param_grid={'p': [2], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]}, pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='accuracy', verbose=0)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
scores = [i.mean_validation_score for i in clf.grid_scores_]
plt.plot(scores)
[Figure: mean cross-validated accuracy for each n_neighbors value]
Take a few minutes to answer these questions:
import numpy as np
def plot_estimator(estimator, X, y, ax, pos):
    estimator.fit(X, y)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + .1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + .1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    # perform classification on every point of the mesh
    Z = estimator.predict(np.c_[xx.ravel(), yy.ravel()])
    # put the result into a color plot of the decision regions
    Z = Z.reshape(xx.shape)
    ax[pos].pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)
    # let's plot our sample points, colored by true class
    ax[pos].scatter(X[:, 0], X[:, 1], c=y)
fig, axes = plt.subplots(1, 5, figsize=[12.5, 2.5])
for e, k in enumerate([1, 5, 40, 60, 100]):
plot_estimator(KNeighborsClassifier(n_neighbors=k), iris.data[:,:2], iris.target, axes, e)
KNN is another example of supervised learning: a machine learning technique used when we already know what the target variable looks like, and we want to either classify or regress to solve for y. What happens if we don't know?
Primary unsupervised learning principles:
Dimensionality Reduction: a technique for reducing the number of features in the data, most often while maintaining the variance of the data.
Clustering: a technique for finding new structure in the data, often using distance or similarity as a guideline.
Keep in mind the distinction here: clustering is unsupervised, while classification is supervised (we already know the answer).
Say we need to divide our data into groups. We believe there is some "hidden" structure in the data that could be defined by clusters.
A simpler clustering example would be hierarchical clustering. Consider the sketch below:
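Here's a hedged illustration of the idea using scipy's hierarchy module (the toy points are made up; only the linkage and fcluster calls are the real API):
import numpy as np
from scipy.cluster import hierarchy
# six 2-D points in two loose groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.3, 4.8], [4.9, 5.2]])
# agglomerative clustering: start with every point as its own cluster,
# then repeatedly merge the two closest clusters until only one remains
Z = hierarchy.linkage(points, method='average', metric='euclidean')
# cutting the resulting tree at a chosen distance yields flat cluster labels
print hierarchy.fcluster(Z, t=2, criterion='distance')  # e.g. [1 1 1 2 2 2]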
Check out the beers.txt file in the data section of the repo (it is space delimited). Take a few minutes to think about the following:
Review your final project with your seat partner. Identify points where it may be interesting to find clusters in your dataset. Keep in mind, there could equally be no advantage to clustering, so don't go out of your way! Walk through the same process as above.
Hierarchical clustering does not predefine the number of clusters; it simply "walks" from clusters = number of observations down to clusters = 1. K Means, instead, solves for a given or expected K clusters. This means that, just like KNN, we need to solve for K.
K Means is an iterative algorithm, meaning it repeats a set of steps until the solution is found (or converges):
1. Place K centroids, e.g. at randomly chosen observations.
2. Assign each observation to its nearest centroid.
3. Move each centroid to the mean of the observations assigned to it.
4. Repeat steps 2 and 3 until the assignments stop changing.
A minimal sketch of these steps follows below.
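This is a from-scratch toy version in numpy for intuition only (it skips edge cases like empty clusters; we'll use sklearn's cluster.KMeans further down):
import numpy as np
def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.RandomState(seed)
    # step 1: start the k centroids at randomly chosen observations
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each observation to its nearest centroid (Euclidean)
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned observations
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids no longer move (converged)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids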
Since K Means in sklearn requires the Euclidean distance (Minkowski with p=2), K Means will work best with numerical data.
The rule of thumb: K = half the number of observations. This, if anything, is a good starting point, but not an end-all solution.
One common approach is using explained variance: how many clusters are required to explain the variance of the data set? We can use a simple rule to help define that: keep adding clusters as long as each new cluster explains a meaningful chunk of additional variance, and stop once the gains level off.
This technique is called the elbow rule and applies to many concepts of unsupervised learning! Let's apply it to the iris data set:
from __future__ import division
from sklearn import cluster
sepal_data = iris.data[:, :2]
krange = range(1, int(sepal_data.shape[0] / 2))
sum_squares = [cluster.KMeans(n_clusters=k).fit(sepal_data).inertia_ for k in krange]
variance_explained = [1.0 - (s / sum(sum_squares)) for s in sum_squares]
plt.figure()
plt.plot(range(1, 75), variance_explained)
plt.ylabel('% of explained variance')
plt.xlabel('clusters K')
plt.xlim([1, 75])
[Figure: % of explained variance by number of clusters K, for K from 1 to 74]
It looks like we bend much earlier than 75, so let's zoom in a bit...
plt.figure()
plt.plot(range(1, 12), variance_explained[:11])
plt.ylabel('% of explained variance')
plt.xlabel('clusters K')
plt.xlim([1, 11])
[Figure: the same elbow plot, zoomed in to K from 1 to 11]
It seems like the right number of clusters for us would be roughly 3 or 4, after which each added cluster explains less and less of the variance. We see 3 clusters define at least 90% of the variance, which is a great bend point to use.
## -- Actual --
plt.figure(figsize=[4, 4])
plt.scatter(sepal_data[:, 0], sepal_data[:, 1], c=iris.target)
plt.title('Iris Targets')
plt.ylabel(iris.feature_names[1])
plt.xlabel(iris.feature_names[0])
## -- 3 Clusters --
clusters_3 = cluster.KMeans(n_clusters=3).fit(sepal_data)
fig, ax = plt.subplots(1, 2, figsize=[9, 4])
fig.suptitle('K Cluster Targets')
ax[0].scatter(sepal_data[:, 0], sepal_data[:, 1], c=clusters_3.predict(sepal_data))
## -- 4 Clusters --
clusters_4 = cluster.KMeans(n_clusters=4).fit(sepal_data)
ax[1].scatter(sepal_data[:, 0], sepal_data[:, 1], c=clusters_4.predict(sepal_data))
[Figures: actual iris targets, and K-Means cluster assignments for K=3 and K=4, plotted over the two sepal features]
How could we describe the clusters we just found? The simplest way would be to apply a set of rules. What's a machine learning technique that we can use to define rules?
from sklearn import tree
# fit a classification tree with max_depth=2 on all data
treeclf = tree.DecisionTreeClassifier(max_depth=2, random_state=1)
treeclf.fit(sepal_data, clusters_3.predict(sepal_data))
# create a Graphviz file
with open("iris_clusters.dot", 'wb') as f:
f = tree.export_graphviz(treeclf, out_file=f, feature_names=iris.feature_names[:2])
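If you have Graphviz installed, you can render the file from the command line (e.g. dot -Tpng iris_clusters.dot -o iris_clusters.png) to see the rules the tree learned for separating the three clusters.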
Check out the beer.txt file in the data section of the repo. It is fairly small.
import bs4
import requests
# grab the first page of Google results for "data science"
url = 'https://www.google.com/search?q=data+science&start=%s'
l = requests.get(url % 0)
soup = bs4.BeautifulSoup(l.text)
# each result snippet lives in a <span class="st"> tag
spans = soup.findAll('span', { "class" : "st" })
descriptions = pd.DataFrame([span.text.encode('ascii', 'replace') for span in spans])
print len(descriptions)
9