#!/usr/bin/env python
# coding: utf-8

# # Locality Preserving Projections in Python
#
# ``lpproj`` is a Python implementation of Locality Preserving Projections, built to be compatible with scikit-learn. It can be installed with pip, e.g.
#
# ```
# pip install lpproj
# ```
#
# For more information, see http://github.com/jakevdp/lpproj
#
# This notebook contains a very short example showing the use of the code.

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt


# ## The Data
#
# We'll use scikit-learn to create some data consisting of four blobs in 300 dimensions:

# In[2]:


from sklearn.datasets import make_blobs
X, y = make_blobs(1000, n_features=300, centers=4,
                  cluster_std=8, random_state=42)


# ## Random Projections
#
# If we select a few random two-dimensional projections, we can see that the clusters overlap significantly along any particular "line-of-sight" into the high-dimensional data:

# In[3]:


fig, ax = plt.subplots(2, 2, figsize=(10, 10))
rand = np.random.RandomState(42)
for axi in ax.flat:
    i, j = rand.randint(X.shape[1], size=2)
    axi.scatter(X[:, i], X[:, j], c=y)


# ## Locality Preserving Projection
#
# We can find a projection that preserves the locality of the points using the ``LocalityPreservingProjection`` estimator; here we'll project the data into two dimensions:

# In[4]:


from lpproj import LocalityPreservingProjection
lpp = LocalityPreservingProjection(n_components=2)
X_2D = lpp.fit_transform(X)


# Plotting this projection, we confirm that it has kept nearby points together, as shown by the distinct clusters visible in the projection:

# In[5]:


plt.scatter(X_2D[:, 0], X_2D[:, 1], c=y)
plt.title("Projected from 300->2 dimensions");


# For more information, see the [Locality Preserving Projection website](http://www.cad.zju.edu.cn/home/xiaofeihe/LPP.html)

# ## Comparison with PCA
#
# Of course, there are well-known tools that can do very similar things: for example, a standard Principal Component Analysis projection produces much the same result in this case:

# In[6]:


from sklearn.decomposition import PCA
Xpca = PCA(n_components=2).fit_transform(X)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=y);


# It is important to keep in mind, though, that these are two fundamentally different models: PCA finds the linear projection which *maximizes the preserved variance* in the data, while LPP finds a linear projection which *best preserves the local neighborhood structure* of the data (it minimizes the weighted squared distance between embedded points that are neighbors in the original space).

# ## Where PCA and LPP Differ
# The difference between the two can be made clearer by looking at data with different properties.
# One example is data with outliers.
# Because PCA is a variance-based method, it is strongly affected by the presence of outliers.
# LPP, on the other hand, focuses on preserving local neighborhoods, so outliers do not have as strong an effect.
#
# First we'll add ten outliers to our original data:

# In[7]:


rand = np.random.RandomState(42)
Xnoisy = X.copy()
Xnoisy[:10] += 1000 * rand.randn(10, X.shape[1])


# Now we can compute the PCA and LPP projections and view the results:

# In[8]:


Xpca = PCA(n_components=2).fit_transform(Xnoisy)
Xlpp = LocalityPreservingProjection(n_components=2).fit_transform(Xnoisy)

fig, ax = plt.subplots(1, 2, figsize=(16, 5))
ax[0].scatter(Xlpp[:, 0], Xlpp[:, 1], c=y)
ax[0].set_title('LPP with outliers')
ax[1].scatter(Xpca[:, 0], Xpca[:, 1], c=y)
ax[1].set_title('PCA with outliers');


# In the presence of outliers, the projection found by LPP is much more useful than the projection found by PCA.
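
# As a rough quantitative check of this claim, we can compare how well each embedding preserves neighborhoods.
# Below is a minimal sketch, assuming only scikit-learn utilities (``kneighbors_graph`` and ``pairwise_distances``):
# we build a k-nearest-neighbor graph on the noisy high-dimensional data, then measure the average embedded distance
# between neighboring points relative to the average distance over all pairs of points.
# The ``locality_score`` helper and the ``n_neighbors=5`` choice are illustrative assumptions, not part of ``lpproj``;
# a smaller score would indicate that local neighborhoods are better preserved by that projection.


from sklearn.neighbors import kneighbors_graph
from sklearn.metrics import pairwise_distances

def locality_score(X_high, X_low, n_neighbors=5):
    # Hypothetical helper (not part of lpproj): ratio of the mean embedded
    # distance between high-dimensional neighbors to the mean embedded
    # distance over all pairs of points.
    W = kneighbors_graph(X_high, n_neighbors=n_neighbors, mode='connectivity')
    i, j = W.nonzero()
    neighbor_dist = np.sqrt(((X_low[i] - X_low[j]) ** 2).sum(1)).mean()
    overall_dist = pairwise_distances(X_low).mean()
    return neighbor_dist / overall_dist

print("LPP locality score:", locality_score(Xnoisy, Xlpp))
print("PCA locality score:", locality_score(Xnoisy, Xpca))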