# read the iris data into a DataFrame
import pandas as pd
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None, names=col_names)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
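Before going further, a quick sanity check that the download worked is useful: the UCI file contains 150 rows, 50 per species. A minimal check (not part of the original output):
# sanity-check the loaded data: 150 rows, 4 measurements plus species, 50 of each species
print(iris.shape)
print(iris.species.value_counts())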
How did we (as humans) predict the species of an iris from its measurements? More generally, how could that intuition, that flowers with similar measurements tend to belong to the same species, be turned into a procedure a machine can follow? The plots below show which measurements actually separate the species.
# allow plots to appear in the notebook
%matplotlib inline
# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold)
[scatter plot of petal_length vs. petal_width, colored by species]
# create a scatter plot of SEPAL LENGTH versus SEPAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='sepal_length', y='sepal_width', c='species_num', colormap=cmap_bold)
[scatter plot of sepal_length vs. sepal_width, colored by species]
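One way to make the human approach concrete is a hand-coded rule with thresholds read off the petal plot. The cutoffs below are eyeballed assumptions rather than tuned values, and manual_predict is just a helper introduced here, so treat this as a sketch of the intuition rather than a serious classifier:
# a rough hand-coded rule based on the petal plot (thresholds are eyeballed, not tuned)
def manual_predict(row):
    if row['petal_length'] < 2.5:
        return 'Iris-setosa'
    elif row['petal_width'] < 1.7:
        return 'Iris-versicolor'
    else:
        return 'Iris-virginica'
# fraction of flowers the hand-coded rule gets right
print((iris.apply(manual_predict, axis=1) == iris.species).mean())
KNN automates exactly this kind of reasoning: instead of hand-picking thresholds, it predicts whichever species is most common among the K most similar flowers in the training data.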
Question: For K-nearest neighbors (KNN), what's the "best" value for K in this case?
Answer: The value that produces the most accurate predictions on unseen data. We want to create a model that generalizes! (A sketch of how to search for such a value appears near the end of this section.)
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species | species_num |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa | 0 |
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# alternative ways to create "X"
X = iris.drop(['species', 'species_num'], axis=1)
X = iris.loc[:, 'sepal_length':'petal_width']
X = iris.iloc[:, 0:4]
# store response vector in "y"
y = iris.species_num
# check X's type
print type(X)
print type(X.values)
<class 'pandas.core.frame.DataFrame'>
<type 'numpy.ndarray'>
# check y's type
print type(y)
print type(y.values)
<class 'pandas.core.series.Series'>
<type 'numpy.ndarray'>
# check X's shape (n = number of observations, p = number of features)
print X.shape
(150, 4)
# check y's shape (single dimension with length n)
print y.shape
(150L,)
Step 1: Import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
Step 2: "Instantiate" the "estimator"
knn = KNeighborsClassifier(n_neighbors=1)
print knn
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=1, p=2, weights='uniform')
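Only n_neighbors was set explicitly; everything else shown above is a scikit-learn default, including Euclidean distance (Minkowski with p=2) and uniform neighbor weights.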
Step 3: Fit the model with data (aka "model training")
knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_neighbors=1, p=2, weights='uniform')
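The fit method returns the estimator itself, which is why the same representation is displayed again.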
Step 4: Predict the response for a new observation
knn.predict([[3, 5, 4, 2]])
array([2], dtype=int64)
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
array([2, 1], dtype=int64)
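The classifier returns the numeric codes created earlier. As an optional sketch, they can be translated back into species names; the num_to_species dictionary below is just a helper that inverts the mapping defined above:
# translate numeric predictions back into species names
num_to_species = {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}
print([num_to_species[p] for p in knn.predict(X_new)])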
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)
# fit the model with data
knn.fit(X, y)
# predict the response for new observations
knn.predict(X_new)
array([1, 1], dtype=int64)
# calculate predicted probabilities of class membership
knn.predict_proba(X_new)
array([[ 0. ,  0.8,  0.2],
       [ 0. ,  1. ,  0. ]])
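With n_neighbors=5 and the default uniform weights, these probabilities are simply the fraction of the 5 nearest neighbors belonging to each class: for the first observation, 4 of its 5 neighbors are Iris-versicolor and 1 is Iris-virginica.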
# print distances to nearest neighbors (and their identities)
knn.kneighbors([[3, 5, 4, 2]])
(array([[ 3.19374388,  3.20312348,  3.24037035,  3.35559235,  3.35559235]]),
 array([[106,  84,  59,  88,  66]], dtype=int64))
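To answer the earlier question about the "best" K empirically, hold out part of the data and compare accuracy across candidate values. The sketch below makes arbitrary choices for the split size, random_state, and candidate K values, and it assumes a recent scikit-learn (in older versions train_test_split lives in sklearn.cross_validation rather than sklearn.model_selection):
# evaluate several values of K on a held-out test set
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)
for k in [1, 5, 15, 25]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print('K=%s accuracy=%.3f' % (k, accuracy_score(y_test, knn_k.predict(X_test))))
A more thorough version would use cross-validation rather than a single split, so the result does not depend on which rows happen to land in the test set.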
Advantages of KNN:
- Simple to understand and explain
- Model training is fast (the model essentially just stores the training data)
- Can be used for classification and regression

Disadvantages of KNN:
- Must store the entire training set and search it at prediction time, so prediction can be slow when n is large
- Sensitive to irrelevant features and to the scale of the data
- The value of K has to be chosen carefully
- Accuracy is generally not competitive with the best supervised learning methods
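Because KNN relies on distances, a common follow-up is to standardize the features before fitting. The iris measurements are all in centimetres and span similar ranges, so it changes little here, but the pattern matters when features live on very different scales. A minimal sketch using scikit-learn's StandardScaler and make_pipeline:
# put features on a common scale before computing distances
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X, y)
print(scaled_knn.predict(X_new))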