As an example, we will use the Iris flower data set, introduced by Sir Ronald Fisher.
from sklearn import datasets
iris = datasets.load_iris()
print(iris.DESCR)
Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A. Fisher.

This is perhaps the best known database to be found in the pattern
recognition literature. Fisher's paper is a classic in the field and is
referenced frequently to this day. (See Duda & Hart, for example.) The data
set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al's AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
from IPython.display import Image
Image(filename='Iris_setosa.jpg') # from Wikipedia
Image(filename='Iris_versicolor.jpg')
Image(filename='Iris_virginica.jpg')
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Image(filename="Petal-sepal.jpg")
iris.data[:3]
array([[ 5.1, 3.5, 1.4, 0.2], [ 4.9, 3. , 1.4, 0.2], [ 4.7, 3.2, 1.3, 0.2]])
iris.target[:3]
array([0, 0, 0])
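Since the targets are integer class codes, indexing target_names with them recovers the species labels; a quick check:
iris.target_names[iris.target[:3]]
array(['setosa', 'setosa', 'setosa'], dtype='<U10')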
import pylab  # matplotlib's pylab interface
colors = ('r', '#66ff66', '#4444ff')
X = iris.data
Y = iris.target
i, j = 2, 3  # petal length vs petal width
pylab.figure(figsize=(10, 10))
pylab.xlabel(iris.feature_names[i], fontsize=20)
pylab.ylabel(iris.feature_names[j], fontsize=20)
pylab.scatter(X[:, i], X[:, j], c=[colors[y] for y in Y], s=50)
[Figure: scatter plot of petal length vs. petal width, points coloured by species.]
import numpy as np
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 0], axis=0)
array([ 5.006, 3.418, 1.464, 0.244])
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 1], axis=0)
array([ 5.936, 2.77 , 4.26 , 1.326])
np.mean([x for i, x in enumerate(iris.data)
         if iris.target[i] == 2], axis=0)
array([ 6.588, 2.974, 5.552, 2.026])
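The same per-class means fall out of NumPy boolean indexing in one line per class; a minimal equivalent of the three calls above (using the np alias imported there):
for c, name in enumerate(iris.target_names):
    # select the rows of one class and average each column
    print(name, iris.data[iris.target == c].mean(axis=0))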
In this example we use the k-nearest neighbours classifier: a point is assigned the class that wins a vote among its k nearest training points. Many other popular classifiers are available in scikit-learn behind the same fit/predict interface.
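For intuition, here is a from-scratch sketch of that voting rule (illustrative only; knn_predict is a hypothetical helper, not how scikit-learn implements it):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distances from the query point x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # labels of the k nearest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # majority vote among the neighbours
    return Counter(nearest).most_common(1)[0][0]
On the training subsample used below it should agree with scikit-learn's predictions.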
from sklearn import neighbors
knc = neighbors.KNeighborsClassifier(n_neighbors=5)
step = 5
knc.fit(X[::step], Y[::step])  # train on every 5th sample: 30 of the 150 points
KNeighborsClassifier()
knc.predict([[5.93, 2.77, 4.23, 1.3]])  # note the nested list: one row per sample
array([1])
knc.predict_proba([[5.93, 2.77, 4.23, 1.3]])
array([[ 0., 1., 0.]])
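These probabilities are simply neighbour vote fractions. As a sanity check, kneighbors (part of the KNeighborsClassifier API) returns the distances and indices of the consulted training points; the indices refer to the fitted subsample, so:
dist, ind = knc.kneighbors([[5.93, 2.77, 4.23, 1.3]])
Y[::step][ind]  # expected: array([[1, 1, 1, 1, 1]]), matching the probability of 1.0 for class 1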
Y_pred = knc.predict(X)
i, j = 2, 3
X_ok = np.array([x for k, x in enumerate(X) if Y[k] == Y_pred[k]])   # correctly classified
Y_ok = [y for k, y in enumerate(Y) if y == Y_pred[k]]
X_bad = np.array([x for k, x in enumerate(X) if Y[k] != Y_pred[k]])  # misclassified
pylab.figure(figsize=(10, 10))
pylab.xlabel(iris.feature_names[i], fontsize=20)
pylab.ylabel(iris.feature_names[j], fontsize=20)
pylab.scatter(X_ok[:, i], X_ok[:, j], c=[colors[y] for y in Y_ok], s=50)
pylab.scatter(X[::step, i], X[::step, j], c="w", marker="s", s=15)
pylab.scatter(X_bad[:, i], X_bad[:, j], marker="x", s=50)
[Figure: the same scatter, with training points overlaid as white squares and misclassified points marked with crosses.]
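To put a number on the crosses in the plot, count the disagreements directly (a quick check; both expressions count the same misclassified points):
(Y != Y_pred).sum(), len(X_bad)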
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer
cross_val = KFold(n_splits=10, shuffle=True)
def avg_miss(Y_true, Y_pred):
    # fraction of misclassified samples in a fold
    return (Y_true != Y_pred).sum() / len(Y_true)
cross_val_score(knc, X, Y,
                cv=cross_val, scoring=make_scorer(avg_miss))
array([ 0.06666667, 0.06666667, 0. , 0.06666667, 0.06666667, 0. , 0. , 0.06666667, 0. , 0. ])
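Averaging the per-fold rates gives one summary figure; for the folds shown above it works out to about 0.033, i.e. roughly 3% misclassified (KFold with shuffle=True redraws the folds on each run, so exact values vary):
scores = cross_val_score(knc, X, Y,
                         cv=cross_val, scoring=make_scorer(avg_miss))
scores.mean()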
Also: we are gathering programmers who want to help upgrade academia, science, and gifted education.