#!/usr/bin/env python
# coding: utf-8
# This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/).
# # Introduction to Scikit-Learn: Machine Learning with Python
#
# This session will cover the basics of Scikit-Learn, a popular package containing a collection of tools for machine learning written in Python. See more at http://scikit-learn.org.
# ## Outline
#
# **Main Goal:** To introduce the central concepts of machine learning, and how they can be applied in Python using the Scikit-learn Package.
#
# - Definition of machine learning
# - Data representation in scikit-learn
# - Introduction to the Scikit-learn API
# ## About Scikit-Learn
#
# [Scikit-Learn](http://github.com/scikit-learn/scikit-learn) is a Python package designed to give access to **well-known** machine learning algorithms within Python code, through a **clean, well-thought-out API**. It has been built by hundreds of contributors from around the world, and is used across industry and academia.
#
# Scikit-Learn is built upon Python's [NumPy (Numerical Python)](http://numpy.org) and [SciPy (Scientific Python)](http://scipy.org) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, scikit-learn is not specifically designed for extremely large datasets, though there is [some work](https://github.com/ogrisel/parallel_ml_tutorial) in this area.
#
# For this short introduction, I'm going to stick to questions of in-core processing of small to medium datasets with Scikit-learn.
# ## What is Machine Learning?
#
# In this section we will begin to explore the basic principles of machine learning.
# Machine Learning is about building programs with **tunable parameters** (typically an
# array of floating point values) that are adjusted automatically so as to improve
# their behavior by **adapting to previously seen data.**
#
# Machine Learning can be considered a subfield of **Artificial Intelligence** since those
# algorithms can be seen as building blocks to make computers learn to behave more
# intelligently by somehow **generalizing** rather than just storing and retrieving data items
# like a database system would do.
#
# We'll take a look at two very simple machine learning tasks here.
# The first is a **classification** task: the figure shows a
# collection of two-dimensional data, colored according to two different class
# labels. A classification algorithm may be used to draw a dividing boundary
# between the two clusters of points:
# In[ ]:
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
# the 'seaborn' style was renamed in matplotlib 3.6; fall back for newer releases
try:
    plt.style.use('seaborn')
except OSError:
    plt.style.use('seaborn-v0_8')
# In[ ]:
# Import the example plot from the figures directory
from fig_code import plot_sgd_separator
plot_sgd_separator()
# This may seem like a trivial task, but it is a simple version of a very important concept.
# By drawing this separating line, we have learned a model which can **generalize** to new
# data: if you were to drop another point onto the plane which is unlabeled, this algorithm
# could now **predict** whether it's a blue or a red point.
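#
# Below is a minimal sketch of the same workflow written directly against the
# scikit-learn API, using synthetic data from ``make_blobs`` rather than the
# figure's exact dataset: fit a linear classifier, then predict the class of a
# new, unlabeled point.
# In[ ]:
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

# two synthetic two-dimensional clusters, one per class
X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

# fit a linear classifier of the same family as the one behind the figure
clf = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200)
clf.fit(X, y)

# predict the label of a previously unseen point
print(clf.predict([[3.0, 3.0]]))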
#
# If you'd like to see the source code used to generate this, you can either open the
# code in the `figures` directory, or you can load the code using the `%load` magic command:
# The next simple task we'll look at is a **regression** task: a simple best-fit line
# to a set of data:
# In[ ]:
from fig_code import plot_linear_regression
plot_linear_regression()
# Again, this is an example of fitting a model to data, such that the model can make
# generalizations about new data. The model has been **learned** from the training
# data, and can be used to predict the result of test data:
# here, we might be given an x-value, and the model would
# allow us to predict the y value. Again, this might seem like a trivial problem,
# but it is a basic example of a type of operation that is fundamental to
# machine learning tasks.
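#
# A minimal sketch of that regression workflow, with made-up noisy data standing
# in for the figure's dataset:
# In[ ]:
import numpy as np
from sklearn.linear_model import LinearRegression

# noisy samples scattered around the line y = 2x + 1
rng = np.random.RandomState(0)
x = 10 * rng.rand(30)
y = 2 * x + 1 + rng.randn(30)

# scikit-learn expects a 2D feature array, hence the reshape
model = LinearRegression()
model.fit(x[:, np.newaxis], y)

# given a new x-value, predict the corresponding y
print(model.predict([[5.0]]))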
# ## Representation of Data in Scikit-learn
#
# Machine learning is about creating models from data: for that reason, we'll start by
# discussing how data can be represented in order to be understood by the computer. Along
# with this, we'll build on our matplotlib examples from the previous section and show some
# examples of how to visualize data.
# Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
# **two-dimensional array or matrix**. The arrays can be
# either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.
# The size of the array is expected to be `[n_samples, n_features]`:
#
# - **n_samples:** The number of samples: each sample is an item to process (e.g. classify).
# A sample can be a document, a picture, a sound, a video, an astronomical object,
# a row in a database or CSV file,
# or whatever you can describe with a fixed set of quantitative traits.
# - **n_features:** The number of features or distinct traits that can be used to describe each
# item in a quantitative manner. Features are generally real-valued, but may be boolean or
# discrete-valued in some cases.
#
# The number of features must be fixed in advance. However it can be very high dimensional
# (e.g. millions of features) with most of them being zeros for a given sample. This is a case
# where `scipy.sparse` matrices can be useful, in that they are
# much more memory-efficient than numpy arrays.
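#
# As a small concrete illustration (the numbers below are arbitrary), here is a
# dataset of 3 samples with 4 features each, stored first densely and then as a
# sparse matrix:
# In[ ]:
import numpy as np
from scipy import sparse

# a dense feature matrix: 3 samples x 4 features
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.2, 3.4, 5.4, 2.3]])
print(X.shape)  # (n_samples, n_features)

# the same data in compressed sparse row format; only nonzero
# entries are stored, which pays off when most values are zero
X_sparse = sparse.csr_matrix(X)
print(X_sparse.shape)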
# ![Data Layout](images/data-layout.png)
#
# (Figure from the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook))
# ## A Simple Example: the Iris Dataset
#
# As an example of a simple dataset, we're going to take a look at the
# iris data stored by scikit-learn.
# The data consists of measurements of flowers from three different iris species,
# which we can picture here:
# In[ ]:
from IPython.display import Image, display
display(Image(filename='images/iris_setosa.jpg'))
print("Iris Setosa\n")
display(Image(filename='images/iris_versicolor.jpg'))
print("Iris Versicolor\n")
display(Image(filename='images/iris_virginica.jpg'))
print("Iris Virginica")
# ### Quick Question:
#
# **If we want to design an algorithm to recognize iris species, what might the data be?**
#
# Remember: we need a 2D array of size `[n_samples, n_features]`.
#
# - What would the `n_samples` refer to?
#
# - What might the `n_features` refer to?
#
# Remember that there must be a **fixed** number of features for each sample, and feature
# number ``i`` must be a similar kind of quantity for each sample.
# ### Loading the Iris Data with Scikit-Learn
#
# Scikit-learn has a very straightforward set of data on these iris species. The data consist of
# the following:
#
# - Features in the Iris dataset:
#
# 1. sepal length in cm
# 2. sepal width in cm
# 3. petal length in cm
# 4. petal width in cm
#
# - Target classes to predict:
#
# 1. Iris Setosa
# 2. Iris Versicolour
# 3. Iris Virginica
#
# ``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
# In[ ]:
from sklearn.datasets import load_iris
iris = load_iris()
# In[ ]:
iris.keys()
# In[ ]:
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.data[0])
# In[ ]:
print(iris.data.shape)
print(iris.target.shape)
# In[ ]:
print(iris.target)
# In[ ]:
print(iris.target_names)
# This data is four dimensional, but we can visualize two of the dimensions
# at a time using a simple scatter-plot:
# In[ ]:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

# a discrete three-color map: plt.cm.get_cmap('RdYlBu', 3) was removed in
# matplotlib 3.9; the colormap registry below requires matplotlib >= 3.6
cmap = mpl.colormaps['RdYlBu'].resampled(3)

plt.scatter(iris.data[:, x_index], iris.data[:, y_index],
            c=iris.target, cmap=cmap)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.clim(-0.5, 2.5)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index]);
# ### Quick Exercise:
#
# **Change** `x_index` **and** `y_index` **in the above script
# and find a combination of two parameters
# which maximally separate the three classes.**
#
# This exercise is a preview of **dimensionality reduction**, which we'll see later.
# ## Other Available Data
# Scikit-learn's datasets come in three flavors:
#
# - **Packaged Data:** these small datasets are packaged with the scikit-learn installation,
# and can be loaded using the functions in ``sklearn.datasets.load_*``
# - **Downloadable Data:** these larger datasets are available for download, and scikit-learn
# includes tools which streamline this process. These tools can be found in
# ``sklearn.datasets.fetch_*``
# - **Generated Data:** there are several datasets which are generated from models based on a
# random seed. These are available via the ``sklearn.datasets.make_*`` functions.
#
# You can explore the available dataset loaders, fetchers, and generators using IPython's
# tab-completion functionality. After importing the ``datasets`` submodule from ``sklearn``,
# type
#
# datasets.load_ + TAB
#
# or
#
# datasets.fetch_ + TAB
#
# or
#
# datasets.make_ + TAB
#
# to see a list of available functions.
# In[ ]:
from sklearn import datasets
# In[ ]:
# Type datasets.fetch_ or datasets.load_ in IPython to see all possibilities
# datasets.fetch_
# In[ ]:
# datasets.load_
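#
# As a quick example, one of the packaged loaders (``datasets`` was imported above):
# In[ ]:
# load the packaged handwritten-digits data and inspect its shape
digits = datasets.load_digits()
print(digits.data.shape)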
# In the next section, we'll use some of these datasets and take a look at the basic principles of machine learning.