Here is where we start diving into the field of machine learning.
By the end of this section you will
In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks.
We saw before the basic definition of Machine Learning:
Machine Learning (ML) is about building programs with tunable parameters (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
In most ML applications, the data is in a 2D array of shape [n_samples x n_features], where the number of features is the same for each object, and each feature column refers to a related piece of information about each sample.
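As a minimal sketch of this layout, the iris dataset bundled with scikit-learn can be loaded and inspected. The choice of dataset here is an assumption made for illustration; any dataset with the same 2D structure would do.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # 2D array: one row per flower, one column per measurement
y = iris.target    # 1D array: one label per row of X

n_samples, n_features = X.shape
print(X.shape)
print(iris.feature_names)
```

Each row of X is one sample and each column is one feature, so every sample is described by the same number of values.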
Machine learning can be broken into two broad regimes: supervised learning and unsupervised learning. We’ll introduce these concepts here, and discuss them in more detail below.
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of iris given a set of measurements of its flower. Some more complicated examples are:
What these tasks have in common is that there are one or more unknown quantities associated with the object which need to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label is from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
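The distinction can be sketched in code. The example below is an illustration under stated assumptions, not part of the original material: it uses the iris data for classification and a made-up synthetic dataset (generated from y = 2x + 1 plus noise) for regression, with LogisticRegression and LinearRegression chosen arbitrarily as the estimators.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: the label is one of three discrete species ids (0, 1, 2).
iris = load_iris()
clf = LogisticRegression(max_iter=1000)
clf.fit(iris.data, iris.target)
species = clf.predict(iris.data[:3])

# Regression: the label is a continuous quantity.
# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=50)
reg = LinearRegression().fit(X, y)
value = reg.predict([[4.0]])

print(species)   # integer class labels
print(value)     # a floating point estimate
```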
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed above, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we’ll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
Sometimes the two may even be combined: for example, unsupervised learning can be used to find useful features in heterogeneous data, and then these features can be used within a supervised framework.
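A minimal sketch of both ideas, assuming the iris data: PCA (one possible dimensionality reduction method) projects the four measurements onto two dimensions without using the labels, and the same projection is then chained with a classifier. The choice of PCA, KNeighborsClassifier and make_pipeline is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

iris = load_iris()
X, y = iris.data, iris.target

# Unsupervised step: project the 4D measurements onto 2 dimensions.
# The labels y are not used here.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)

# Combined: the unsupervised projection feeds a supervised classifier.
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))  # accuracy on the training data, for illustration only
```

The score is computed on the training data, so it overstates how well the model would do on unseen flowers.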
In scikit-learn, almost all operations are done through an estimator object.
For example, a linear regression estimator can be instantiated as follows:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
print(model)
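Once instantiated, an estimator of this kind is typically trained on data and then applied to new samples. A minimal sketch, assuming a tiny made-up dataset that follows y = 2x + 1 exactly (the arrays X, y and X_new are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # shape [n_samples x n_features]
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)                     # learn the slope and intercept from the data

X_new = np.array([[4.0]])
y_new = model.predict(X_new)        # predicted label for the new sample

print(model.coef_, model.intercept_)
print(y_new)
```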
Scikit-learn strives to have a uniform interface across all methods, and we’ll see examples of these below. Given a scikit-learn estimator object named model, the following methods are available:
model.fit(): fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
model.predict(): given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
model.predict_proba(): for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
model.score(): for classification or regression problems, most estimators implement a score method. For classifiers the default score is the mean accuracy, which lies between 0 and 1; for regressors it is the R^2 coefficient of determination, which is at most 1 and can be negative for a very poor fit. In both cases a larger score indicates a better fit.
model.transform(): given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
model.fit_transform(): some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
%pylab inline
from figures import plot_supervised_chart, plot_unsupervised_chart
plot_supervised_chart(annotate=False)
plot_supervised_chart(annotate=True)
plot_unsupervised_chart()
(Aside: these charts are generated in matplotlib. You can see the code using the %load magic)
%load figures/ML_flow_chart.py
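The supervised and unsupervised workflows can also be sketched directly with the estimator methods listed earlier. The example below is an illustration under stated assumptions: it uses the iris data, with LogisticRegression and PCA picked arbitrarily as the supervised and unsupervised estimators, and X_new is simply a handful of rows taken from the training data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target
X_new = X[::30]                      # five samples standing in for new data

# Supervised workflow: fit on (X, y), then predict labels for X_new.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
labels = clf.predict(X_new)
proba = clf.predict_proba(X_new)     # one probability per class, per sample
accuracy = clf.score(X, y)           # mean accuracy on the training data

# Unsupervised workflow: fit on X alone, then transform X_new.
pca = PCA(n_components=2)
pca.fit(X)
X_new_2d = pca.transform(X_new)

print(labels)
print(np.round(proba, 2))
print(X_new_2d.shape)
```

For this classifier, the label returned by predict is the class with the largest entry in the corresponding row of predict_proba.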
Recall that previously, we have seen two types of features:
How might we handle other types of features?
Sometimes we have categorical features: for example, imagine the dataset included the colors:
color in [red, blue, purple]
Often it is best for categorical features to have their own dimensions:
The enriched iris feature set would hence be in this case:
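One way such an encoding might be produced is with scikit-learn's DictVectorizer, which gives each category its own 0/1 column. This is a sketch for illustration only: the three samples, their values, and the feature name sepal_length are made up here and are not taken from the original dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples: one numeric measurement plus a categorical color.
samples = [
    {"sepal_length": 5.1, "color": "red"},
    {"sepal_length": 6.2, "color": "blue"},
    {"sepal_length": 5.9, "color": "purple"},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)

# Each color becomes its own column; numeric features pass through unchanged.
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
print(X)
```

Encoding the colors as separate columns avoids implying an ordering between them, which a single integer-coded column would.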
Most often, data does not come in a nice, structured CSV file where every column measures the same thing. In this case, we must be more imaginative in how we extract features.
Here is an overview of strategies to turn unstructured data items into arrays of numerical features.
Note: we include other file formats such as HTML and PDF in this category; an ad-hoc preprocessing step is required to extract the plain text, for instance in UTF-8 encoding.
For a tutorial on text processing in scikit-learn, see http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html
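As one possible illustration of turning text into numerical features, the sketch below counts word occurrences with scikit-learn's CountVectorizer. The two documents are made up, and the choice of a simple word-count representation is an assumption rather than a strategy prescribed by this section.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, already available as plain text.
docs = [
    "the iris is a flower",
    "the galaxy is not a flower",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse matrix [n_documents x n_words]

# The default tokenizer drops single-character tokens such as "a".
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Each row is one document and each column counts one word of the vocabulary, which matches the [n_samples x n_features] layout used throughout.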
Rescale the picture to a fixed size and take all the raw pixel values (with or without luminosity normalization)
Take some transformation of the signal (gradients in each pixel, wavelet transforms...)
Compute the Euclidean or Manhattan distance, or the cosine similarity, of the sample to a set of reference prototype images arranged in a code book. The code book may have been previously extracted from the same dataset using an unsupervised learning algorithm on the raw pixel signal. Each feature value is the distance to one element of the code book.
Perform local feature extraction: split the picture into small regions and perform feature extraction locally in each area, then combine all the features of the individual areas into a single array.
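The raw-pixel and code book strategies can be sketched as follows. This is an illustration under stated assumptions: it uses the digits dataset bundled with scikit-learn (8x8 images, so no rescaling is needed), and KMeans with 10 clusters stands in for the unsupervised algorithm that learns the code book.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Raw pixel features: flatten each 8x8 image into a row of 64 values.
X_raw = digits.images.reshape(len(digits.images), -1)

# Code book: 10 prototype images learned from the raw pixels.
codebook = KMeans(n_clusters=10, n_init=10, random_state=0)
codebook.fit(X_raw)

# Each new feature is the Euclidean distance to one prototype.
X_codes = codebook.transform(X_raw)

print(X_raw.shape)
print(X_codes.shape)
```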
Same type of strategies as for images; the difference is that the signal lives in a 1D rather than a 2D space.
For more information on feature extraction in scikit-learn, see http://scikit-learn.org/stable/modules/feature_extraction.html
In the next couple of notebooks, we will explore some simple examples of classification, regression, dimensionality reduction, and clustering using the datasets we've seen.