#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.

# # 8.1. Getting started with scikit-learn

# We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to these data points. With this function, we can then predict values at new data points. This is a curve-fitting regression problem.

# 1. First, let's make all the necessary imports.

# In[ ]:

import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# 2. We now define the deterministic function underlying our generative model.

# In[ ]:

f = lambda x: np.exp(3 * x)

# 3. We generate the values along the curve on $[0, 2]$.

# In[ ]:

x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)

# 4. Now, let's generate our data points within $[0, 1]$. We use the function $f$ and add some Gaussian noise. So that the effects we want to demonstrate are reproducible, we use one particular set of data points generated in this fashion.

# In[ ]:

x = np.array([0, .1, .2, .5, .8, .9, 1])
# y = f(x) + np.random.randn(len(x))
y = np.array([0.59837698, 2.90450025, 4.73684354, 3.87158063,
              11.77734608, 15.51112358, 20.08663964])

# 5. Let's plot our data points on $[0, 1]$.

# In[ ]:

plt.figure(figsize=(6, 3));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x, y, 'ok', ms=10);

# 6. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the `LinearRegression` class). Then, we fit the model to our data. Finally, we predict values from our trained model.

# In[ ]:

# We create the model.
lr = lm.LinearRegression()
# We train the model on our training dataset.
lr.fit(x[:, np.newaxis], y);
# Now, we predict points with our trained model.
y_lr = lr.predict(x_tr[:, np.newaxis])

# We need to convert `x` and `x_tr` to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have 7 observations with 1 feature.

# 7. We now plot the result of the trained linear model. We obtain a regression line, shown in green here.

# In[ ]:

plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min() - 1, y.max() + 1);
plt.title("Linear regression");

# 8. The linear fit is not appropriate here, since the data points were generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a non-linear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for this, by precomputing the successive powers of our data points. This is done by generating a Vandermonde matrix with the `np.vander` function (illustrated briefly after the next cell). We will explain this trick in more detail in *How it works...*.

# In[ ]:

lrp = lm.LinearRegression()
plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    lrp.fit(np.vander(x, deg + 1), y);
    y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.4);
    plt.ylim(-10, 40);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in lrp.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Linear regression");
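# As a quick illustrative aside (a minimal sketch that only reuses the `x` array defined above): each row of the Vandermonde matrix contains the successive powers of one observation, so the "polynomial" fit above is still an ordinary linear regression, just computed on these expanded feature columns. In recent versions of scikit-learn, `sklearn.preprocessing.PolynomialFeatures` provides a similar expansion, with the columns in increasing order of degree.

# In[ ]:

# The degree 2 expansion of our seven data points: the three columns
# contain x**2, x**1, and x**0 for each observation, so fitting a
# degree 2 polynomial amounts to a linear regression with 3 features.
print(np.vander(x, 2 + 1))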
# We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial is really bad at predicting values outside the data points (look, for example, at the portion $x \geq 1$). This is what we call **overfitting**: by using a model that is too complex, we obtain a better fit on the training dataset, but a less robust model outside this dataset.

# 9. We will now use a different learning model, called **ridge regression**. It works like linear regression, except that it prevents the polynomial's coefficients from exploding (which is what happened in the overfitting example above). By adding a **regularization term** to the **loss function**, ridge regression imposes some structure on the underlying model. We will see more details in the next section.

# The ridge regression model has a hyperparameter that represents the weight of the regularization term. We could try different values by trial and error, using the `Ridge` class. However, scikit-learn includes another model called `RidgeCV`, which performs this parameter search automatically with cross-validation. In practice, this means that we don't have to tweak the parameter by hand: scikit-learn does it for us. Since the models of scikit-learn always follow the `fit`-`predict` API, all we have to do is replace `lm.LinearRegression` by `lm.RidgeCV` in the code above. We will give more details in the next section.

# In[ ]:

ridge = lm.RidgeCV()
plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    ridge.fit(np.vander(x, deg + 1), y);
    y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.5);
    plt.ylim(-5, 80);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in ridge.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");

# This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes **underfitting**). Ridge regression reduces the overfitting issue here.

# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).
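# As an optional closing check (a minimal sketch reusing the `x` and `y` arrays defined above, with the default grid of regularization weights): after fitting, `RidgeCV` exposes the selected weight in its `alpha_` attribute, and the plain `Ridge` class lets us set this weight by hand to see how stronger regularization shrinks the polynomial's coefficients.

# In[ ]:

# Refit the degree 5 model and inspect the regularization weight
# chosen by cross-validation.
ridge5 = lm.RidgeCV()
ridge5.fit(np.vander(x, 5 + 1), y);
print("alpha selected by cross-validation: %s" % ridge5.alpha_)

# Compare a few regularization strengths by hand: larger alpha values
# shrink the coefficients more strongly.
for alpha in [0.01, 1., 100.]:
    r = lm.Ridge(alpha=alpha)
    r.fit(np.vander(x, 5 + 1), y);
    print("alpha=%g: %s" % (alpha, ' '.join('%.2f' % c for c in r.coef_)))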