#!/usr/bin/env python
# coding: utf-8

# > This is one of the 100 recipes of the [IPython Cookbook](http://ipython-books.github.io/), the definitive guide to high-performance scientific computing and data science in Python.

# # 8.1. Getting started with scikit-learn

# We will generate one-dimensional data with a simple model (including some noise), and we will try to fit a function to these data points. With this function, we can then predict values at new data points. This is a curve-fitting regression problem.

# 1. First, let's make all the necessary imports.

# In[ ]:

import numpy as np
import scipy.stats as st
import sklearn.linear_model as lm
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# 2. We now define the deterministic function underlying our generative model.

# In[ ]:

f = lambda x: np.exp(3 * x)

# 3. We generate the values along the curve on $[0, 2]$.

# In[ ]:

x_tr = np.linspace(0., 2, 200)
y_tr = f(x_tr)

# 4. Now, let's generate our data points within $[0, 1]$. We use the function $f$ and add some Gaussian noise. So that the effects we want to demonstrate are reproducible, we use one particular set of data points generated in this fashion.

# In[ ]:

x = np.array([0, .1, .2, .5, .8, .9, 1])
# y = f(x) + np.random.randn(len(x))
y = np.array([0.59837698, 2.90450025, 4.73684354, 3.87158063,
              11.77734608, 15.51112358, 20.08663964])

# 5. Let's plot our data points on $[0, 1]$.

# In[ ]:

plt.figure(figsize=(6, 3));
plt.plot(x_tr[:100], y_tr[:100], '--k');
plt.plot(x, y, 'ok', ms=10);

# 6. Now, we use scikit-learn to fit a linear model to the data. There are three steps. First, we create the model (an instance of the `LinearRegression` class). Then, we fit the model to our data. Finally, we predict values from our trained model.

# In[ ]:

# We create the model.
lr = lm.LinearRegression()
# We train the model on our training dataset.
lr.fit(x[:, np.newaxis], y);
# Now, we predict points with our trained model.
y_lr = lr.predict(x_tr[:, np.newaxis])

# We need to convert `x` and `x_tr` to column vectors, as it is a general convention in scikit-learn that observations are rows, while features are columns. Here, we have 7 observations with 1 feature.

# 7. We now plot the result of the trained linear model. We obtain a regression line, shown in green here.

# In[ ]:

plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');
plt.plot(x_tr, y_lr, 'g');
plt.plot(x, y, 'ok', ms=10);
plt.xlim(0, 1);
plt.ylim(y.min() - 1, y.max() + 1);
plt.title("Linear regression");

# 8. The linear fit is not appropriate here, since the data points were generated according to a non-linear model (an exponential curve). Therefore, we are now going to fit a non-linear model. More precisely, we will fit a polynomial function to our data points. We can still use linear regression for this, by precomputing the successive powers of our data points. This is done by generating a Vandermonde matrix with the `np.vander` function (illustrated briefly after the next cell). We will explain this trick in more detail in *How it works...*.

# In[ ]:

lrp = lm.LinearRegression()
plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    lrp.fit(np.vander(x, deg + 1), y);
    y_lrp = lrp.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_lrp, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.4);
    plt.ylim(-10, 40);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in lrp.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Linear regression");
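# As a quick illustrative aside (a minimal sketch that only reuses the `x` array defined above): each row of the Vandermonde matrix contains the successive powers of one observation, so the "polynomial" fit above is still an ordinary linear regression, just computed on these expanded feature columns. In recent versions of scikit-learn, `sklearn.preprocessing.PolynomialFeatures` provides a similar expansion, with the columns in increasing order of degree.

# In[ ]:

# The degree 2 expansion of our seven data points: the three columns
# contain x**2, x**1, and x**0 for each observation, so fitting a
# degree 2 polynomial amounts to a linear regression with 3 features.
print(np.vander(x, 2 + 1))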
# We have fitted two polynomial models of degree 2 and 5. The degree 2 polynomial fits the data points less precisely than the degree 5 polynomial. However, it seems more robust: the degree 5 polynomial is really bad at predicting values outside the data points (look, for example, at the portion $x \geq 1$). This is what we call **overfitting**: by using a model that is too complex, we obtain a better fit on the training dataset, but a less robust model outside this dataset.

# 9. We will now use a different learning model, called **ridge regression**. It works like linear regression, except that it prevents the polynomial's coefficients from exploding (which is what happened in the overfitting example above). By adding a **regularization term** to the **loss function**, ridge regression imposes some structure on the underlying model. We will see more details in the next section.

# The ridge regression model has a hyperparameter that represents the weight of the regularization term. We could try different values by trial and error, using the `Ridge` class. However, scikit-learn includes another model called `RidgeCV`, which performs this parameter search automatically with cross-validation. In practice, this means that we don't have to tweak the parameter by hand: scikit-learn does it for us. Since the models of scikit-learn always follow the `fit`-`predict` API, all we have to do is replace `lm.LinearRegression` by `lm.RidgeCV` in the code above. We will give more details in the next section.

# In[ ]:

ridge = lm.RidgeCV()
plt.figure(figsize=(6, 3));
plt.plot(x_tr, y_tr, '--k');

for deg, s in zip([2, 5], ['-', '.']):
    ridge.fit(np.vander(x, deg + 1), y);
    y_ridge = ridge.predict(np.vander(x_tr, deg + 1))
    plt.plot(x_tr, y_ridge, s, label='degree ' + str(deg));
    plt.legend(loc=2);
    plt.xlim(0, 1.5);
    plt.ylim(-5, 80);
    # Print the model's coefficients.
    print(' '.join(['%.2f' % c for c in ridge.coef_]))

plt.plot(x, y, 'ok', ms=10);
plt.title("Ridge regression");

# This time, the degree 5 polynomial seems better than the simpler degree 2 polynomial (which now causes **underfitting**). Ridge regression reduces the overfitting issue here.

# > You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

# > [IPython Cookbook](http://ipython-books.github.io/), by [Cyrille Rossant](http://cyrille.rossant.net), Packt Publishing, 2014 (500 pages).
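# As an optional closing check (a minimal sketch reusing the `x` and `y` arrays defined above, with the default grid of regularization weights): after fitting, `RidgeCV` exposes the selected weight in its `alpha_` attribute, and the plain `Ridge` class lets us set this weight by hand to see how stronger regularization shrinks the polynomial's coefficients.

# In[ ]:

# Refit the degree 5 model and inspect the regularization weight
# chosen by cross-validation.
ridge5 = lm.RidgeCV()
ridge5.fit(np.vander(x, 5 + 1), y);
print("alpha selected by cross-validation: %s" % ridge5.alpha_)

# Compare a few regularization strengths by hand: larger alpha values
# shrink the coefficients more strongly.
for alpha in [0.01, 1., 100.]:
    r = lm.Ridge(alpha=alpha)
    r.fit(np.vander(x, 5 + 1), y);
    print("alpha=%g: %s" % (alpha, ' '.join('%.2f' % c for c in r.coef_)))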