#!/usr/bin/env python
# coding: utf-8

# Stochastic Gradient Descent
# ===========================
#
# *Important:* Please read the [installation page](http://gpeyre.github.io/numerical-tours/installation_python/) for details about how to install the toolboxes.
# $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$
# $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$
# $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$
# $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$
# $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$
# $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$
# $\newcommand{\norm}[1]{\|#1\|}$
# $\newcommand{\abs}[1]{\left|#1\right|}$
# $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$
# $\newcommand{\pa}[1]{\left(#1\right)}$
# $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$
# $\newcommand{\qandq}{\quad\text{and}\quad}$
# $\newcommand{\qwhereq}{\quad\text{where}\quad}$
# $\newcommand{\qifq}{ \quad \text{if} \quad }$
# $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$
# $\newcommand{\ZZ}{\mathbb{Z}}$
# $\newcommand{\CC}{\mathbb{C}}$
# $\newcommand{\RR}{\mathbb{R}}$
# $\newcommand{\EE}{\mathbb{E}}$
# $\newcommand{\Zz}{\mathcal{Z}}$
# $\newcommand{\Ww}{\mathcal{W}}$
# $\newcommand{\Vv}{\mathcal{V}}$
# $\newcommand{\Nn}{\mathcal{N}}$
# $\newcommand{\NN}{\mathcal{N}}$
# $\newcommand{\Hh}{\mathcal{H}}$
# $\newcommand{\Bb}{\mathcal{B}}$
# $\newcommand{\Ee}{\mathcal{E}}$
# $\newcommand{\Cc}{\mathcal{C}}$
# $\newcommand{\Gg}{\mathcal{G}}$
# $\newcommand{\Ss}{\mathcal{S}}$
# $\newcommand{\Pp}{\mathcal{P}}$
# $\newcommand{\Ff}{\mathcal{F}}$
# $\newcommand{\Xx}{\mathcal{X}}$
# $\newcommand{\Mm}{\mathcal{M}}$
# $\newcommand{\Ii}{\mathcal{I}}$
# $\newcommand{\Dd}{\mathcal{D}}$
# $\newcommand{\Ll}{\mathcal{L}}$
# $\newcommand{\Tt}{\mathcal{T}}$
# $\newcommand{\si}{\sigma}$
# $\newcommand{\al}{\alpha}$
# $\newcommand{\la}{\lambda}$
# $\newcommand{\ga}{\gamma}$
# $\newcommand{\Ga}{\Gamma}$
# $\newcommand{\La}{\Lambda}$
# $\newcommand{\Si}{\Sigma}$
# $\newcommand{\be}{\beta}$
# $\newcommand{\de}{\delta}$
# $\newcommand{\De}{\Delta}$
# $\newcommand{\phi}{\varphi}$
# $\newcommand{\th}{\theta}$
# $\newcommand{\om}{\omega}$
# $\newcommand{\Om}{\Omega}$
# $\newcommand{\eqdef}{\equiv}$

# This tour details Stochastic Gradient Descent (SGD), applied to the binary logistic classification problem.
#
# We recommend that after doing this Numerical Tour, you apply it to your
# own data, for instance using a dataset from [LibSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
#
# _Disclaimer:_ these machine learning tours are intended to be
# overly-simplistic implementations and applications of baseline machine learning methods.
# For more advanced uses and implementations, we recommend
# to use a state-of-the-art library, the best known being
# [Scikit-Learn](http://scikit-learn.org/).

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# Useful functions to convert between column and row vectors, and to find non-zero entries.
# In[2]:

# convert to a column vector
def MakeCol(y): return y.reshape(-1,1)
# convert to a row vector
def MakeRow(y): return y.reshape(1,-1)
# find indices of non-zero/true elements
def find(x): return np.nonzero(x)[0]


# Simple Example
# --------------
# We first illustrate the concept of stochastic gradient descent on a
# simple example
# $$ \umin{x \in \RR} f(x) \eqdef f_1(x) + f_2(x) $$
# where $f_1(x) \eqdef \frac{1}{2}(x-1)^2$ and
# $f_2(x) \eqdef \frac{1}{2}(x+1)^2$.
#
#
# Functions and their derivatives.

# In[3]:

def f(x,i):
    if i==1:
        return 1/2*(x-1)**2
    else:
        return 1/2*(x+1)**2
def df(x,i):
    if i==1:
        return x-1
    else:
        return x+1
def F(x): return f(x,1)+f(x,2)


# In[4]:

t = np.linspace(-3,3,201)
plt.plot(t,f(t,1))
plt.plot(t,f(t,2))
plt.plot(t,F(t))
plt.legend(('$f_1$', '$f_2$', '$F$'));


# Each iteration of SGD reads
# $$ x_{\ell+1} = x_\ell - \tau_\ell \nabla f_{i(\ell)}(x_\ell) $$
# where $ i(\ell) \in \{1,2\} $ is drawn uniformly at random.

# __Exercise 0__
#
# Implement SGD, with a random initial condition $x_0$.
# Display several paths (i.e. run the algorithm several times)
# to visualize the evolution of the density of the random variable
# $x_\ell$.

# In[5]:

run -i nt_solutions/ml_4_sgd/exo0


# In[6]:

## Insert your code here.


# Dataset Loading
# ---------------
# We load a subset of the "quantum" dataset, composed of $n=10000$ samples in dimension $78$.
# The goal in this task is to learn a classification rule that differentiates between
# two types of particles generated in high energy collider experiments.
#
#
# Load the dataset, randomly permute it, and separate the features $X$
# from the labels $y$ to predict.

# In[7]:

from scipy import io
name = 'quantum'
U = io.loadmat('nt_toolbox/data/ml-' + name)
A = U['A']
A = A[np.random.permutation(A.shape[0]),:]
X = A[:,0:-1]
y = A[:,-1]


# Set the class indexes to be $\{-1,+1\}$.

# In[8]:

y = 2*y-1
# y must be a column vector
y = MakeCol(y)


# Remove empty features, normalize $X$.

# In[9]:

I = find( np.abs(X).mean(axis=0)>1e-1 )
X = X[:,I]
# normalize
X = X-X.mean(axis=0)
X = X/X.std(axis=0)


# $n$ is the number of samples, $p$ is the dimensionality of the features.

# In[10]:

[n,p] = X.shape
print(n,p)


# Compute PCA main axes for display.

# In[11]:

# economy-size SVD, to avoid forming the full n x n matrix of left singular vectors
U, s, V = np.linalg.svd(X, full_matrices=False)
Xr = X.dot( V.transpose() )
# display the decay of the singular values
plt.plot(s, '.-');


# Plot the classes.

# In[12]:

I = find(y==-1)
J = find(y==+1)
plt.plot(Xr[I[0:200],0], Xr[I[0:200],1], '.')
plt.plot(Xr[J[0:200],0], Xr[J[0:200],1], '.')
plt.axis('equal');


# In[13]:

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(Xr[I[0:200],0], Xr[I[0:200],1], Xr[I[0:200],2], marker='.')
ax.scatter(Xr[J[0:200],0], Xr[J[0:200],1], Xr[J[0:200],2], marker='.')
plt.axis('tight');


# Batch Gradient Descent (BGD)
# ----------------------------
# We first test the usual (batch) gradient descent (BGD) on the problem of
# supervised logistic classification.
#
#
# We refer to the dedicated numerical tour on logistic classification for
# background and more details about the derivation of the energy and its
# gradient.
#
#
# Logistic classification aims at solving the following convex program
# $$ \umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i) $$
# where the logistic loss reads
# $$ L( s,y ) \eqdef \log( 1+\exp(-sy) ). $$
#
#
# Define the energy $E$ and its gradient $\nabla E$.
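# For reference, the gradient used below can be derived directly from the loss: writing
# $\th(v) \eqdef 1/(1+e^{-v})$ for the sigmoid function, one has
# $$ \pd{L(s,y)}{s} = \frac{-y \, e^{-sy}}{1+e^{-sy}} = -y \, \th(-sy), $$
# and hence, by the chain rule applied to $s = Xw$,
# $$ \nabla E(w) = \frac{1}{n} X^\top \pa{ -y \odot \th( -(Xw) \odot y ) }, $$
# where $\odot$ denotes the entry-wise product. This is what the functions
# nablaL and nablaE below implement.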
# In[14]:

def L(s,y): return 1/n * np.sum( np.log( 1 + np.exp(-s*y) ) )
def E(w,X,y): return L(X.dot(w),y)
def theta(v): return 1 / (1+np.exp(-v))
def nablaL(s,y): return - 1/n * y * theta(-s * y)
def nablaE(w,X,y): return X.transpose().dot( nablaL(X.dot(w),y) )


# __Exercise 1__
#
# Implement a gradient descent
# $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$
# Monitor the energy decay.
# Test different step sizes, and compare with the theory (in particular
# plot in the log domain to illustrate the linear rate).

# In[15]:

run -i nt_solutions/ml_4_sgd/exo1


# In[16]:

## Insert your code here.


# Stochastic Gradient Descent (SGD)
# ---------------------------------
# As for any empirical risk minimization procedure, the
# logistic classification minimization problem can be written as
# $$ \umin{w} E(w) = \frac{1}{n} \sum_i E_i(w) \qwhereq E_i(w) = L(\dotp{x_i}{w},y_i). $$
#
#
# For very large $n$ (which could in theory even be infinite, in which case the sum needs to be replaced
# by an expectation or, equivalently, an integral), computing $\nabla E$ is prohibitive.
# It is possible instead to use a stochastic gradient descent (SGD) scheme
# $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E_{i(\ell)}(w_\ell) $$
# where, for each iteration index $\ell$, $i(\ell)$
# is drawn uniformly at random in $ \{1,\ldots,n\} $.
#
#
# Note that here
# $$ \nabla E_{i}(w) = x_i \nabla L( \dotp{x_i}{w}, y_i )
#       \qwhereq \nabla L(u,v) = -v \, \th(-uv). $$

# In[17]:

def nablaEi(w,i): return MakeCol( -y[i] * X[i,:].transpose() * theta( -y[i] * (X[i,:].dot(w)) ) )


# Note that each step of a batch gradient descent has complexity $O(np)$,
# while a step of SGD only has complexity $O(p)$. SGD is thus
# advantageous when $n$ is very large, and one cannot afford to do
# several passes through the data. In some situations, SGD can provide
# accurate results even with $\ell \ll n$, exploiting redundancy between
# the samples.
#
#
# A crucial question is the choice of the step size schedule $\tau_\ell$. It
# must tend to 0 in order to cancel the noise induced on the gradient by
# the stochastic sampling, but it should not go to zero too fast, so that
# the method keeps converging.
#
#
# A typical schedule that ensures both properties is to have asymptotically $\tau_\ell \sim \ell^{-1}$ for
# $\ell\rightarrow +\infty$. We thus propose to use
# $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \ell/\ell_0} $$
# where $\ell_0$ indicates roughly the number of iterations serving as a
# "warmup" phase.
#
#
# One can prove the following convergence result
# $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \frac{1}{\sqrt{\ell}} }, $$
# where $\EE$ indicates an expectation with respect to the i.i.d.
# sampling performed at each iteration.
#
#
# Select default values for $ (\ell_0,\tau_0) $.

# In[18]:

l0 = 100
tau0 = .05


# __Exercise 2__
#
# Perform the stochastic gradient descent.
# Display the evolution of the energy $E(w_\ell)$.
# One can overlay on top (black dashed curve) the convergence of the batch gradient descent,
# with a careful scaling of the number of iterations to account for the fact that the
# complexity of a batch iteration is $n$ times larger.
# Perform several runs to illustrate the probabilistic nature of the method.
# Explore different values for $ (\ell_0,\tau_0) $.

# In[19]:

run -i nt_solutions/ml_4_sgd/exo2


# In[20]:

## Insert your code here.
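# Below is a minimal SGD sketch implementing the scheme above (this is not the
# solution of Exercise 2, which is provided in nt_solutions/ml_4_sgd/exo2; the
# number of iterations niter and the zero initialization are arbitrary choices).
# The full energy $E(w_\ell)$ is recorded at each step for display purposes only:
# this costs $O(np)$ per iteration and would be avoided in a truly large-scale setting.

niter = 5000
w = np.zeros((p,1))
Elist = np.zeros(niter)
for l in range(niter):
    tau = tau0 / (1 + l/l0)          # decaying step size tau_l
    i = np.random.randint(n)         # draw a sample index uniformly at random
    w = w - tau * nablaEi(w,i)       # stochastic gradient step
    Elist[l] = E(w,X,y)              # full energy, for monitoring only
plt.plot(Elist)
plt.xlabel('$\\ell$')
plt.ylabel('$E(w_\\ell)$');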
# Stochastic Gradient Descent with Averaging (SGA)
# ------------------------------------------------
# Stochastic gradient descent is slow because of the fast decay of
# $\tau_\ell$ toward zero.
#
#
# To somewhat improve the convergence speed, it is possible to average the past
# iterates, i.e. run a "classical" SGD on auxiliary variables $ (\tilde w_\ell)_\ell$
# $$ \tilde w_{\ell+1} = \tilde w_\ell - \tau_\ell \nabla E_{i(\ell)}(\tilde w_\ell) $$
# and output as estimated weight vector the average
# $$ w_\ell \eqdef \frac{1}{\ell} \sum_{k=1}^\ell \tilde w_k. $$
# This defines the Stochastic Gradient Descent with Averaging (SGA)
# algorithm.
#
#
# Note that it is possible to avoid explicitly storing all the iterates by simply
# updating a running average as follows
# $$ w_{\ell} = \frac{1}{\ell} \tilde w_\ell + \frac{\ell-1}{\ell} w_{\ell-1}. $$
#
#
# In this case, a typical choice of decay is rather of the form
# $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \sqrt{\ell/\ell_0}}. $$
# Notice that the step size now goes to 0 much more slowly, at rate $\ell^{-1/2}$.
#
#
# Typically, because the averaging stabilizes the iterates, the choice of
# $(\ell_0,\tau_0)$ is less important than for SGD.
#
#
# [Bach proves that](https://arxiv.org/pdf/1303.6149.pdf), for logistic classification,
# this leads to a faster convergence (the constants involved are
# smaller) than SGD, since,
# in contrast to SGD, SGA is adaptive to the local strong convexity of $E$.

# __Exercise 3__
#
# Implement the stochastic gradient descent with averaging.
# Display the evolution of the energy $E(w_\ell)$.

# In[21]:

run -i nt_solutions/ml_4_sgd/exo3


# In[22]:

## Insert your code here.


# Stochastic Averaged Gradient Descent (SAG)
# ------------------------------------------
# For problem sizes $n$ where the dataset (of size $n \times p$) can
# fully fit into memory, it is possible to further improve the SGA method
# by bookkeeping the previously computed gradients. This gives rise to the
# [Stochastic Averaged Gradient Descent (SAG)](https://arxiv.org/pdf/1309.2388) algorithm.
#
#
# We store all the previously computed gradients in $ (G^i)_{i=1}^n $,
# which requires $n \times p$ memory.
# The iterates are defined using a proxy $g$ for the batch gradient,
# which is progressively refined during the iterations.
#
#
# The algorithm reads
# $$ h \leftarrow \nabla E_{i(\ell)}(w_\ell), $$
# $$ g \leftarrow g - G^{i(\ell)} + h, $$
# $$ G^{i(\ell)} \leftarrow h, $$
# $$ w_{\ell+1} = w_\ell - \tau g. $$
# Note that in contrast to SGD and SGA, this method uses a fixed step
# size $\tau$. Similarly to BGD, in order to ensure convergence,
# the step size $\tau$ should be of the order of $1/L$,
# where $L$ is the Lipschitz constant of $E$.
#
#
# This algorithm improves over SGD and SGA
# since it has a convergence rate of $O(1/\ell)$.
# Furthermore, in the presence of strong convexity (for instance when $X$ is
# injective for logistic classification), it has a linear convergence rate,
# i.e.
# $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \rho^\ell }, $$
# for some $0 < \rho < 1$.
#
#
# Note that this improvement over SGD and SGA is made possible only because
# SAG explicitly uses the fact that $n$ is finite (while SGD and SGA can
# be extended to infinite $n$, i.e. to the more general minimization of
# expectations).

# __Exercise 4__
#
# Implement SAG.
# Display the evolution of the energy $E(w_\ell)$.

# In[23]:

run -i nt_solutions/ml_4_sgd/exo4


# In[24]:

## Insert your code here.
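# For reference, here is a minimal SAG sketch following the pseudo-code above (it is
# not the solution of Exercise 4, which is provided in nt_solutions/ml_4_sgd/exo4;
# niter and the fixed step size tau are arbitrary choices that may require tuning).
# In this sketch the proxy g accumulates the sum of the stored per-sample gradients,
# so the update divides it by $n$ to match the scale of the full gradient
# $\nabla E = \frac{1}{n}\sum_i \nabla E_i$; this normalization is a choice made here.

niter = 2*n
tau = .01                            # fixed step size, to be taken of the order of 1/L
w = np.zeros((p,1))
G = np.zeros((p,n))                  # stored per-sample gradients G^i
g = np.zeros((p,1))                  # proxy for the sum of the per-sample gradients
Elist = np.zeros(niter)
for l in range(niter):
    i = np.random.randint(n)         # draw a sample index uniformly at random
    h = nablaEi(w,i)                 # fresh gradient of E_i at the current iterate
    g = g - MakeCol(G[:,i]) + h      # refresh the aggregated gradient
    G[:,i] = h.flatten()             # store the new gradient of sample i
    w = w - tau/n * g                # gradient step with the proxy, rescaled by 1/n
    Elist[l] = E(w,X,y)              # full energy, for monitoring only
plt.plot(Elist)
plt.xlabel('$\\ell$')
plt.ylabel('$E(w_\\ell)$');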