#!/usr/bin/env python
# coding: utf-8

# Stochastic Gradient Descent
# ===========================
#
# *Important:* Please read the [installation page](http://gpeyre.github.io/numerical-tours/installation_python/) for details about how to install the toolboxes.
# $\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$
# $\newcommand{\enscond}[2]{\lbrace #1, #2 \rbrace}$
# $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$
# $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$
# $\newcommand{\umax}[1]{\underset{#1}{\max}\;}$
# $\newcommand{\uargmin}[1]{\underset{#1}{argmin}\;}$
# $\newcommand{\norm}[1]{\|#1\|}$
# $\newcommand{\abs}[1]{\left|#1\right|}$
# $\newcommand{\choice}[1]{ \left\{ \begin{array}{l} #1 \end{array} \right. }$
# $\newcommand{\pa}[1]{\left(#1\right)}$
# $\newcommand{\diag}[1]{{diag}\left( #1 \right)}$
# $\newcommand{\qandq}{\quad\text{and}\quad}$
# $\newcommand{\qwhereq}{\quad\text{where}\quad}$
# $\newcommand{\qifq}{ \quad \text{if} \quad }$
# $\newcommand{\qarrq}{ \quad \Longrightarrow \quad }$
# $\newcommand{\ZZ}{\mathbb{Z}}$
# $\newcommand{\CC}{\mathbb{C}}$
# $\newcommand{\RR}{\mathbb{R}}$
# $\newcommand{\EE}{\mathbb{E}}$
# $\newcommand{\Zz}{\mathcal{Z}}$
# $\newcommand{\Ww}{\mathcal{W}}$
# $\newcommand{\Vv}{\mathcal{V}}$
# $\newcommand{\Nn}{\mathcal{N}}$
# $\newcommand{\NN}{\mathcal{N}}$
# $\newcommand{\Hh}{\mathcal{H}}$
# $\newcommand{\Bb}{\mathcal{B}}$
# $\newcommand{\Ee}{\mathcal{E}}$
# $\newcommand{\Cc}{\mathcal{C}}$
# $\newcommand{\Gg}{\mathcal{G}}$
# $\newcommand{\Ss}{\mathcal{S}}$
# $\newcommand{\Pp}{\mathcal{P}}$
# $\newcommand{\Ff}{\mathcal{F}}$
# $\newcommand{\Xx}{\mathcal{X}}$
# $\newcommand{\Mm}{\mathcal{M}}$
# $\newcommand{\Ii}{\mathcal{I}}$
# $\newcommand{\Dd}{\mathcal{D}}$
# $\newcommand{\Ll}{\mathcal{L}}$
# $\newcommand{\Tt}{\mathcal{T}}$
# $\newcommand{\si}{\sigma}$
# $\newcommand{\al}{\alpha}$
# $\newcommand{\la}{\lambda}$
# $\newcommand{\ga}{\gamma}$
# $\newcommand{\Ga}{\Gamma}$
# $\newcommand{\La}{\Lambda}$
# $\newcommand{\Si}{\Sigma}$
# $\newcommand{\be}{\beta}$
# $\newcommand{\de}{\delta}$
# $\newcommand{\De}{\Delta}$
# $\newcommand{\phi}{\varphi}$
# $\newcommand{\th}{\theta}$
# $\newcommand{\om}{\omega}$
# $\newcommand{\Om}{\Omega}$
# $\newcommand{\eqdef}{\equiv}$

# This tour details Stochastic Gradient Descent (SGD), applied to the binary logistic classification problem.
#
# We recommend that after doing this Numerical Tour, you apply it to your
# own data, for instance using a dataset from [LibSVM](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
#
# _Disclaimer:_ these machine learning tours are intended to be
# overly-simplistic implementations and applications of baseline machine learning methods.
# For more advanced uses and implementations, we recommend
# to use a state-of-the-art library, the best known being
# [Scikit-Learn](http://scikit-learn.org/).

# In[1]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')


# Useful functions to convert between column and row vectors, and to find non-zero entries.
# In[2]:

# convert to a column vector
def MakeCol(y): return y.reshape(-1,1)
# convert to a row vector
def MakeRow(y): return y.reshape(1,-1)
# find indices of non-zero/true elements
def find(x): return np.nonzero(x)[0]


# Simple Example
# --------------
# We first illustrate the concept of stochastic gradient descent on a
# simple example
# $$ \umin{x \in \RR} f(x) \eqdef f_1(x) + f_2(x) $$
# where $f_1(x) \eqdef \frac{1}{2}(x-1)^2$ and
# $f_2(x) \eqdef \frac{1}{2}(x+1)^2$.
#
#
# Functions and their derivatives.

# In[3]:

def f(x,i):
    if i==1:
        return 1/2*(x-1)**2
    else:
        return 1/2*(x+1)**2
def df(x,i):
    if i==1:
        return x-1
    else:
        return x+1
def F(x): return f(x,1)+f(x,2)


# In[4]:

t = np.linspace(-3,3,201)
plt.plot(t,f(t,1))
plt.plot(t,f(t,2))
plt.plot(t,F(t))
plt.legend(('$f_1$', '$f_2$', '$F$'));


# Each iteration of SGD reads
# $$ x_{\ell+1} = x_\ell - \tau_\ell \nabla f_{i(\ell)}(x_\ell) $$
# where $ i(\ell) \in \{1,2\} $ is drawn uniformly at random.

# __Exercise 0__
#
# Implement SGD, with a random initial condition $x_0$.
# Display several paths (i.e. run the algorithm several times)
# to visualize the evolution of the density of the random variable
# $x_\ell$.

# In[5]:

run -i nt_solutions/ml_4_sgd/exo0


# In[6]:

## Insert your code here.


# Dataset Loading
# ---------------
# We load a subset of the "quantum" dataset, composed of $n=10000$ samples in dimension $78$.
# The goal in this task is to learn a classification rule that differentiates between
# two types of particles generated in high energy collider experiments.
#
#
# Load the dataset, randomly permute it, and separate the features $X$
# from the labels $y$ to predict.

# In[7]:

from scipy import io
name = 'quantum'
U = io.loadmat('nt_toolbox/data/ml-' + name)
A = U['A']
A = A[np.random.permutation(A.shape[0]),:]
X = A[:,0:-1]
y = A[:,-1]


# Set the class indexes to be $\{-1,+1\}$.

# In[8]:

y = 2*y-1
# y must be a column vector
y = MakeCol(y)


# Remove empty features, normalize $X$.

# In[9]:

I = find( np.abs(X).mean(axis=0)>1e-1 )
X = X[:,I]
# normalize
X = X-X.mean(axis=0)
X = X/X.std(axis=0)


# $n$ is the number of samples, $p$ is the dimensionality of the features.

# In[10]:

[n,p] = X.shape
print(n,p)


# Compute PCA main axes for display.

# In[11]:

# economy-size SVD, to avoid forming the full n x n matrix of left singular vectors
U, s, V = np.linalg.svd(X, full_matrices=False)
Xr = X.dot( V.transpose() )
# display the decay of the singular values
plt.plot(s, '.-');


# Plot the classes.

# In[12]:

I = find(y==-1)
J = find(y==+1)
plt.plot(Xr[I[0:200],0], Xr[I[0:200],1], '.')
plt.plot(Xr[J[0:200],0], Xr[J[0:200],1], '.')
plt.axis('equal');


# In[13]:

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(Xr[I[0:200],0], Xr[I[0:200],1], Xr[I[0:200],2], marker='.')
ax.scatter(Xr[J[0:200],0], Xr[J[0:200],1], Xr[J[0:200],2], marker='.')
plt.axis('tight');


# Batch Gradient Descent (BGD)
# ----------------------------
# We first test the usual (batch) gradient descent (BGD) on the problem of
# supervised logistic classification.
#
#
# We refer to the dedicated numerical tour on logistic classification for
# background and more details about the derivation of the energy and its
# gradient.
#
#
# Logistic classification aims at solving the following convex program
# $$ \umin{w} E(w) \eqdef \frac{1}{n} \sum_{i=1}^n L(\dotp{x_i}{w},y_i) $$
# where the logistic loss reads
# $$ L( s,y ) \eqdef \log( 1+\exp(-sy) ). $$
#
#
# Define the energy $E$ and its gradient $\nabla E$.
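# For reference, the gradient used below can be derived directly from the loss: writing
# $\th(v) \eqdef 1/(1+e^{-v})$ for the sigmoid function, one has
# $$ \pd{L(s,y)}{s} = \frac{-y \, e^{-sy}}{1+e^{-sy}} = -y \, \th(-sy), $$
# and hence, by the chain rule applied to $s = Xw$,
# $$ \nabla E(w) = \frac{1}{n} X^\top \pa{ -y \odot \th( -(Xw) \odot y ) }, $$
# where $\odot$ denotes the entry-wise product. This is what the functions
# nablaL and nablaE below implement.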
# In[14]:

def L(s,y): return 1/n * np.sum( np.log( 1 + np.exp(-s*y) ) )
def E(w,X,y): return L(X.dot(w),y)
def theta(v): return 1 / (1+np.exp(-v))
def nablaL(s,y): return - 1/n * y * theta(-s * y)
def nablaE(w,X,y): return X.transpose().dot( nablaL(X.dot(w),y) )


# __Exercise 1__
#
# Implement a gradient descent
# $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E(w_\ell). $$
# Monitor the energy decay.
# Test different step sizes, and compare with the theory (in particular
# plot in the log domain to illustrate the linear rate).

# In[15]:

run -i nt_solutions/ml_4_sgd/exo1


# In[16]:

## Insert your code here.


# Stochastic Gradient Descent (SGD)
# ---------------------------------
# As for any empirical risk minimization procedure, the
# logistic classification minimization problem can be written as
# $$ \umin{w} E(w) = \frac{1}{n} \sum_i E_i(w) \qwhereq E_i(w) = L(\dotp{x_i}{w},y_i). $$
#
#
# For very large $n$ (which could in theory even be infinite, in which case the sum needs to be replaced
# by an expectation or, equivalently, an integral), computing $\nabla E$ is prohibitive.
# It is possible instead to use a stochastic gradient descent (SGD) scheme
# $$ w_{\ell+1} = w_\ell - \tau_\ell \nabla E_{i(\ell)}(w_\ell) $$
# where, for each iteration index $\ell$, $i(\ell)$
# is drawn uniformly at random in $ \{1,\ldots,n\} $.
#
#
# Note that here
# $$ \nabla E_{i}(w) = x_i \nabla L( \dotp{x_i}{w}, y_i )
#       \qwhereq \nabla L(u,v) = -v \, \th(-uv). $$

# In[17]:

def nablaEi(w,i): return MakeCol( -y[i] * X[i,:].transpose() * theta( -y[i] * (X[i,:].dot(w)) ) )


# Note that each step of a batch gradient descent has complexity $O(np)$,
# while a step of SGD only has complexity $O(p)$. SGD is thus
# advantageous when $n$ is very large, and one cannot afford to do
# several passes through the data. In some situations, SGD can provide
# accurate results even with $\ell \ll n$, exploiting redundancy between
# the samples.
#
#
# A crucial question is the choice of the step size schedule $\tau_\ell$. It
# must tend to 0 in order to cancel the noise induced on the gradient by
# the stochastic sampling, but it should not go to zero too fast, so that
# the method keeps converging.
#
#
# A typical schedule that ensures both properties is to have asymptotically $\tau_\ell \sim \ell^{-1}$ for
# $\ell\rightarrow +\infty$. We thus propose to use
# $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \ell/\ell_0} $$
# where $\ell_0$ indicates roughly the number of iterations serving as a
# "warmup" phase.
#
#
# One can prove the following convergence result
# $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \frac{1}{\sqrt{\ell}} }, $$
# where $\EE$ indicates an expectation with respect to the i.i.d.
# sampling performed at each iteration.
#
#
# Select default values for $ (\ell_0,\tau_0) $.

# In[18]:

l0 = 100
tau0 = .05


# __Exercise 2__
#
# Perform the stochastic gradient descent.
# Display the evolution of the energy $E(w_\ell)$.
# One can overlay on top (black dashed curve) the convergence of the batch gradient descent,
# with a careful scaling of the number of iterations to account for the fact that the
# complexity of a batch iteration is $n$ times larger.
# Perform several runs to illustrate the probabilistic nature of the method.
# Explore different values for $ (\ell_0,\tau_0) $.

# In[19]:

run -i nt_solutions/ml_4_sgd/exo2


# In[20]:

## Insert your code here.
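# Below is a minimal SGD sketch implementing the scheme above (this is not the
# solution of Exercise 2, which is provided in nt_solutions/ml_4_sgd/exo2; the
# number of iterations niter and the zero initialization are arbitrary choices).
# The full energy $E(w_\ell)$ is recorded at each step for display purposes only:
# this costs $O(np)$ per iteration and would be avoided in a truly large-scale setting.

niter = 5000
w = np.zeros((p,1))
Elist = np.zeros(niter)
for l in range(niter):
    tau = tau0 / (1 + l/l0)          # decaying step size tau_l
    i = np.random.randint(n)         # draw a sample index uniformly at random
    w = w - tau * nablaEi(w,i)       # stochastic gradient step
    Elist[l] = E(w,X,y)              # full energy, for monitoring only
plt.plot(Elist)
plt.xlabel('$\\ell$')
plt.ylabel('$E(w_\\ell)$');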
# Stochastic Gradient Descent with Averaging (SGA)
# ------------------------------------------------
# Stochastic gradient descent is slow because of the fast decay of
# $\tau_\ell$ toward zero.
#
#
# To somewhat improve the convergence speed, it is possible to average the past
# iterates, i.e. run a "classical" SGD on auxiliary variables $ (\tilde w_\ell)_\ell$
# $$ \tilde w_{\ell+1} = \tilde w_\ell - \tau_\ell \nabla E_{i(\ell)}(\tilde w_\ell) $$
# and output as estimated weight vector the average
# $$ w_\ell \eqdef \frac{1}{\ell} \sum_{k=1}^\ell \tilde w_k. $$
# This defines the Stochastic Gradient Descent with Averaging (SGA)
# algorithm.
#
#
# Note that it is possible to avoid explicitly storing all the iterates by simply
# updating a running average as follows
# $$ w_{\ell} = \frac{1}{\ell} \tilde w_\ell + \frac{\ell-1}{\ell} w_{\ell-1}. $$
#
#
# In this case, a typical choice of decay is rather of the form
# $$ \tau_\ell \eqdef \frac{\tau_0}{1 + \sqrt{\ell/\ell_0}}. $$
# Notice that the step size now goes to 0 much more slowly, at rate $\ell^{-1/2}$.
#
#
# Typically, because the averaging stabilizes the iterates, the choice of
# $(\ell_0,\tau_0)$ is less important than for SGD.
#
#
# [Bach proves that](https://arxiv.org/pdf/1303.6149.pdf), for logistic classification,
# this leads to a faster convergence (the constants involved are
# smaller) than SGD, since,
# in contrast to SGD, SGA is adaptive to the local strong convexity of $E$.

# __Exercise 3__
#
# Implement the stochastic gradient descent with averaging.
# Display the evolution of the energy $E(w_\ell)$.

# In[21]:

run -i nt_solutions/ml_4_sgd/exo3


# In[22]:

## Insert your code here.


# Stochastic Averaged Gradient Descent (SAG)
# ------------------------------------------
# For problem sizes $n$ where the dataset (of size $n \times p$) can
# fully fit into memory, it is possible to further improve the SGA method
# by bookkeeping the previously computed gradients. This gives rise to the
# [Stochastic Averaged Gradient Descent (SAG)](https://arxiv.org/pdf/1309.2388) algorithm.
#
#
# We store all the previously computed gradients in $ (G^i)_{i=1}^n $,
# which requires $n \times p$ memory.
# The iterates are defined using a proxy $g$ for the batch gradient,
# which is progressively refined during the iterations.
#
#
# The algorithm reads
# $$ h \leftarrow \nabla E_{i(\ell)}(w_\ell), $$
# $$ g \leftarrow g - G^{i(\ell)} + h, $$
# $$ G^{i(\ell)} \leftarrow h, $$
# $$ w_{\ell+1} = w_\ell - \tau g. $$
# Note that in contrast to SGD and SGA, this method uses a fixed step
# size $\tau$. Similarly to BGD, in order to ensure convergence,
# the step size $\tau$ should be of the order of $1/L$,
# where $L$ is the Lipschitz constant of $E$.
#
#
# This algorithm improves over SGD and SGA
# since it has a convergence rate of $O(1/\ell)$.
# Furthermore, in the presence of strong convexity (for instance when $X$ is
# injective for logistic classification), it has a linear convergence rate,
# i.e.
# $$ \EE( E(w_\ell) ) - E(w^\star) = O\pa{ \rho^\ell }, $$
# for some $0 < \rho < 1$.
#
#
# Note that this improvement over SGD and SGA is made possible only because
# SAG explicitly uses the fact that $n$ is finite (while SGD and SGA can
# be extended to infinite $n$, i.e. to the more general minimization of
# expectations).

# __Exercise 4__
#
# Implement SAG.
# Display the evolution of the energy $E(w_\ell)$.

# In[23]:

run -i nt_solutions/ml_4_sgd/exo4


# In[24]:

## Insert your code here.
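# For reference, here is a minimal SAG sketch following the pseudo-code above (it is
# not the solution of Exercise 4, which is provided in nt_solutions/ml_4_sgd/exo4;
# niter and the fixed step size tau are arbitrary choices that may require tuning).
# In this sketch the proxy g accumulates the sum of the stored per-sample gradients,
# so the update divides it by $n$ to match the scale of the full gradient
# $\nabla E = \frac{1}{n}\sum_i \nabla E_i$; this normalization is a choice made here.

niter = 2*n
tau = .01                            # fixed step size, to be taken of the order of 1/L
w = np.zeros((p,1))
G = np.zeros((p,n))                  # stored per-sample gradients G^i
g = np.zeros((p,1))                  # proxy for the sum of the per-sample gradients
Elist = np.zeros(niter)
for l in range(niter):
    i = np.random.randint(n)         # draw a sample index uniformly at random
    h = nablaEi(w,i)                 # fresh gradient of E_i at the current iterate
    g = g - MakeCol(G[:,i]) + h      # refresh the aggregated gradient
    G[:,i] = h.flatten()             # store the new gradient of sample i
    w = w - tau/n * g                # gradient step with the proxy, rescaled by 1/n
    Elist[l] = E(w,X,y)              # full energy, for monitoring only
plt.plot(Elist)
plt.xlabel('$\\ell$')
plt.ylabel('$E(w_\\ell)$');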