It is argued in [1] that the outputs of deep neural networks tend to concentrate on one-dimensional manifolds as the network gets deeper. This notebook reproduces the pathology in the context of parametric deep nets, whereas [1] concentrated on deep Gaussian processes.
We also investigate whether this really is a problem for plain neural networks.
We show that the problem only arises for bad initialisations, where the units are driven into saturation. Despite such a bad initialisation, we show that the adadelta and rmsprop optimisers are able to undo it and learn an identity mapping.
[1] Duvenaud, David, et al. "Avoiding pathologies in very deep networks." arXiv preprint arXiv:1402.5836 (2014).
import numpy as np
import matplotlib.pyplot as plt

import breze.learn.mlp as mlp
from climin.initialize import randomize_normal
import climin.stops
from IPython import display
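Before setting up the actual model, here is a minimal numpy-only sketch (an addition to the original notebook; the depth, layer width and weight scales mirror the experiment below but are otherwise illustrative) of the mechanism described above: propagating points through many random tanh layers with a large weight scale drives a large fraction of the units into saturation, while a small scale keeps them in the roughly linear regime.
rng = np.random.RandomState(0)

def fraction_saturated(x, n_layers=10, n_hidden=10, scale=1.0):
    # Propagate ``x`` through ``n_layers`` random tanh layers whose weights are
    # drawn from N(0, scale ** 2) and return the fraction of units with |h| > 0.9.
    h = x
    for _ in range(n_layers):
        w = rng.standard_normal((h.shape[1], n_hidden)) * scale
        h = np.tanh(np.dot(h, w))
    return (abs(h) > 0.9).mean()

for scale in (0.2, 1.0):
    frac = fraction_saturated(rng.standard_normal((1000, 2)), scale=scale)
    print('scale=%.1f  fraction saturated=%.2f' % (scale, frac))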
This function plots the two-dimensional inputs and the two-dimensional outputs of the model side by side.
def plot(m, x, axs=None):
    """Given model ``m`` and the data ``x``, plot the inputs
    and the outputs of the model side by side to a figure."""
    if axs is None:
        _, axs = plt.subplots(1, 2, figsize=(16, 8))
    # The colour labels ``C`` are taken from the global scope.
    axs[0].scatter(x[:, 0], x[:, 1], c=C, lw=0, alpha=.2)
    y = m.predict(x)
    axs[1].scatter(y[:, 0], y[:, 1], c=C, lw=0, alpha=.2)
This function is used to show the mapping and a learning curve during training.
def report(axs, m, x, infos):
    axs[0].plot([i['n_iter'] for i in infos], [i['val_loss'] for i in infos])
    plot(m, x, axs[1:])
We create the same data set used by Duvenaud et al. [1].
X = np.random.standard_normal((5000, 2))

# Colour the points by quadrant; points close to the origin get their own colour.
C = np.zeros(X.shape[0])
C[(X[:, 0] > 0) & (X[:, 1] > 0)] = 1
C[(X[:, 0] > 0) & (X[:, 1] < 0)] = 2
C[(X[:, 0] < 0) & (X[:, 1] > 0)] = 3
C[(X[:, 0] < 0) & (X[:, 1] < 0)] = 4
C[(X ** 2).sum(axis=1) < 0.5] = 0
Let's have a look at it.
plt.scatter(X[:, 0], X[:, 1], c=C, lw=0, alpha=.2)
Instantiate a neural net. Feel free to play with the parameters.
n_layers = 10
n_hiddens = [10] * n_layers
transfers = ['tanh'] * n_layers
optimizer = 'rmsprop', {'step_rate': 1e-4, 'momentum': 0.9, 'decay': 0.9}
#optimizer = 'adadelta', {'step_rate': 1, 'momentum': 0.0, 'decay': 0.9, 'offset': 1e-8,}
#optimizer = 'gd', {'step_rate': 1e-2, 'momentum': 0.95, 'momentum_type': 'nesterov'}
m = mlp.Mlp(2, n_hiddens, 2, transfers, out_transfer='identity', loss='squared', optimizer=optimizer, batch_size=100)
We initialise the parameters and visually inspect the mapping afterwards. It can be seen that the standard deviation of the initial weights makes a huge difference: the right value preserves the 2D topology of the input very nicely.
Pick good = False for a bad initialisation and good = True for a good one.
good = False

if good:
    randomize_normal(m.parameters.data, 0, .2)
else:
    randomize_normal(m.parameters.data, 0, 1)

infos = []
_ = plot(m, X)
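As a quantitative complement to the plot, here is a small check (an addition to the original notebook; it only uses m.predict and numpy): the singular values of the centred outputs indicate how much of the 2D structure survives the untrained network. A second singular value close to zero means the outputs have essentially collapsed onto a one-dimensional manifold.
Y = m.predict(X)
# Singular values of the centred outputs; a tiny second value indicates that
# the outputs concentrate on (roughly) a one-dimensional manifold.
print(np.linalg.svd(Y - Y.mean(axis=0), compute_uv=False))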
We will train for at least 10,000 updates. Every time the loss improves on the previous best by at least some threshold, we extend training by at least 20%. We print out the current loss every 1000 updates.
Note that the training and validation sets are the same.
stop = climin.stops.Patience('val_loss', 10000, grow_factor=1.2, threshold=1e-5)
pause = climin.stops.ModuloNIterations(1000)

for info in m.powerfit((X, X), (X, X), stop=stop, report=pause):
    print '#updates=%(n_iter)g loss=%(val_loss)g' % info
    infos.append(info)
#updates=1000 loss=0.112511 #updates=2000 loss=0.0681373 #updates=3000 loss=0.0516869 #updates=4000 loss=0.0452826 #updates=5000 loss=0.035805 ... #updates=128000 loss=0.000256153 #updates=129000 loss=0.000291335 #updates=130000 loss=0.00028977
Recover the best parameters.
m.parameters.data[...] = info['best_pars']
Plot the learning curve and what the mapping looks like now.
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
report(axs, m, X, infos)
We have shown that the pathology described by Duvenaud et al. is not an issue for deep neural networks on this toy problem. It is mostly related to bad weight initialisations, and recently proposed optimisers seem to overcome it.
We cannot extrapolate from this toy problem to more challenging data sets, yet we have not found any evidence that the pathology is an issue for wider nets working on higher dimensional data.
Readers can convince themselves whether it is hard to find a good initialisation or to tune the optimisers; we did not find it to be.