# Buffered generators for online learning

Several of the algorithms provided by sklearn are implemented as online learning or stochastic gradient descent methods, which only operate on a small portion of data at any given time. As concrete examples, see sklearn.linear_model.SGDClassifier, sklearn.cluster.MiniBatchKMeans, or sklearn.decomposition.MiniBatchDictionaryLearning.

In principle, these methods should support learning from an endless stream of data. However, in practice, the implementations require contiguous memory allocation (via numpy.array) for data storage. This means that all data must be allocated up-front as a contiguous array, and we cannot simply have a generator supply data examples as needed. As a result, we would be limited by the amount of memory on the machine, even though the learning algorithm only ever operates on a small subset at any given time. As discussed in detail on the Machined Learnings blog, it would be better if data could live out of core until it is needed.

To resolve this issue, we can implement a thin wrapper class that buffers the output from a data generator, and once the buffer is full, feeds it to the estimator via partial_fit.
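
The heavy lifting is hidden inside BufferedEstimator, but the core idea is small enough to sketch. The following is a simplified illustration only, not the actual BufferedEstimator implementation; it assumes an estimator with a partial_fit() method and a generator that yields one feature vector at a time:

import numpy as np

class BufferedFitterSketch(object):
    # Simplified illustration only; not the real BufferedEstimator.

    def __init__(self, estimator, batch_size=256):
        self.estimator = estimator
        self.batch_size = batch_size

    def fit(self, data_generator):
        buf = []
        for x in data_generator:
            buf.append(x)
            if len(buf) == self.batch_size:
                # Buffer is full: run one partial_fit update, then reset
                self.estimator.partial_fit(np.asarray(buf))
                buf = []
        if buf:
            # Flush whatever remains when the stream ends
            self.estimator.partial_fit(np.asarray(buf))
        return self.estimator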

This notebook provides a brief demonstration of the idea by running k-means on a large collection of audio samples.

The dependencies are as follows:

• sklearn - for obvious reasons
• datastream, muxerator - a particular data generator for randomly multiplexing over files
• librosa - an audio processing library
• BufferedEstimator - the generator buffer class

To run this demo script, you will need to start ipython notebook in --pylab mode.

--- Brian McFee, 2013-07-10.

In [2]:
# Our core estimator class
from sklearn.cluster import MiniBatchKMeans

# Our feature extraction library
import librosa

# Our data generator class
import glob
from datastream import datastream

# Finally, our generator buffer
from BufferedEstimator import BufferedEstimator


This mapper function is specific to this example. It takes as input a filename (i.e., a path to an audio file), and yields n randomly selected frames from the audio file's mel spectrogram.

In [3]:
def mapper(filename, n=20):
    """Audio frame generator.

    Given an input audio file, this generator will yield a random
    selection of spectrogram frames.

    :parameters:
      - filename : str
          Path to the audio file

      - n : int > 0
          Maximum number of frames to generate
    """

    SR         = 22050
    N_FFT      = 2048
    HOP_LENGTH = 512
    N_MELS     = 128
    F_MAX      = 8000

    # Load the audio and compute its mel-scaled spectrogram
    y, sr = librosa.load(filename, sr=SR)
    S = librosa.feature.melspectrogram(y=y, sr=sr,
                                       n_fft=N_FFT,
                                       hop_length=HOP_LENGTH,
                                       n_mels=N_MELS,
                                       fmax=F_MAX)

    # Normalize the spectrogram to the range [0, 1]
    S = S / S.max()

    for i in range(n):
        # Grab a random frame
        t = np.random.randint(0, S.shape[1])

        yield S[:, t]


The list of files will be processed by the datastream class as follows:

• datastream will maintain a pool of k=10 active files.
• Each active filename f is converted to a generator by the above mapper function.
• datastream then randomly multiplexes over the k generators.
• When the pool is exhausted (all generators are empty), it is replenished by activating another k random files (without replacement).
• This process repeats until all data has been processed.

If random multiplexing isn't your thing, itertools.chain operates similarly, but exhausts each file sequentially. While chaining is much simpler, it is probably not what you want for stochastic optimization as it introduces a great deal of temporal dependence between the generated examples.
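
To make the multiplexing behavior concrete, here is a simplified sketch of random multiplexing over a pool of generators. This is illustrative only, and not the actual datastream implementation:

import random

def random_mux(generators):
    # Illustrative sketch only; not the actual datastream implementation.
    # Pick an active generator uniformly at random; drop it once exhausted.
    pool = list(generators)
    while pool:
        g = random.choice(pool)
        try:
            yield next(g)
        except StopIteration:
            pool.remove(g)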

If you're really interested, these wav files can be obtained here.

In [4]:
files = sorted(glob.glob('data/SMC_Mirex/SMC_MIREX_Audio/SMC_0*.wav'))

In [5]:
data_generator = datastream(mapper, files, k=10)


Our estimator is a simple MiniBatchKMeans object. We'll set n_clusters=32 for demonstration purposes.

Any sklearn.base.BaseEstimator class will work here, as long as it implements the partial_fit() method.

If your estimator is unsupervised (as we have here), then the generator should yield a single feature vector x at each step.

If your estimator is supervised (eg, SGDClassifier), then the generator should yield a tuple (x, y) at each step, where x is a feature vector and y is its label.
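
For example, a labeled-data generator for the supervised case might look like the following sketch, where X and Y are hypothetical stand-ins for your features and labels:

def labeled_stream(X, Y):
    # Hypothetical example: X is an iterable of feature vectors,
    # Y the corresponding labels.
    for x, y in zip(X, Y):
        yield (x, y)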

In [6]:
estimator = MiniBatchKMeans(n_clusters=32)


The buffered estimator object takes the k-means estimator we constructed above, and a batch_size for the buffer.

Here, we will perform each partial_fit update on a batch of size at most 256.

In [7]:
buf_est = BufferedEstimator(estimator, batch_size=256)


To train the model, we simply pass the data generator to the buffered estimator's fit() method.

In [8]:
buf_est.fit(data_generator)


After training, we can visualize the results:

In [10]:
figure(figsize=(18, 6))

# Each column of the codebook is one cluster centroid (a mel spectrum)
codebook = estimator.cluster_centers_.T

# Cube-root compression makes low-energy structure easier to see
imshow(codebook ** 0.33333333, aspect='auto', interpolation='none', origin='lower')

# Dashed white lines separate adjacent centroids
vlines(np.arange(-.5, codebook.shape[1] - .5), -.5, codebook.shape[0] - .5,
       colors='w', linestyles='--')
colorbar()
title('Codebook, k=32')
pass