Several of the algorithms provided by sklearn are implemented as online learning or stochastic gradient descent methods, which only operate on a small portion of the data at any given time. As concrete examples, see sklearn.linear_model.SGDClassifier, sklearn.cluster.MiniBatchKMeans, or sklearn.decomposition.MiniBatchDictionaryLearning.
In principle, these methods should support learning from an endless stream of data. In practice, however, the implementations require contiguous memory allocation (via numpy.array) for data storage. This means that all data must be allocated up-front as a contiguous array, and we cannot simply have a generator supply examples as needed. As a result, we are limited by the amount of memory on the machine, even though the learning algorithm only ever operates on a small subset of the data at any given time. As discussed in detail on the Machined Learnings blog, it would be better if the data could live out of core until it is needed.
To resolve this issue, we can implement a thin wrapper class that buffers the output of a data generator and, once the buffer is full, feeds it to the estimator via partial_fit.
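To make the idea concrete, here is a minimal sketch of what such a buffering wrapper might look like. This is a hypothetical illustration, not the BufferedEstimator implementation used below: the class name SimpleBufferedEstimator and its exact flushing behavior are assumptions.

```python
import numpy as np

class SimpleBufferedEstimator(object):
    """Hypothetical sketch of a buffering wrapper: accumulate examples
    from a generator, and flush each full buffer to the wrapped
    estimator via partial_fit."""

    def __init__(self, estimator, batch_size=256):
        self.estimator = estimator
        self.batch_size = batch_size

    def fit(self, generator):
        buf = []
        for x in generator:
            buf.append(x)
            if len(buf) >= self.batch_size:
                # Buffer is full: hand a contiguous array to the estimator
                self.estimator.partial_fit(np.asarray(buf))
                buf = []
        # Flush any leftover examples at the end of the stream
        if buf:
            self.estimator.partial_fit(np.asarray(buf))
        return self
```

The key point is that only batch_size examples ever need to be in memory at once, regardless of how long the input stream is.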
This notebook provides a brief demonstration of the idea by running k-means on a large collection of audio samples.
This notebook should be run in --pylab mode.

--- Brian McFee, 2013-07-10.

The dependencies are as follows:
# Our core estimator class
from sklearn.cluster import MiniBatchKMeans
# Our feature extraction library
import librosa
# Our data generator class
import glob
from datastream import datastream
# Finally, our generator buffer
from BufferedEstimator import BufferedEstimator
This mapper function is specific to this example. It takes as input a filename (i.e., a path to an audio file) and generates n randomly selected frames from the audio file's spectrogram.
def mapper(filename, n=20):
    """Audio frame generator.

    Given an input audio file, this generator will yield a random
    selection of spectrogram frames.

    :parameters:
      - filename : str
          Path to the audio file

      - n : int > 0
          Maximum number of frames to generate
    """
    SR = 22050
    N_FFT = 2048
    HOP_LENGTH = 512
    N_MELS = 128
    F_MAX = 8000

    def loadspec():
        y, sr = librosa.load(filename, sr=SR)
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=N_FFT,
                                           hop_length=HOP_LENGTH,
                                           n_mels=N_MELS, fmax=F_MAX)
        return S / S.max()

    S = loadspec()

    for i in range(n):
        # Grab a random frame
        t = np.random.randint(0, S.shape[1])
        yield S[:, t]
The list files will be processed by the datastream class as follows:

- datastream will maintain a pool of k=10 active files, drawn at random from files (without replacement).
- Each active file f is converted to a generator by the above mapper function.
- datastream then randomly multiplexes over the k generators.

If random multiplexing isn't your thing, itertools.chain operates similarly, but exhausts each file sequentially. While chaining is much simpler, it is probably not what you want for stochastic optimization, as it introduces a great deal of temporal dependence between the generated examples.
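The random-multiplexing step can be sketched in a few lines. This is a hypothetical illustration of the idea, not the datastream implementation itself; the function name mux and its exhaustion handling are assumptions.

```python
import random

def mux(generators, rng=random):
    """Hypothetical sketch of random multiplexing: repeatedly pick an
    active generator at random, yield its next item, and drop a
    generator once it is exhausted."""
    active = list(generators)
    while active:
        g = rng.choice(active)
        try:
            yield next(g)
        except StopIteration:
            active.remove(g)
```

Unlike itertools.chain, which drains each generator in sequence, this interleaves items from all active streams, breaking up temporal dependence between consecutive examples.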
If you're really interested, these wav files can be obtained here.
files = sorted(glob.glob('data/SMC_Mirex/SMC_MIREX_Audio/SMC_0*.wav'))
data_generator = datastream(mapper, files, k=10)
Our estimator is a simple MiniBatchKMeans object. We'll set the number of clusters to k=32 for demonstration purposes.

Any sklearn.base.BaseEstimator class will work here, as long as it implements the partial_fit() method.

If your estimator is unsupervised (as we have here), then the generator should yield a single feature vector x at each step.

If your estimator is supervised (e.g., SGDClassifier), then the generator should yield a tuple (x, y) at each step, where x is a feature vector and y is its label.
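For the supervised case, a data generator might look like the following sketch. The function name labeled_stream and the toy labeling rule are assumptions made for illustration; any generator yielding (x, y) pairs would do.

```python
import numpy as np

def labeled_stream(n=5, d=3, seed=42):
    """Hypothetical supervised data stream: each step yields a
    (feature_vector, label) pair, as a buffered SGDClassifier-style
    estimator would consume."""
    rng = np.random.RandomState(seed)
    for _ in range(n):
        x = rng.randn(d)
        y = int(x.sum() > 0)  # toy label rule for illustration
        yield x, y
```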
estimator = MiniBatchKMeans(n_clusters=32)
The buffered estimator object takes the k-means estimator we constructed above, and a batch_size for the buffer.
Here, we will perform each partial_fit update on a batch of size at most 256.
buf_est = BufferedEstimator(estimator, batch_size=256)
To train the model, we simply call the fit() method on the data generator.
buf_est.fit(data_generator)
After training, we can visualize the results:
figure(figsize=(18,6))
codebook = estimator.cluster_centers_.T
imshow(codebook ** 0.33333333, aspect='auto', interpolation='none', origin='lower')
vlines(np.arange(-.5, codebook.shape[1]-.5), -.5, codebook.shape[0]-.5, colors='w', linestyles='--')
colorbar()
title('Codebook, k=32')
pass