Several of the algorithms provided by
sklearn are implemented as online learning or stochastic gradient descent methods, which operate on only a small portion of the data at any given time. MiniBatchKMeans and SGDClassifier, both mentioned below, are concrete examples.
In principle, these methods should support learning from an endless stream of data. However, in practice, the implementations require contiguous memory allocation (via
numpy.array) for data storage. This means that all data must be allocated up-front as a contiguous array, and we cannot simply have a
generator supply data examples as needed. As a result, we are limited by the amount of memory on the machine, even though the learning algorithm only ever operates on a small subset of the data at any given time. As discussed in detail on the Machined Learnings blog, it would be better if data could live out of core until it is needed.
To resolve this issue, we can implement a thin wrapper class that buffers the output from a data generator and, once the buffer is full, feeds it to the estimator via partial_fit().
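As a rough illustration, a minimal version of such a wrapper might look like the following. This is a simplified sketch handling only the unsupervised case, not the actual BufferedEstimator implementation used below:

```python
import numpy as np

class BufferedEstimator:
    """Buffer examples from a data generator and feed them to the
    wrapped estimator in fixed-size batches via partial_fit()."""

    def __init__(self, estimator, batch_size=256):
        self.estimator = estimator
        self.batch_size = batch_size

    def fit(self, generator):
        buf = []
        for example in generator:
            buf.append(example)
            if len(buf) == self.batch_size:
                # Buffer is full: hand one batch to the estimator
                self.estimator.partial_fit(np.asarray(buf))
                buf = []
        # Flush any leftover examples smaller than a full batch
        if buf:
            self.estimator.partial_fit(np.asarray(buf))
        return self
```

The real class would also need to handle supervised (x, y) streams and forward prediction calls to the wrapped estimator, but this captures the core idea: only batch_size examples are ever materialized at once.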
This notebook provides a brief demonstration of the idea by running k-means on a large collection of audio samples.
The dependencies are as follows:
--- Brian McFee, 2013-07-10.
# Our core estimator class
from sklearn.cluster import MiniBatchKMeans

# Our feature extraction library
import librosa

# Numerical routines (used by the mapper below)
import numpy as np

# Our data generator class
import glob
from datastream import datastream

# Finally, our generator buffer
from BufferedEstimator import BufferedEstimator
The mapper function is specific to this example. It takes as input a filename (i.e., a path to an audio file) and generates n randomly selected frames from the audio file's spectrogram.
def mapper(filename, n=20):
    """Audio frame generator.

    Given an input audio file, this generator will yield a random
    selection of spectrogram frames.

    :parameters:
      - filename : str
          Path to the audio file
      - n : int > 0
          Maximum number of frames to generate
    """
    SR = 22050
    N_FFT = 2048
    HOP_LENGTH = 512
    N_MELS = 128
    F_MAX = 8000

    def loadspec():
        y, sr = librosa.load(filename, sr=SR)
        S = librosa.feature.melspectrogram(y, sr,
                                           n_fft=N_FFT,
                                           hop_length=HOP_LENGTH,
                                           n_mels=N_MELS,
                                           fmax=F_MAX)
        return S / S.max()

    S = loadspec()

    for i in range(n):
        # Grab a random frame (column) of the spectrogram
        t = np.random.randint(0, S.shape[1])
        yield S[:, t]
The files will be processed by the datastream class as follows:

- datastream will maintain a pool of k random files (without replacement).
- Each file f is converted to a generator by the mapper function above.
- datastream then randomly multiplexes over the active generators.
If random multiplexing isn't your thing,
itertools.chain operates similarly, but exhausts each file sequentially. While chaining is much simpler, it is probably not what you want for stochastic optimization as it introduces a great deal of temporal dependence between the generated examples.
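For illustration, the chained alternative might look like this (chained_stream is a hypothetical name, shown with a toy mapper rather than the audio mapper above):

```python
from itertools import chain

def chained_stream(mapper, files):
    # Exhaust each file's generator in order, one file at a time.
    # Simple, but consecutive examples come from the same file,
    # which is undesirable for stochastic optimization.
    return chain.from_iterable(mapper(f) for f in files)
```

All examples from the first file are emitted before any from the second, which is exactly the temporal dependence noted above.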
If you're really interested, these wav files can be obtained here.
files = sorted(glob.glob('data/SMC_Mirex/SMC_MIREX_Audio/SMC_0*.wav'))
data_generator = datastream(mapper, files, k=10)
Our estimator is a simple
MiniBatchKMeans object. We'll use k=32 clusters for demonstration purposes.
Any sklearn.base.BaseEstimator subclass will work here, as long as it implements the partial_fit() method.
- If your estimator is unsupervised (as we have here), then the generator should yield a single feature vector x at each step.
- If your estimator is supervised (e.g., SGDClassifier), then the generator should yield a tuple (x, y) at each step, where x is a feature vector and y is its label.
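As a quick sketch, a supervised data generator could look like the following (labeled_stream is a hypothetical name, and the features and labels here are synthetic):

```python
import numpy as np

def labeled_stream(n=6, dim=4):
    # Yield (feature vector, label) pairs: the shape of stream a
    # supervised estimator such as SGDClassifier would consume.
    for i in range(n):
        x = np.random.randn(dim)  # synthetic feature vector
        y = i % 2                 # synthetic binary label
        yield (x, y)
```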
estimator = MiniBatchKMeans(n_clusters=32)
The buffered estimator object takes the k-means estimator we constructed above, and a batch_size for the buffer.
Here, we will perform each
partial_fit update on a batch of size at most 256.
buf_est = BufferedEstimator(estimator, batch_size=256)
To train the model, we simply call the fit() method on the data generator: buf_est.fit(data_generator). The buffered estimator consumes the stream one example at a time, so the full dataset never needs to reside in memory.
After training, we can visualize the results:
figure(figsize=(18, 6))

codebook = estimator.cluster_centers_.T

imshow(codebook ** 0.33333333,
       aspect='auto',
       interpolation='none',
       origin='lower')
vlines(np.arange(-.5, codebook.shape[1] - .5),
       -.5, codebook.shape[0] - .5,
       colors='w', linestyles='--')
colorbar()
title('Codebook, k=32')
pass