What we do first is to show how the SOM can cluster RGB vector data, which should be a simple task if the algorithm works correctly. We take the data vectors as 3-dimensional random integers that take the value 0 or 1, so with three binary components we have 8 possible colors. The final SOM should therefore show exactly 8 clusters.

%reset -f
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt
import sys

msz0 = 50
msz1 = 50
dlen = 100*1000
dim = 3

# binary RGB data: each of the three components is 0 or 1, giving 8 possible colors
Data = np.random.randint(0, 2, size=(dlen, dim))
# Data = np.random.rand(dlen, dim)

reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize=[msz0, msz1], norm_method='var', initmethod='pca')
sm.train(n_job=2, shared_memory='no')

# reshape the trained codebook into an msz0 x msz1 image with one channel per dimension
tmp = np.zeros((msz0, msz1, dim))
codebook = getattr(sm, 'codebook')
codebook = SOM.denormalize_by(Data, codebook)
for i in range(codebook.shape[1]):
    tmp[:, :, i] = codebook[:, i].reshape(msz0, msz1)

fig = plt.imshow(tmp[:, :, 0:3])

Next, we have a set of real data containing measurements of different pollutants. In a first step we expect to see the visual correlation between the different pollutants, for example between pm2.5 and pm10.

%reset -f
from numpy import genfromtxt, savetxt
import numpy as np

# pollution data: first column holds the labels, first row the header
Data = genfromtxt(open('data/pollution.csv', 'r'), dtype=float, delimiter=',')[1:]
Labels = Data[:, 0]
Data = Data[:, 1:]
header = genfromtxt(open('data/pollution.csv', 'r'), delimiter=',', dtype=None)[0]
header = header[1:]
header = header[np.newaxis, :]
print 'size of data set: ', Data.shape

import sys
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt

msz0 = 50
msz1 = 50

from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd

# scatter matrix of the first 1000 rows to eyeball the pairwise correlations
df = DataFrame(data=Data[1:1000, :], columns=header.T)
fig = scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')

We can either use random values to initialize the SOM weight vectors or we can initialize them linearly. For linear initialization we first calculate the eigenvectors of the correlation matrix of the original data, then take the two largest eigenvalues with their corresponding eigenvectors and spread the weight vectors linearly along the SOM grid based on these two principal components. For large data sets it is better to use a randomized PCA method, as it is fast and the results are good; we use RandomizedPCA from the scikit-learn package. As we would expect, the first two PCs show which features are mutually correlated, though only in a linear way. If SOM training starts from these PCA-based initial values, it needs fewer iterations.

reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize=[msz0, msz1], norm_method='var', initmethod='pca')
sm.init_map()
setattr(sm, 'compname', header)
sm.view_map(which_dim='all')

The default training algorithm is batch training, and the learning parameters are selected automatically. n_job defines the number of cores to be used in parallel; it uses the Joblib multiprocessing library from scikit-learn. By default there is no shared memory, which means each additional job increases the amount of memory you need. Alternatively, you can use shared memory based on numpy.memmap, but to be honest I didn't get better performance and I suspect I didn't implement it correctly.
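To make the shared-memory option more concrete, here is a minimal sketch of the underlying idea, not SOMPY's actual code: the data is dumped to disk once with joblib.dump, and every worker memory-maps the same file read-only instead of receiving its own pickled copy. The helper names (parallel_bmus, _chunk_bmus) and the chunking scheme are my own illustrative assumptions, and depending on the version the joblib import may live under sklearn.externals.joblib instead.

import os
import tempfile
import numpy as np
from joblib import Parallel, delayed, dump, load

def _chunk_bmus(mmap_path, codebook, start, stop):
    # each worker memory-maps the shared file read-only: no per-worker copy
    data = load(mmap_path, mmap_mode='r')
    chunk = data[start:stop]
    # squared distance to every codebook vector; the constant ||x||^2 term
    # is dropped because it does not change the argmin
    d = -2 * chunk.dot(codebook.T) + (codebook ** 2).sum(axis=1)
    return d.argmin(axis=1)  # best-matching unit per sample

def parallel_bmus(data, codebook, n_job=2, chunk=10000):
    mmap_path = os.path.join(tempfile.mkdtemp(), 'data.mmap')
    dump(data, mmap_path)  # write the data to disk once, share it by path
    out = Parallel(n_jobs=n_job)(
        delayed(_chunk_bmus)(mmap_path, codebook, s, s + chunk)
        for s in range(0, data.shape[0], chunk))
    return np.concatenate(out)

The point is that the only things sent to each worker are the file path and the (small) codebook, so adding jobs should not multiply the memory needed for the data itself.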
The current implementation seems to scale roughly linearly with the amount of memory and the number of cores, but I have the feeling that my memory allocation is not optimal.

sm.train(n_job=1, shared_memory='no', verbose='on')
sm.view_map(which_dim='all')
sm.view_map(which_dim='all', pack='Yes', text_size=6, save='No', save_dir='')
sm.hit_map()

I used a specific data size because it was used in a map-reduce based implementation of SOM (http://www.hicomb.org/papers/HICOMB2011-01.pdf): 81920 X 256 on a 50X50 SOM grid. The best result reported in that paper (using 1024 cores) is 4.02 minutes, which doesn't look acceptable, as the data is not that large.

1- parallel map-reduce SOM: 241.2 seconds
2- Matlab: 69.621502 seconds
3- sompy: 26.1570 seconds

%reset -f
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt
import sys

msz0 = 50
msz1 = 50
dlen = 81920
dim = 256

Data = np.random.randint(0, 2, size=(dlen, dim))

reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize=[msz0, msz1], norm_method='var')
sm.train()

I also compared my implementation with Matlab on larger data sizes. For example:

1- data: 200*1000 X 20 on a 50X50 SOM grid
1-1- Matlab: 40.590471 seconds
1-2- sompy: 11.6 seconds

2- data: 400*1000 X 20 on a 50X50 SOM grid
2-1- Matlab: 81.514382 seconds
2-2- sompy: 24.992000 seconds

3- data: 800*1000 X 20 on a 50X50 SOM grid
3-1- Matlab: 166.681264 seconds
3-2- sompy: 49.350000 seconds

4- data: 200*1000 X 50 on an 80X80 SOM grid
4-1- Matlab: 191.710211 seconds
4-2- sompy: 74.932 seconds, but the CPU stalls for a while because memory fills up! This is the current problem: when I checked the resource usage, there seems to be a memory peak at the beginning that I haven't been able to track down.

%reset -f
from numpy import genfromtxt, savetxt
import numpy as np

# pollution data
Data = genfromtxt(open('data/pollution.csv', 'r'), dtype=float, delimiter=',')[1:]
Labels = Data[:, 0]
Data = Data[:, 1:]
header = genfromtxt(open('data/pollution.csv', 'r'), delimiter=',', dtype=None)[0]
header = header[1:]
header = header[np.newaxis, :]

# keep only the rows that contain no missing values
indnan = np.isnan(Data)
indrem = list()
print Data.shape
for i in range(Data.shape[0]):
    if not np.any(indnan[i, :]):
        indrem.append(i)
Data = Data[indrem]
print Data.shape

import sys
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt

msz0 = 50
msz1 = 50

# train on the first 48,000 rows and keep the rest as a hold-out set
reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data[1:48*1000], mapsize=[msz0, msz1], norm_method='var')
sm.train(n_job=1, shared_memory='no')
setattr(sm, 'compname', header)
sm.view_map(which_dim='all')

from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd

# hold-out data and the index of the variable we want to predict
data = Data[48*1000:58*1000]
Target = 8
print 'Variable to predict: ', header[0][Target]

# predict the target column from the remaining columns using the trained map
pred = sm.predict_by(data, Target, K=1)
real = data[:, Target]

accuracy = (1 - np.abs((pred - real) / real)) * 100
print 'median accuracy', np.median(accuracy)
print 'mean accuracy', np.mean(accuracy)
print 'std accuracy', np.std(accuracy)
print 'min accuracy', np.min(accuracy)
print 'max accuracy', np.max(accuracy)

# plot the first 100 true values against the predicted ones
DF = DataFrame({'True Value': real[1:100], 'Predicted Value': pred[1:100]})
fig = plt.figure()
DF.plot(DF.index, DF.columns[:], label=header[0][Target], colormap='jet', x_compat=True, style='.-')
plt.legend(loc='best', bbox_to_anchor=(1.0, 1.0), fontsize='medium')
plt.ylabel('values')
font = {'size': 12}
plt.rc('font', **font)
fig.set_size_inches(10, 10)
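For context, the idea behind predict_by can be sketched roughly as follows: match each new sample to its best units using every feature except the target column, then read the prediction off those units' codebook entries. The function som_predict below is only an illustrative reimplementation under that assumption, not the library's code; it expects a denormalized codebook and ignores the normalization that SOMPY applies internally.

import numpy as np

def som_predict(codebook, data, target, k=1):
    # columns used for matching: every feature except the target variable
    feats = [j for j in range(codebook.shape[1]) if j != target]
    cb = codebook[:, feats]
    preds = np.empty(data.shape[0])
    for i, x in enumerate(data[:, feats]):
        d = ((cb - x) ** 2).sum(axis=1)          # squared distance to each map unit
        knn = np.argsort(d)[:k]                  # indices of the k best-matching units
        preds[i] = codebook[knn, target].mean()  # average their target components
    return preds

With k=1 this reduces to reading the target value from the single best-matching unit, which mirrors the K=1 call above.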