Notebook
First, we show how the SOM can cluster RGB vector data, which should be a simple task if the algorithm is working correctly. The data vectors are 3-dimensional random integers taking values of 0 or 1, so with three binary components there are 8 possible colors. Therefore, the final SOM should show exactly 8 clusters.
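As a rough sketch of this toy experiment (I'm assuming the SOMFactory interface of the current SOMPY package here; the exact names may differ between versions):

```python
import numpy as np
from sompy.sompy import SOMFactory  # API assumed from the current SOMPY package

# 3-dimensional binary "RGB" vectors: 2**3 = 8 possible colors
data = np.random.randint(0, 2, size=(10000, 3)).astype(float)

# A 20x20 map is more than enough to separate 8 clusters
som = SOMFactory.build(data, mapsize=[20, 20],
                       initialization='random', training='batch')
som.train(n_job=1, verbose='info')
```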
We also have a set of real data containing measurements of different pollutants. In the first step, we expect to see the visual correlation between different pollutants, for example PM2.5 and PM10.
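A hedged sketch of how such component-plane correlations could be inspected (View2D is assumed from the current SOMPY visualization module, and the pollutant column names here are placeholders, not the actual dataset):

```python
import numpy as np
from sompy.sompy import SOMFactory                # assumed API
from sompy.visualization.mapview import View2D    # assumed module path

# Placeholder for the real pollutant table: rows = samples, columns = pollutants.
pollutants = np.random.rand(5000, 8)
names = ['pm10', 'pm2.5', 'so2', 'no2', 'co', 'o3', 'temp', 'humidity']

som = SOMFactory.build(pollutants, mapsize=[30, 30], component_names=names)
som.train(n_job=1, verbose='info')

# One heat map ("component plane") per pollutant; visually similar planes
# (e.g. PM2.5 and PM10) indicate correlated features.
View2D(10, 10, "pollutant component planes", text_size=10).show(som, col_sz=4)
```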
We can either initialize the SOM weight vectors with random values or initialize them linearly. For the linear case, we first calculate the eigenvectors of the correlation matrix of the original data. Then we take the two largest eigenvalues and their corresponding eigenvectors and initialize the weight vectors linearly along the SOM grid based on these two principal components. For large data sizes it is better to use the randomized PCA method, as it is fast and the results are good; we use RandomizedPCA from the scikit-learn package. As we would expect, the first two PCs show which features are mutually correlated, though only in a linear sense. If SOM training starts from these PCA-based initial values, the training needs fewer iterations.
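The idea looks roughly like the sketch below (this is an illustration, not SOMPY's exact routine; I sketch it on mean-centred data, which with SOMPY's default variance normalization corresponds to the correlation matrix mentioned above; note that scikit-learn's old RandomizedPCA class is now PCA(svd_solver='randomized')):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_linear_init(data, msz0, msz1):
    """Hypothetical sketch of PCA-based linear initialization of a SOM codebook."""
    mean = data.mean(axis=0)
    centred = data - mean
    # Randomized solver: the fast option for large data
    pca = PCA(n_components=2, svd_solver='randomized')
    pca.fit(centred)
    pc = pca.components_                   # (2, dim) leading eigenvectors
    sd = np.sqrt(pca.explained_variance_)  # spread of the data along each PC
    # Spread the codebook vectors on a regular grid spanned by the two PCs.
    coef0 = np.linspace(-1, 1, msz0)
    coef1 = np.linspace(-1, 1, msz1)
    codebook = np.zeros((msz0 * msz1, data.shape[1]))
    for i, a in enumerate(coef0):
        for j, b in enumerate(coef1):
            codebook[i * msz1 + j] = mean + a * sd[0] * pc[0] + b * sd[1] * pc[1]
    return codebook

init_weights = pca_linear_init(np.random.rand(10000, 20), 50, 50)
```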
The default training algorithm is batch training, and the learning parameters are selected automatically. n_job defines the number of cores to be used in parallel; it uses the Joblib multiprocessing library (shipped with scikit-learn). By default there is no shared memory, which means that every additional job increases the amount of memory you need. Alternatively, you can use shared memory based on numpy.memmap, but to be honest I didn't get better performance with it and I suspect I didn't implement it correctly. The current implementation seems to scale linearly with the amount of memory and the number of cores, but I have the feeling that my memory allocation is not optimal.
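For what it's worth, here is a hypothetical illustration of the memmap-based sharing idea with Joblib (this is not SOMPY's internal code; chunk_bmus and the sizes are made up for the example):

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed, dump, load

def chunk_bmus(chunk, codebook):
    """Best-matching unit index for each row of `chunk` (squared Euclidean)."""
    d = ((chunk ** 2).sum(axis=1)[:, None]
         - 2.0 * chunk.dot(codebook.T)
         + (codebook ** 2).sum(axis=1)[None, :])
    return d.argmin(axis=1)

data = np.random.rand(20000, 20)          # stand-in for the training data
codebook = np.random.rand(50 * 50, 20)    # stand-in for a 50x50 SOM codebook

# Dump the data once and reopen it as a read-only memory map, so the worker
# processes share the same pages instead of each getting its own copy.
mmap_path = os.path.join(tempfile.mkdtemp(), 'data.mmap')
dump(data, mmap_path)
shared = load(mmap_path, mmap_mode='r')

step = 5000
bmus = Parallel(n_jobs=4)(
    delayed(chunk_bmus)(shared[i:i + step], codebook)
    for i in range(0, shared.shape[0], step))
bmus = np.concatenate(bmus)
```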
I used a specific data size, since it was used in a MapReduce-based implementation of SOM (http://www.hicomb.org/papers/HICOMB2011-01.pdf). The data size is 81920 X 256 on a 50X50 SOM grid. The best result achieved in that paper (using 1024 cores) is 4.02 minutes, which doesn't look acceptable, as the data is not that large:

1- Parallel MapReduce SOM: 241.2 seconds
2- Matlab: 69.621502 seconds
3- sompy: 26.1570 seconds
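The sompy timing can be reproduced along these lines (the actual number will of course depend on the machine; SOMFactory and train are again assumed from the current package, with default parameters):

```python
import time
import numpy as np
from sompy.sompy import SOMFactory  # assumed API, as above

data = np.random.rand(81920, 256)
som = SOMFactory.build(data, mapsize=[50, 50],
                       initialization='pca', training='batch')

t0 = time.time()
som.train(n_job=1, verbose='info')
print('training took %.1f seconds' % (time.time() - t0))
```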
I compared my implementation with Matlab for larger data sizes. For example:

1- data: 200*1000 X 20 on a 50X50 SOM grid
1-1- Matlab: 40.590471 seconds
1-2- sompy: 11.6 seconds

2- data: 400*1000 X 20 on a 50X50 SOM grid
2-1- Matlab: 81.514382 seconds
2-2- sompy: 24.992000 seconds

3- data: 800*1000 X 20 on a 50X50 SOM grid
3-1- Matlab: 166.681264 seconds
3-2- sompy: 49.350000 seconds

4- data: 200*1000 X 50 on an 80X80 SOM grid
4-1- Matlab: 191.710211 seconds
4-2- sompy: 74.932 seconds, but the CPU gets blocked for a while since memory becomes full! This is the current problem: when I checked the CPU performance, it seems that at the beginning there is a memory peak that I can't track down.