Notebook
First, we show how the SOM can cluster RGB vector data, which should be a simple task if the algorithm is working correctly. The data vectors are 3-dimensional random integers taking values of 0 or 1, so with three binary components there are 8 possible colors. Therefore, the final SOM should show exactly 8 clusters.
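As a rough sketch of this toy experiment (I'm assuming the SOMFactory interface of the current SOMPY package here; the exact names may differ between versions):

```python
import numpy as np
from sompy.sompy import SOMFactory  # API assumed from the current SOMPY package

# 3-dimensional binary "RGB" vectors: 2**3 = 8 possible colors
data = np.random.randint(0, 2, size=(10000, 3)).astype(float)

# A 20x20 map is more than enough to separate 8 clusters
som = SOMFactory.build(data, mapsize=[20, 20],
                       initialization='random', training='batch')
som.train(n_job=1, verbose='info')
```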
We also have a set of real data containing measurements of different pollutants. In the first step, we expect to see the visual correlation between different pollutants, for example PM2.5 and PM10.
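A hedged sketch of how such component-plane correlations could be inspected (View2D is assumed from the current SOMPY visualization module, and the pollutant column names here are placeholders, not the actual dataset):

```python
import numpy as np
from sompy.sompy import SOMFactory                # assumed API
from sompy.visualization.mapview import View2D    # assumed module path

# Placeholder for the real pollutant table: rows = samples, columns = pollutants.
pollutants = np.random.rand(5000, 8)
names = ['pm10', 'pm2.5', 'so2', 'no2', 'co', 'o3', 'temp', 'humidity']

som = SOMFactory.build(pollutants, mapsize=[30, 30], component_names=names)
som.train(n_job=1, verbose='info')

# One heat map ("component plane") per pollutant; visually similar planes
# (e.g. PM2.5 and PM10) indicate correlated features.
View2D(10, 10, "pollutant component planes", text_size=10).show(som, col_sz=4)
```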
We can either initialize the SOM weight vectors with random values or initialize them linearly. For the linear case, we first calculate the eigenvectors of the correlation matrix of the original data. Then we take the two largest eigenvalues and their corresponding eigenvectors and initialize the weight vectors linearly along the SOM grid based on these two principal components. For large data sizes it is better to use the randomized PCA method, as it is fast and the results are good; we use RandomizedPCA from the scikit-learn package. As we would expect, the first two PCs show which features are mutually correlated, though only in a linear sense. If SOM training starts from these PCA-based initial values, the training needs fewer iterations.
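The idea looks roughly like the sketch below (this is an illustration, not SOMPY's exact routine; I sketch it on mean-centred data, which with SOMPY's default variance normalization corresponds to the correlation matrix mentioned above; note that scikit-learn's old RandomizedPCA class is now PCA(svd_solver='randomized')):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_linear_init(data, msz0, msz1):
    """Hypothetical sketch of PCA-based linear initialization of a SOM codebook."""
    mean = data.mean(axis=0)
    centred = data - mean
    # Randomized solver: the fast option for large data
    pca = PCA(n_components=2, svd_solver='randomized')
    pca.fit(centred)
    pc = pca.components_                   # (2, dim) leading eigenvectors
    sd = np.sqrt(pca.explained_variance_)  # spread of the data along each PC
    # Spread the codebook vectors on a regular grid spanned by the two PCs.
    coef0 = np.linspace(-1, 1, msz0)
    coef1 = np.linspace(-1, 1, msz1)
    codebook = np.zeros((msz0 * msz1, data.shape[1]))
    for i, a in enumerate(coef0):
        for j, b in enumerate(coef1):
            codebook[i * msz1 + j] = mean + a * sd[0] * pc[0] + b * sd[1] * pc[1]
    return codebook

init_weights = pca_linear_init(np.random.rand(10000, 20), 50, 50)
```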
The default training algorithm is batch training, and the learning parameters are selected automatically. n_job defines the number of cores to be used in parallel; it uses the Joblib multiprocessing library (shipped with scikit-learn). By default there is no shared memory, which means that every additional job increases the amount of memory you need. Alternatively, you can use shared memory based on numpy.memmap, but to be honest I didn't get better performance with it and I suspect I didn't implement it correctly. The current implementation seems to scale linearly with the amount of memory and the number of cores, but I have the feeling that my memory allocation is not optimal.
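For what it's worth, here is a hypothetical illustration of the memmap-based sharing idea with Joblib (this is not SOMPY's internal code; chunk_bmus and the sizes are made up for the example):

```python
import os
import tempfile
import numpy as np
from joblib import Parallel, delayed, dump, load

def chunk_bmus(chunk, codebook):
    """Best-matching unit index for each row of `chunk` (squared Euclidean)."""
    d = ((chunk ** 2).sum(axis=1)[:, None]
         - 2.0 * chunk.dot(codebook.T)
         + (codebook ** 2).sum(axis=1)[None, :])
    return d.argmin(axis=1)

data = np.random.rand(20000, 20)          # stand-in for the training data
codebook = np.random.rand(50 * 50, 20)    # stand-in for a 50x50 SOM codebook

# Dump the data once and reopen it as a read-only memory map, so the worker
# processes share the same pages instead of each getting its own copy.
mmap_path = os.path.join(tempfile.mkdtemp(), 'data.mmap')
dump(data, mmap_path)
shared = load(mmap_path, mmap_mode='r')

step = 5000
bmus = Parallel(n_jobs=4)(
    delayed(chunk_bmus)(shared[i:i + step], codebook)
    for i in range(0, shared.shape[0], step))
bmus = np.concatenate(bmus)
```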
I used a specific data size, since it was used in a MapReduce-based implementation of SOM (http://www.hicomb.org/papers/HICOMB2011-01.pdf). The data size is 81920 X 256 on a 50X50 SOM grid. The best result achieved in that paper (using 1024 cores) is 4.02 minutes, which doesn't look acceptable, as the data is not that large:

1- Parallel MapReduce SOM: 241.2 seconds
2- Matlab: 69.621502 seconds
3- sompy: 26.1570 seconds
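The sompy timing can be reproduced along these lines (the actual number will of course depend on the machine; SOMFactory and train are again assumed from the current package, with default parameters):

```python
import time
import numpy as np
from sompy.sompy import SOMFactory  # assumed API, as above

data = np.random.rand(81920, 256)
som = SOMFactory.build(data, mapsize=[50, 50],
                       initialization='pca', training='batch')

t0 = time.time()
som.train(n_job=1, verbose='info')
print('training took %.1f seconds' % (time.time() - t0))
```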
I compared my implementation with Matlab for larger data sizes. For example:

1- data: 200*1000 X 20 on a 50X50 SOM grid
1-1- Matlab: 40.590471 seconds
1-2- sompy: 11.6 seconds

2- data: 400*1000 X 20 on a 50X50 SOM grid
2-1- Matlab: 81.514382 seconds
2-2- sompy: 24.992000 seconds

3- data: 800*1000 X 20 on a 50X50 SOM grid
3-1- Matlab: 166.681264 seconds
3-2- sompy: 49.350000 seconds

4- data: 200*1000 X 50 on an 80X80 SOM grid
4-1- Matlab: 191.710211 seconds
4-2- sompy: 74.932 seconds, but the CPU gets blocked for a while since memory becomes full! This is the current problem: when I checked the CPU performance, it seems that at the beginning there is a memory peak that I can't track down.