Note: if you are viewing this notebook directly on GitHub, some mathematical symbols might display incorrectly or not at all. The same notebook can be rendered with nbviewer by following this link.
This project consists of clustering musical artists using a dataset of the top 50 most-played artists per user, from a random sample of ~360,000 Last.fm users, which can be found here. It is based on the idea that artists preferred by the same users tend to be similar.
import os, sys

def download_data():
    #Downloads and extracts the Last.fm 360K dataset
    import urllib.request, tarfile
    urllib.request.urlretrieve("http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz", "lastfm-dataset-360K.tar.gz")
    data_file = tarfile.open("lastfm-dataset-360K.tar.gz", 'r:gz')
    data_file.extractall()
    data_file.close()

def generate_sample(file_path=os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv', n=10000):
    #Writes the first n lines to a smaller file for quick experimentation
    with open('small_sample.tsv','w') as out:
        with open(file_path) as f:
            for i in range(n):
                out.write(f.readline())

# download_data()
# generate_sample()
import os, json

#Here I'm assuming there's a SparkContext already set up under the name 'sc'
# dataset=sc.textFile(os.getcwd()+'/small_sample.tsv')
dataset=sc.textFile(os.getcwd()+'/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv')
dataset=dataset.map(lambda x: x.split('\t')).filter(lambda x: len(x[1])>30) #removing invalid artists (e.g. 'bbc radio'), which lack a full-length MusicBrainz id
dataset=dataset.filter(lambda x: x[1]!='89ad4ac3-39f7-470e-963a-56509c546377') #removing 'various artists'

#converting hash values to integers to speed up the analysis
u_list=dataset.map(lambda x: x[0]).distinct().collect()
users={k: v for v, k in enumerate(u_list)}
a_list=dataset.map(lambda x: x[1]).distinct().collect()
artists={k: v for v, k in enumerate(a_list)}
n_users=len(u_list)
n_artists=len(a_list)
dataset=dataset.map(lambda x: (users[x[0]],artists[x[1]],x[2])) #keeping user, artist and artist name; the number of plays is not relevant here
dataset.cache()
del u_list, a_list, users, artists
print("There are ", n_users, " different users")
print("There are ", n_artists, " valid artists")
#Generating some useful files for later
#Artists in this dataset can appear under more than one name, so keep the first name seen for each artist id
def build_arts_dict(dataset):
    return dict(dataset.map(lambda x: (x[1],x[2])).groupByKey().mapValues(lambda names: list(names)[0]).collect())

arts_dic=build_arts_dict(dataset)
with open('artists.json', 'w') as outfile:
    json.dump(arts_dic, outfile)
del arts_dic
dataset.cache()
dataset.map(lambda x: (x[0],x[1])).saveAsTextFile('processed_data')
#Counting how many distinct users have each artist in their list
users_per_artist=dict(dataset.map(lambda x: (x[1],x[0])).groupByKey().mapValues(len).map(list).collect())
with open('users_per_artist.json', 'w') as outfile:
    json.dump(users_per_artist, outfile)
del users_per_artist
dataset.unpersist()
There are  359339  different users
There are  160162  valid artists
PythonRDD[202] at RDD at PythonRDD.scala:43
Calculating cosine similarities among artists (this penalizes according to the frequencies of artists across users, with a smaller penalty for asymmetric frequencies than, for example, Dice similarity), considering only whether an artist appears in a user's top-played list, regardless of the number of plays, and keeping only pairs with a high-enough similarity.
If we assume that each user has roughly 50 top-played artists, and each artist has an equal chance of appearing in any given user's list, then the expected cosine similarity between two artists, if everything happened by chance, could be approximated like this:

$$expected.sim=\frac{(\frac{50}{n.artists})^2 \times n.users}{\sqrt{50}^2} \approx 0.0007$$

However, since pairs of artists with such a similarity would likely not be in the same cluster, it could be a better idea to set an arbitrary threshold instead, so as to decrease the number of pairs. A cosine similarity of 0.1 would be equivalent to two artists each appearing in the playlists of 100 users and having 10 users in common; or, in a different case, to one artist appearing in 100 users' playlists and another in 50, with 7 users in common.
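As a quick sanity check (a minimal sketch, reusing the user and artist counts printed earlier), these numbers can be reproduced in a few lines of Python:

from math import sqrt
n_users, n_artists = 359339, 160162
#expected cosine similarity if the top-50 lists were filled at random
expected_sim = ((50.0/n_artists)**2) * n_users / (sqrt(50.0)**2)
print(expected_sim)  # ~0.0007
#the two example cases for a ~0.1 cosine similarity
print(10/(sqrt(100)*sqrt(100)))  # 100 users each, 10 in common -> 0.1
print(7/(sqrt(100)*sqrt(50)))    # 100 and 50 users, 7 in common -> ~0.099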
In order to decrease the number of artist pairs to evaluate in the clustering part and make it manageable on a single machine, it's better to set a minimum requirement of common users between artists; here I set it to 7, so a pair of artists needs at least 7 users in common to be assigned a non-zero similarity (otherwise artists could end up in the same cluster only because a single user listened to both). In addition, I set a threshold on the cosine similarity itself, at 4 times the expected value if everything happened at random, for a pair to be considered as having any similarity at all.
This dataset encompasses 360,000 users, each with a list of 50 top-played artists, covering 160,000 different artists in total. Trying to establish similarities between all the artists would imply looping through $160,000 \times (160,000-1)/2 \approx 13,000,000,000$ pairs, so it's better to first see which artists have users in common: of the 13 billion possible pairs, very few have at least one user in common. Doing it this way means looping over only $360,000 \times 50 \times (50-1)/2 \approx 441,000,000$ pairs, and in the process it's possible to count how many users each pair of artists has in common, which eases the later distance calculations.
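The quick arithmetic behind those figures (a small sketch, just reproducing the counts quoted above):

n_artists, n_users, top_n = 160162, 359339, 50
all_pairs = n_artists*(n_artists-1)//2               # ~1.3e10 possible artist pairs
pairs_sharing_a_user = n_users*(top_n*(top_n-1)//2)  # ~4.4e8 pairs to actually loop over
print(all_pairs, pairs_sharing_a_user)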
import itertools
from operator import add

def ordered(pair):
    #returns the pair with the smaller id first, so (a,b) and (b,a) are counted together
    n1,n2=pair
    if n2>n1:
        return n1,n2
    else:
        return n2,n1

dataset=sc.textFile('processed_data').map(eval).groupByKey().mapValues(list).map(lambda x: x[1]) #list of artists per user
dataset=dataset.flatMap(lambda x: [(ordered(i),1) for i in itertools.combinations(x,2)]).reduceByKey(add).map(lambda x: (x[0][0],x[0][1],x[1])).filter(lambda x: x[2]>6) #(artist1, artist2, common users), keeping pairs with at least 7 users in common
dataset.cache()
PythonRDD[10] at RDD at PythonRDD.scala:43
from math import sqrt
import json

#Taken from the previous results
n_artists=160162
n_users=359339
threshold=4*( ((50.0/n_artists)**2)*n_users/(sqrt(50.0)**2) )
with open('users_per_artist.json') as f:
    users_per_artist=json.load(f)
users_per_artist={int(k):v for k,v in users_per_artist.items()}
bc_dic=sc.broadcast(users_per_artist)
del users_per_artist
#cosine similarity: common users divided by the square roots of each artist's user count
dataset=dataset.map(lambda x: (x[0],x[1],x[2]*1.0/(sqrt(bc_dic.value[x[0]])*sqrt(bc_dic.value[x[1]])))).filter(lambda x: x[2]>threshold)
dataset.cache()
dataset.saveAsTextFile('sims')
print('There are ',dataset.count(),' non-zero pairwise distances')
There are 7607557 non-zero pairwise distances
Here I'll produce different clusterings with 100, 200, 500, 700 and 1000 clusters, using power iteration clustering (PIC), which provides similar (though usually slightly inferior) results to spectral clustering but runs faster and is scalable.
from pyspark.mllib.clustering import PowerIterationClustering as pic
import pandas as pd
import json
# dataset=sc.textFile('sims').map(eval)
# dataset.cache()
for n_clusters in [100, 200, 500, 700, 1000]:
    clusters=pic.train(dataset,n_clusters)
    clusters.assignments().map(lambda x: str(x[0])+','+str(x[1])).repartition(1).saveAsTextFile('clusts'+str(n_clusters))
    del clusters
dataset=pd.read_csv('clusts100/part-00000',header=None)
dataset.columns=['artist_id','cluster100']
dataset200=pd.read_csv('clusts200/part-00000',header=None)
dataset200.columns=['artist_id','cluster200']
dataset500=pd.read_csv('clusts500/part-00000',header=None)
dataset500.columns=['artist_id','cluster500']
dataset700=pd.read_csv('clusts700/part-00000',header=None)
dataset700.columns=['artist_id','cluster700']
dataset1000=pd.read_csv('clusts1000/part-00000',header=None)
dataset1000.columns=['artist_id','cluster1000']
dataset=dataset.merge(dataset200,how='outer',on='artist_id').merge(dataset500,how='outer',on='artist_id').merge(dataset700,how='outer',on='artist_id').merge(dataset1000,how='outer',on='artist_id')
with open('artists.json') as art:
artists_dict=json.load(art)
artists_dict={int(k):v for k,v in artists_dict.items()}
dataset['artist_name']=[artists_dict[art] for art in dataset['artist_id']]
dataset.to_csv('results_all.csv',index=False)
del dataset200,dataset500,dataset700,dataset1000
print(dataset.shape[0],' artists were clustered')
dataset.head()
55637 artists were clustered
 | artist_id | cluster100 | cluster200 | cluster500 | cluster700 | cluster1000 | artist_name
---|---|---|---|---|---|---|---
0 | 43120 | 54 | 56 | 328 | 29 | 976 | sy smith |
1 | 90797 | 74 | 154 | 103 | 466 | 829 | phoebe killdeer and the short straws |
2 | 10290 | 56 | 147 | 273 | 30 | 928 | judge jules |
3 | 17934 | 85 | 78 | 275 | 683 | 853 | strip steve |
4 | 30429 | 49 | 38 | 299 | 672 | 352 | marja mattlar |
Since most artists appear in only one or two users' playlists, it's unreliable (and computationally complex) to cluster them from so little data. That's why only a fraction (around one third) of the artists were considered for the clustering process.
Additional: clustering with scikit-learn (DBSCAN) and igraph (Louvain modularity); both are non-parallel. I chose these algorithms and parameters after some manual experimentation, looking at which ones give a reasonable spread of artists across clusters. These algorithms have the nice property of automatically determining the number of clusters.
#Rearranging the data format into plain csv lines
import re

dataset=sc.textFile('sims')
dataset=dataset.map(lambda x: re.sub(r'[()\s]','',x)) #strip parentheses and whitespace from the tuple representation
dataset.repartition(1).saveAsTextFile('sims_csv')
dataset.unpersist()
del dataset
import igraph, re
import sklearn.cluster
import scipy.sparse
import pandas as pd

dataset=pd.read_csv('sims_csv/part-00000',header=None)
dataset.columns=['art1','art2','sim']
dataset['dist']=[1-i for i in dataset['sim']]

#renumbering the artists present in these pairs so their ids are consecutive
present_artists=set(dataset['art1'].append(dataset['art2']).values.tolist())
new_numer_art_to_int=dict()
new_numer_int_to_art=dict()
for count, art in enumerate(present_artists):
    new_numer_art_to_int[art]=count
    new_numer_int_to_art[count]=art
del present_artists
dataset['art1']=[new_numer_art_to_int[i] for i in dataset['art1']]
dataset['art2']=[new_numer_art_to_int[i] for i in dataset['art2']]

#building a symmetric sparse distance matrix
I=dataset['art1'].append(dataset['art2'])
J=dataset['art2'].append(dataset['art1'])
V=dataset['dist'].append(dataset['dist'])
dataset_matrix=scipy.sparse.csr_matrix((V,(I,J)))
del I,J,V
dataset_matrix
dbsc=sklearn.cluster.DBSCAN(eps=0.775,metric='precomputed').fit_predict(dataset_matrix)
new_res=pd.Series(range(dataset_matrix.shape[0])).to_frame()
new_res.columns=['artist_id']
new_res['dbsc']=dbsc
del dbsc, dataset_matrix
g=igraph.Graph(edges=dataset[['art1','art2']].values.tolist(),directed=False)
g.es['weight']=dataset['sim'].values.tolist()
del dataset
louvain_weighted=g.community_multilevel(weights=g.es['weight'])
new_res['louvain']=louvain_weighted.membership
new_res['artist_id']=[new_numer_int_to_art[i] for i in new_res['artist_id']]
results=pd.read_csv('results_all.csv',engine='python')
results=results.merge(new_res,how='left',on='artist_id')
new_res=new_res.merge(results[['artist_id','cluster100','cluster200','cluster500','cluster700','cluster1000']],how='left',on='artist_id')
cols=results.columns.tolist()
cols=cols[6:7]+cols[1:6]+cols[7:9]
results=results[cols]
results.columns=[re.sub('cluster','pic',i) for i in results.columns]
new_res.columns=[re.sub('cluster','pic',i) for i in new_res.columns]
results.to_csv('results_all.csv',index=False)
results.head()
 | artist_name | pic100 | pic200 | pic500 | pic700 | pic1000 | dbsc | louvain
---|---|---|---|---|---|---|---|---
0 | sy smith | 54 | 56 | 328 | 29 | 976 | -1 | 55 |
1 | phoebe killdeer and the short straws | 74 | 154 | 103 | 466 | 829 | -1 | 62 |
2 | judge jules | 56 | 147 | 273 | 30 | 928 | -1 | 55 |
3 | strip steve | 85 | 78 | 275 | 683 | 853 | -1 | 62 |
4 | marja mattlar | 49 | 38 | 299 | 672 | 352 | -1 | 38 |
Note: a cluster assignment of -1 means that the row was not assigned to any cluster. In DBSCAN, most of the artists are not assigned to any cluster, so the clusters it does produce should be of better quality.
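As a toy illustration of scikit-learn's labeling convention (a hypothetical example, unrelated to this dataset), DBSCAN marks points that fall in no dense region with -1:

import numpy as np
import sklearn.cluster
X = np.array([[0.0], [0.1], [5.0]])  #two close points and one isolated point
print(sklearn.cluster.DBSCAN(eps=0.5, min_samples=2).fit_predict(X))  # [ 0  0 -1]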
sizes=[pd.Series(results[i].value_counts()) for i in results.columns[1:]]
sizes[5]=sizes[5][1:] #dropping the count of unassigned rows (-1) for DBSCAN
for i in range(len(sizes)):
    sizes[i].index=range(len(sizes[i]))
sizes=pd.DataFrame(sizes).transpose()
sizes.columns=results.columns[1:]
sizes.fillna('')
 | pic100 | pic200 | pic500 | pic700 | pic1000 | dbsc | louvain
---|---|---|---|---|---|---|---
0 | 1817 | 1013 | 414 | 287 | 287 | 2065 | 13605 |
1 | 1779 | 1005 | 361 | 287 | 196 | 724 | 6314 |
2 | 1752 | 908 | 360 | 271 | 188 | 289 | 6001 |
3 | 1664 | 875 | 350 | 251 | 187 | 261 | 4418 |
4 | 1547 | 762 | 349 | 246 | 183 | 217 | 2518 |
5 | 1507 | 757 | 343 | 244 | 179 | 187 | 2514 |
6 | 1442 | 750 | 337 | 244 | 176 | 177 | 2470 |
7 | 1329 | 749 | 332 | 241 | 174 | 176 | 2224 |
8 | 1324 | 746 | 332 | 237 | 170 | 176 | 1893 |
9 | 1319 | 730 | 332 | 235 | 168 | 169 | 1633 |
10 | 1292 | 724 | 329 | 234 | 168 | 162 | 1552 |
11 | 1204 | 722 | 327 | 229 | 165 | 159 | 1097 |
12 | 1188 | 716 | 326 | 228 | 165 | 158 | 960 |
13 | 1185 | 696 | 324 | 223 | 163 | 147 | 830 |
14 | 1184 | 688 | 324 | 221 | 160 | 144 | 828 |
15 | 1170 | 683 | 321 | 220 | 158 | 144 | 744 |
16 | 1146 | 678 | 319 | 220 | 158 | 137 | 684 |
17 | 1144 | 676 | 316 | 218 | 157 | 133 | 641 |
18 | 1120 | 663 | 314 | 218 | 157 | 120 | 641 |
19 | 1112 | 656 | 312 | 215 | 157 | 111 | 508 |
20 | 1084 | 653 | 311 | 215 | 156 | 101 | 496 |
21 | 972 | 652 | 310 | 214 | 156 | 98 | 367 |
22 | 950 | 645 | 305 | 214 | 155 | 85 | 350 |
23 | 907 | 622 | 304 | 211 | 154 | 84 | 328 |
24 | 890 | 622 | 301 | 209 | 153 | 82 | 302 |
25 | 888 | 615 | 300 | 209 | 149 | 80 | 237 |
26 | 880 | 606 | 300 | 208 | 149 | 80 | 215 |
27 | 788 | 594 | 299 | 208 | 148 | 79 | 208 |
28 | 753 | 592 | 299 | 207 | 148 | 75 | 157 |
29 | 724 | 591 | 292 | 207 | 147 | 72 | 102 |
... | ... | ... | ... | ... | ... | ... | ... |
970 | | | | | 2 | |
971 | | | | | 2 | |
972 | | | | | 2 | |
973 | | | | | 2 | |
974 | | | | | 2 | |
975 | | | | | 2 | |
976 | | | | | 2 | |
977 | | | | | 2 | |
978 | | | | | 2 | |
979 | | | | | 2 | |
980 | | | | | 2 | |
981 | | | | | 2 | |
982 | | | | | 2 | |
983 | | | | | 1 | |
984 | | | | | 1 | |
985 | | | | | 1 | |
986 | | | | | 1 | |
987 | | | | | 1 | |
988 | | | | | 1 | |
989 | | | | | 1 | |
990 | | | | | 1 | |
991 | | | | | 1 | |
992 | | | | | 1 | |
993 | | | | | 1 | |
994 | | | | | 1 | |
995 | | | | | 1 | |
996 | | | | | 1 | |
997 | | | | | 1 | |
998 | | | | | 1 | |
999 | | | | | 1 | |
1000 rows × 7 columns
From these results, it can be seen that 1000 clusters was definitely too many, since many artists ended up in their own cluster.
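This can be checked by counting the singleton clusters in each solution (a quick sketch, assuming the results_all.csv file generated above):

import pandas as pd
res = pd.read_csv('results_all.csv')
for col in ['pic100', 'pic200', 'pic500', 'pic700', 'pic1000']:
    print(col, ':', (res[col].value_counts() == 1).sum(), 'singleton clusters')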
Given the size of this dataset, it's not feasible to calculate typical clustering quality metrics such as the silhouette coefficient or the Dunn index, but some metrics for graph cuts can be used. In this case, I'll use modularity, which can be calculated very efficiently for this dataset. This metric is, however, very sensitive to singleton clusters (clusters of size 1) and favors larger clusters, so it might not be the best criterion for deciding which algorithm did better, but it's a good rough indicator. Possible values for modularity range from -0.5 to 1, with higher being better. For DBSCAN, however, this metric isn't comparable to the other algorithms, since most artists are not assigned to any cluster.
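To build some intuition for what modularity measures, here's a toy example (a hypothetical graph, not part of this pipeline): two triangles joined by a single edge score high when the partition matches the triangles, and negative when it cuts across them:

import igraph
g_toy = igraph.Graph(edges=[(0,1), (1,2), (0,2), (3,4), (4,5), (3,5), (2,3)])
print(g_toy.modularity([0, 0, 0, 1, 1, 1]))  # ~0.36: partition matches the two triangles
print(g_toy.modularity([0, 1, 0, 1, 0, 1]))  # ~-0.21: partition ignores the structure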
import numpy as np
print('Modularity for Power Iteration Clustering with 100 clusters :',g.modularity(membership=new_res['pic100'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 200 clusters :',g.modularity(membership=new_res['pic200'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 500 clusters :',g.modularity(membership=new_res['pic500'],weights=g.es['weight']),)
print('Modularity for Power Iteration Clustering with 700 clusters :',g.modularity(membership=new_res['pic700'],weights=g.es['weight']))
print('Modularity for Power Iteration Clustering with 1000 clusters :',g.modularity(membership=new_res['pic1000'],weights=g.es['weight']))
print('Modularity for Louvain Modularity (',len(set(louvain_weighted.membership)),'clusters) :',louvain_weighted.modularity)
print()
print("Results for DBSCAN:")
print("Number of clusters: ",len(set(results['dbsc']))-1)
print("Number of artists belonging to a cluster: ",len(results['dbsc'].loc[results['dbsc']!=-1]))
del g, louvain_weighted, new_res, new_numer_art_to_int, new_numer_int_to_art
Modularity for Power Iteration Clustering with 100 clusters : 0.037718345357822605
Modularity for Power Iteration Clustering with 200 clusters : 0.019087964635427532
Modularity for Power Iteration Clustering with 500 clusters : 0.0073897684741175
Modularity for Power Iteration Clustering with 700 clusters : 0.005237515826273008
Modularity for Power Iteration Clustering with 1000 clusters : 0.003573255160752189
Modularity for Louvain Modularity ( 76 clusters) : 0.6073314825719774

Results for DBSCAN:
Number of clusters:  343
Number of artists belonging to a cluster:  11867
Graph modularity seems to suggest that, in this case, power iteration clustering did far worse than Louvain modularity, but as mentioned before, this metric might not be the most adequate one, and it doesn't necessarily mean that the results are invalid; good knowledge of the music industry and manual examination would be needed to tell. Moreover, the clusters obtained from power iteration clustering come at different levels of granularity (according to the number of clusters).
Here I'll check some of the medium-sized clusters obtained from DBSCAN, which should be of better quality than those obtained from the other algorithms, since DBSCAN assigned only a fraction of the artists to a cluster.
clusters=results['dbsc'].value_counts()[10:15].index
clusters=[pd.DataFrame(results[['artist_name']].loc[results['dbsc']==i]) for i in clusters]
for i in range(len(clusters)):
    clusters[i].index=range(len(clusters[i]))
clusters=clusters[0].merge(clusters[1],left_index=True, right_index=True).merge(clusters[2],left_index=True, right_index=True).merge(clusters[3],left_index=True, right_index=True).merge(clusters[4],left_index=True, right_index=True)
clusters.columns=['cluster'+str(i) for i in range(1,len(clusters.columns)+1)]
clusters
 | cluster1 | cluster2 | cluster3 | cluster4 | cluster5
---|---|---|---|---|---
0 | edip akbayram | shreya ghosal | hans albers | ten typ mes | david houston |
1 | silahsız kuvvet | juggy d | jimmy makulis | bora | vince gill
2 | muharrem ertas | himesh reshammiya | heino | pijani powietrzem | kitty wells |
3 | ali akbar moradi | r.d.burman | michael heck | koniec świata | buck owens
4 | rojin | b21 | petra frey | pokahontaz | trace adkins |
5 | cem karaca | suzanne | maria & margot hellwig | lumpex75 | eric church |
6 | emre aydın | call | hansi hinterseer | coalition | janie fricke
7 | mozole mirach | jagjit singh | lale andersen | fu | brooks & dunn |
8 | fuat saka | noor jehan | vikinger | słowa we krwi | montgomery gentry
9 | sabahat akkiraz | dr. zeus | nino de angelo | pogotowie seksualne | toby keith |
10 | emrah | vishal bhardwaj | das stoakogler trio | 2cztery7 | lee ann womack |
11 | nazan Öncel | amrinder gill | manuela | strachy na lachy | johnny horton |
12 | agire jiyan | farida khanum | lys assia | hey | connie smith |
13 | ferhat göçer | kamal heer | g. g. anderson | kaliber 44 | mark chesnutt |
14 | athena | spb | uwe busse | kazik staszewski | donna fargo |
15 | grup yorum | sujatha | mara kayser | pezet | pam tillis |
16 | mt | bikram singh | bernd stelter | dezerter | collin raye |
17 | haluk levent | ghulam ali | wencke myhre | pezet-noon | eddy arnold |
18 | mazhar alanson | unni menon | markus becker | all wheel drive | clint black |
19 | sümer ezgü | fuzon | die stoakogler | zeus | jessica andrews |
20 | ajda pekkan | shantanu moitra | klostertaler | kombajn do zbierania kur po wioskach | keith whitley |
21 | fuat | kuldip manak | ireen sheer | muchy | patty loveless |
22 | tolga çandar | chitra | rex gildo | marek grechuta | terri clark |
23 | aynur doğan | udit narayan & alka yagnik | rudi schuricke | cool kids of death | bill anderson
24 | esengül | srinivas | uta bresan | infekcja | mindy mccready |
25 | nilgül | salim-sulaiman | oliver thomas | konkwista 88 | charley pride |
26 | bulutsuzluk Özlemi | babbu maan | roberto blanco | homomilitia | dwight yoakam |
27 | direc-t | mohit chauhan | freddy quinn | akurat | rascal flatts |
28 | muazzez ersoy | bombay jayashree | alpenrebellen | lady pank | vern gosdin |
29 | erol parlak | ismail darbar | katja ebstein | east west rockers | george jones |
... | ... | ... | ... | ... | ... |
117 | aydilge | a.r. rahman | jonny hill | farben lehre | porter wagoner |
118 | bülent ortaçgil | mithoon | nik p. | awantura | kellie pickler |
119 | Üç nokta bir | gurdas maan | andrea jürgens | indios bravos | charlie rich |
120 | aşkın nur yengi | shamur | andreas martin | pidżama porno | webb pierce
121 | hümeyra | shehzad roy | maxi arland | daab | hank thompson |
122 | gece yolcuları | rajesh roshan | axel becker | izrael | jason aldean
123 | yıldız tilbe | ali haider | kastelruther spatzen | tworzywo sztuczne | doug stone
124 | mor ve Ötesi | jasbir jassi | die flippers | waglewski fisz emade | faron young |
125 | yeni türkü | wadali brothers | bata illic | grzegorz turnau | sugarland |
126 | serdar ortaç | sunidhi chauhan | matthias reim | peja | jim ed brown |
127 | ismail hakkı demircioğlu | bally jagpal | peter maffay | happysad | tracy byrd
128 | nefret | karthik | brunner & brunner | płomień 81 | jason michael carroll
129 | demir demirkan | javed ali | slavko avsenik und seine original oberkrainer | rahim | sonny james |
130 | leman sam | shaan | andrea berg | cymeon x | chris cagle |
131 | hayko cepkin | abida parveen | die zillertaler | april | craig morgan |
132 | yal?n | asha bhosle & kishore kumar | marianne & michael | t.love | bobby bare |
133 | aylin aslım | stereo nation | jürgen | wwo | billy currington
134 | fuchs | rishi rich | jürgen drews | btm | rodney atkins |
135 | berdan mardini | sukhwinder singh | margot eskens | biała gorączka | claude king
136 | yüksek sadakat | vishal - shekhar | ernst mosch und seine original egerländer musi... | sound pollution | jimmy wayne |
137 | pinhani | rdb | die jungen zillertaler | guernica y luno | gene watson |
138 | Çelik | shail hada | angela wiedl | cała góra barwinków | lilu
139 | hande yener | amitabh bachchan | original naabtal duo | lilu | reba mcentire |
140 | soner arica | talat mahmood | vico torriani | maleo reggae rockers | johnny lee |
141 | emre altuğ | aziz mian | hanne haller | maria peszek | jim reeves
142 | abluka alarm | neeraj shridhar & tulsi kumar | gus backus | defekt muzgó | jimmy dean |
143 | cengiz Özkan | ali zafar | mickie krause | fisz | randy travis |
144 | bülent ersoy | chitra singh | juliane werding | pih | billie jo spears |
145 | ali ekber cicek | bally sagoo | wenche myhre | dixon37 | dan seals |
146 | betül demir | achanak | truck stop | molesta | lonestar |
147 rows × 5 columns
Trying to interpret the clusters: