This is a little experiment that I ran to visualise the startup ecosystem in London, using tools like Laplacian eigenmaps and spectral clustering. It really demonstrates how easy it is to do basic visualisation and clustering on well-behaved data.
I use AngelList's free API to get some data. Playing with the search functionality I quickly found out that the tag '1695' corresponds to 'London', so most startups which have a presence or headquarters in London should have this tag. This is a bit inaccurate (for example, the taxi app Hailo is not featured here), but it's good enough for a demonstration. Below is the very simple code I used to access the API. The only interesting bit is the eventlet module, which allows me to easily make 20 concurrent API calls, speeding the process up significantly.
import simplejson as json
from eventlet import GreenPool
from eventlet.green import urllib2 as urllib2
def get_page(url):
    return json.loads(urllib2.urlopen(url).read())
base_url = 'https://api.angel.co/1/tags/1695/startups'
response = get_page(base_url)
last_page = response['last_page']
pool = GreenPool(20)
startups = []
for item in pool.imap(get_page,[base_url+'?page=%d'%(page_num+1) for page_num in range(last_page)]):
    startups.extend(item['startups'])
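For readers on Python 3, where `urllib2` no longer exists, the same fan-out can be sketched with the standard library's `concurrent.futures` instead of eventlet. This is only a sketch, not the code I ran: `fetch_all_pages` and its `fetch` parameter are hypothetical names, threads stand in for eventlet's green threads, and it assumes the endpoint returns `last_page` and `startups` fields as in the snippet above.

```python
from concurrent.futures import ThreadPoolExecutor
import json
from urllib.request import urlopen


def get_page(url):
    """Download one page of results and decode the JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def fetch_all_pages(base_url, fetch=get_page, workers=20):
    """Collect the 'startups' lists from every page, fetching pages concurrently.

    `fetch` is injectable (an assumption of this sketch) so the function can be
    exercised without network access.
    """
    first = fetch(base_url)  # only used to learn how many pages exist
    urls = ['%s?page=%d' % (base_url, n) for n in range(1, first['last_page'] + 1)]
    startups = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as `urls`
        for page in pool.map(fetch, urls):
            startups.extend(page['startups'])
    return startups
```

Calling `fetch_all_pages(base_url)` would then play the role of the loop above, with `workers` corresponding to the size of the `GreenPool`.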
I've filtered out startups which are 'hidden', as these don't have any public information available about them. To make the rest of the analysis more meaningful I also get rid of startups which have fewer than two market tags.
non_hidden = filter(lambda(x):not x['hidden'],startups)
non_hidden = filter(lambda(x):len(x['markets'])>1,non_hidden)
Just checking if the startup I work for, PeerIndex, is in the list:
'PeerIndex' in set([item['name'] for item in non_hidden])
True
I'll base my analysis on market tags. AngelList allows startups to enter multiple tags that describe the markets they are targeting, for example 'mobile commerce', 'e-commerce' or 'social networks'. These are entered by humans, and are therefore a little noisy. Nevertheless, I found that they are a great signal and contain a lot of useful information.
The following snippet of code transforms the JSON-like objects returned by the AngelList API into a sparse matrix. Possibly I could have used a library like scikit-learn's preprocessing module, but I had this code lying around so I could just reuse it.
import numpy as np
from scipy.sparse import coo_matrix
def sparse_encoding(items):
    dictionary = {}
    maxid = 0
    ii = []
    jj = []
    v = []
    for i,item in enumerate(items):
        for tag in item:
            j = dictionary.setdefault(tag,maxid)
            if j==maxid:
                maxid = maxid+1
            jj.append(j)
            ii.append(i)
            v.append(1)
    return coo_matrix((v,(ii,jj))),dictionary
The result is a sparse matrix with a non-zero element at position (i,j) if startup i has the market tag j. There are about 1.7k startups and some 500 market tags in the dataset. Note that that's probably not all startups in London, only the ones which have a profile on AngelList with a correct location tag.
data,dictionary = sparse_encoding([set([market['id'] for market in startup['markets']]) for startup in non_hidden])
spy(data.T)
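To make the encoding concrete, here is a small worked example on three made-up startups. It is a compact Python 3 restatement of the idea rather than the function above: `encode_tags` and the toy tag sets are hypothetical, and unlike the original it passes an explicit `shape` to `coo_matrix` so that the matrix size does not depend on which indices happen to occur.

```python
from scipy.sparse import coo_matrix


def encode_tags(tag_sets):
    """Build a binary item-by-tag matrix; columns are numbered in order of first appearance."""
    column_of = {}
    rows, cols = [], []
    for i, tags in enumerate(tag_sets):
        for tag in sorted(tags):  # sorted only to make the column order reproducible
            # a new tag gets the next free column index, a known tag keeps its own
            j = column_of.setdefault(tag, len(column_of))
            rows.append(i)
            cols.append(j)
    shape = (len(tag_sets), len(column_of))
    return coo_matrix(([1] * len(rows), (rows, cols)), shape=shape), column_of


# Three made-up startups, each described by a set of market tags
toy = [{'mobile', 'SaaS'}, {'SaaS', 'analytics'}, {'fashion', 'e-commerce'}]
matrix, column_of = encode_tags(toy)
print(matrix.toarray())
print(column_of)
```

The first two rows share a non-zero entry in the 'SaaS' column, while the third row has no column in common with either of them, which is exactly the kind of overlap the similarity computation picks up on.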
To analyse this data I'll compute pairwise affinities between startups. I'll use cosine similarity (one minus the cosine distance) between the binary vectors describing the market tags associated with startups. The code below is a bit inefficient as I convert the sparse matrix to a dense one, but with this much data it didn't make a difference and I couldn't be bothered to use anything more sophisticated. It is possible to compute cosine distances without making your data matrix dense.
from scipy.spatial.distance import pdist,squareform
A = 1-squareform(pdist(data.todense(),'cosine'))
A[isnan(A)] = 0
spy(A)
xlabel('startups')
ylabel('startups')
title('similarity between startups')
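For larger datasets, the same similarities can be computed without ever converting the data matrix to a dense array: scale each row to unit length and multiply the matrix by its transpose. The following is a minimal sketch under that assumption; `sparse_cosine_similarity` is a hypothetical helper, not part of SciPy, and it is not what I ran for this post.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags


def sparse_cosine_similarity(X):
    """Cosine similarity between the rows of a sparse matrix, without densifying X.

    Rows with no non-zero entries get similarity 0 to everything (including
    themselves), similar in spirit to replacing NaNs by zero.
    """
    X = csr_matrix(X, dtype=float)
    # Euclidean norm of every row
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    inverse = np.zeros_like(norms)
    nonzero = norms > 0
    inverse[nonzero] = 1.0 / norms[nonzero]
    X_unit = diags(inverse) @ X  # scale each row to unit length
    return X_unit @ X_unit.T  # sparse matrix of pairwise similarities
```

With the matrix from above, `sparse_cosine_similarity(data)` should give the same values as the dense computation for every startup that has at least one tag, and the result stays sparse because most pairs of startups share no tags at all.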
The result is a sparse matrix. Not much structure is visible, because startups are ordered randomly (well, not randomly, but by number of followers on AngelList, which is random for the purposes of this analysis). Let us reorder startups cleverly so that more structure becomes visible. I'll perform simple Laplacian eigenmaps: compute the graph Laplacian L and find the eigenvectors belonging to its smallest eigenvalues. The eigenvector with the second smallest eigenvalue will be a good dimension along which to order startups. For more info on Laplacian eigenmaps, I recommend you read the original paper.
from scipy.sparse.linalg import eigsh
Areg = A + 0.02*ones(A.shape)
D = diag(sum(Areg,0))
L = D - Areg
u,v = eigsh(L,3,D,which='SM')
order = sorted(range(data.shape[0]),key=lambda(x):v[x,1])
spy(A[order,:][:,order])
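To see why this reordering works, it helps to run the same recipe on a graph where the answer is known. The sketch below builds the Laplacian L = D - A, solves the generalised eigenproblem L v = λ D v, and orders the nodes by the eigenvector with the second smallest eigenvalue. It is an illustration under my own assumptions: `laplacian_eigenmap` is a hypothetical helper, the toy graph is made up, and it uses the dense solver `scipy.linalg.eigh` instead of the sparse `eigsh` call above.

```python
import numpy as np
from scipy.linalg import eigh


def laplacian_eigenmap(A, n_components=2, regulariser=0.02):
    """Embed the nodes of a graph with symmetric affinity matrix A.

    Solves L v = lambda D v and returns the eigenvectors with the smallest
    eigenvalues, dropping the first (constant) one.
    """
    # a small uniform affinity keeps the graph connected
    A_reg = np.asarray(A, dtype=float) + regulariser
    D = np.diag(A_reg.sum(axis=0))
    L = D - A_reg
    # dense generalised symmetric solver; eigenvalues come back in ascending order
    eigenvalues, eigenvectors = eigh(L, D)
    return eigenvectors[:, 1:n_components + 1]


# Toy graph: two groups of three nodes, listed in interleaved order
groups = np.array([0, 1, 0, 1, 0, 1])
A = (groups[:, None] == groups[None, :]).astype(float)
embedding = laplacian_eigenmap(A, n_components=1)
order = np.argsort(embedding[:, 0])
print(groups[order])
```

After sorting, the members of each group sit next to each other. Which group comes first is arbitrary, since the sign of an eigenvector is not determined.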
So now that the startups are ordered by the second smallest eigenvector of the Laplacian, a cluster structure becomes clearly visible. Let's look at it in more detail, using the second and third smallest eigenvectors to create a map of startups. In the figure below the x axis is the second and the y axis is the third smallest eigenvector of the Laplacian, but the meaning of the coordinates is not really relevant. Each dot represents a startup. Startups that are more similar to each other (in terms of cosine distance between their market tags) will be closer to each other on the plot.
plot(v[:,1],v[:,2],'.')
Clearly, a number of clusters emerge. Let's run clustering on the data and see if it identifies these clusters. I'll use SpectralClustering from the scikit-learn package, which is really simple to use.
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(affinity='precomputed',n_clusters = 7)
labels = clustering.fit_predict(Areg)
scatter(v[:,1],v[:,2],c=labels)
When we colour the dots according to the cluster labels spectral clustering found, we see that it roughly identified the clusters that were visible anyway. It does not look perfect, but that's because of the imperfect visualisation of the graph: Laplacian eigenmaps only gives us an approximate look at the structure of the data. Below I list the clusters that spectral clustering found, along with the 3 most popular market tags in each cluster and 10 examples of startups for each.
from collections import Counter
for l in range(7):
    cnt = Counter([market['name'] for startup,label in zip(non_hidden,labels) if label==l for market in startup['markets']])
    print 'Cluster %d, (%s)'%(l,', '.join([item[0] for item in cnt.most_common(3)]))
    for name,markets in [(startup['name'],[market['name'] for market in startup['markets']]) for startup,label in zip(non_hidden,labels) if label==l][:10]:
        print '\t%s:\t\t%s'%(name,markets)
Cluster 0, (SaaS, enterprise software, mobile)
    Popcorn Metrics: ['SaaS', 'analytics', 'marketplaces', 'predictive analytics']
    Nirit: ['SaaS', 'location based services', 'private social networking']
    Ometria: ['SaaS', 'e-commerce', 'retail technology', 'big data analytics']
    Task Tub: ['SaaS', 'project management']
    UCi2i: ['enterprise software', 'collaboration', 'small and medium businesses', 'unifed communications']
    Collider13: ['SaaS', 'B2B', 'marketplaces', 'big data']
    Deontics : ['SaaS', 'health care', 'big data', 'artificial intelligence']
    Slivers-of-Time: ['SaaS', 'ventures for good', 'recruiting', 'human resources']
    Beem: ['mobile', 'SaaS', 'enterprise software', 'B2B']
    Receipt Bank: ['SaaS', 'accounting', 'document management']
Cluster 1, (education, marketplaces, advertising)
    B2BWave: ['B2B', 'wholesale', 'trading']
    Yumbles: ['food and beverages', 'specialty foods', 'organic food']
    Agonyapp: ['q&a', 'women', 'private social networking', 'lifestyle']
    Echobox: ['analytics', 'artificial intelligence']
    Epicum Films: ['film', 'entertainment industry', 'film distribution']
    Trip Cubes: ['publishing', 'hotels', 'social media marketing', 'travel & tourism']
    Techspace London: ['startups', 'coworking', 'office space']
    VICTORIA SPRUCE: ['fashion', 'product design', 'design', 'shoes']
    InspiredChallenge: ['online travel', 'social travel', 'adventure travel', 'travel & tourism']
    Weknowvid: ['internet tv', 'video streaming']
Cluster 2, (e-commerce, fashion, marketplaces)
    COS Ventures: ['e-commerce', 'venture capital']
    Clothfusion: ['e-commerce', 'retail', 'fashion']
    Rare Pink: ['e-commerce', 'retail', 'jewelry']
    Baskettt: ['e-commerce', 'personal finance']
    Qikd: ['e-commerce', 'retail', 'mobile commerce', 'coupons']
    awesome.bi: ['e-commerce', 'consulting', 'retail technology', 'big data analytics']
    VirtualOS: ['enterprise software', 'e-commerce', 'virtual workforces', 'office space']
    Stylect: ['mobile', 'e-commerce', 'fashion', 'mobile commerce']
    Teeleap: ['e-commerce', 'retail', 'brand marketing', 'crowdfunding']
    WonderLuk: ['e-commerce', 'fashion', '3d printing']
Cluster 3, (mobile, location based services, education)
    Feerce: ['mobile', 'location based services', 'messaging']
    ArtRate: ['mobile', 'art', 'events', 'contests']
    Angie Fiber Communications: ['mobile', 'telecommunications', 'wireless', 'optical communications']
    Survey On Tablet Ltd: ['mobile', 'android', 'tablets', 'customer service']
    ToDone: ['mobile', 'location based services', 'task management']
    InstaSpa : ['mobile', 'personal health', 'health and wellness']
    Orin: ['mobile', 'music', 'music services', 'musicians']
    Lowdownapp Ltd: ['mobile', 'B2B']
    Redeemia: ['mobile', 'mobile commerce', 'deals', 'discounts']
    Veleza: ['mobile', 'reviews and recommendations', 'cosmetics', 'big data analytics']
Cluster 4, (social media, mobile, social commerce)
    cardkiwi.com: ['social media', 'education', 'k 12 education']
    TAG Education: ['social media', 'clean technology', 'education', 'k 12 education']
    Foundbite (by Mendzapp): ['mobile', 'social media', 'photo sharing']
    dubble : ['mobile', 'social media', 'photography', 'private social networking']
    Footballtracker: ['social media', 'sports', 'soccer']
    Clipling: ['social media', 'news', 'social bookmarking', 'social news']
    SayPublic: ['social media', 'market research']
    Styloola: ['social media', 'fashion', 'retail technology']
    The Woo Game: ['social media', 'online dating']
    X-Ray: ['social media', 'crowdsourcing', 'browser extensions', 'content discovery']
Cluster 5, (digital media, social media, mobile)
    Temple Bright LLP: ['digital media', 'e-commerce', 'startups', 'legal']
    Plaything: ['digital media', 'advertising', 'social media platforms', 'user experience design']
    Southern Cloud: ['digital media', 'video', 'video streaming', 'hardware + software']
    TeenPoke, Inc.: ['digital media', 'social media', 'teenagers', 'social network media']
    Bizarre Dragon: ['digital media', 'kids', 'mobile games', 'digital entertainment']
    Animal Vegetable Mineral Limited: ['digital media', 'mobile games', 'social television']
    NeotericUK Pvt Ltd: ['digital media', 'e-commerce', 'SEO', 'web design']
    Exclusive Sports Media Global LTD: ['digital media', 'sports', 'brand marketing', 'social news']
    Drivn: ['mobile', 'digital media', 'local', 'quantified self']
    Lobster : ['digital media', 'social media', 'marketplaces', 'crowdsourcing']
Cluster 6, (financial services, finance technology, finance)
    diliger: ['financial services', 'investment management', 'trading']
    RainMaker: ['investment management', 'finance', 'finance technology']
    Crowd Mortgage Ltd: ['real estate', 'financial services', 'peer-to-peer']
    The City PA : ['financial services', 'startups']
    Paper Street: ['financial services', 'peer-to-peer', 'finance', 'crowdfunding']
    Kensington & Chelsea Investment Project: ['financial services', 'business development']
    chronext.com: ['financial services', 'e-commerce', 'trading', 'curated web']
    Alpha R E Investment LLP: ['financial services', 'location based services', 'private social networking', 'angel investing']
    Proplend | Secured P2P Lending: ['financial services', 'peer-to-peer', 'financial exchanges', 'commercial real estate']
    Space Miles Holdings Ltd: ['financial services', 'adventure travel']
So that's it. It really wasn't hard. In comparison, doing the same thing over Twitter users was considerably harder; AngelList data is fairly well behaved.