Welcome to the Santa Barbara Data Science Meetup

Thanks to our speaker, host, sponsors, and facilitators!

In [28]:
from IPython.display import Image
Image(url='http://it-ebooks.info/images/ebooks/3/agile_data_science.jpg')
Out[28]:
In [14]:
from IPython.display import Image
Image(url='http://www.cloudpointasia.com/var/site/storage/images/media/images/site-images/logos/citrixonline_logo_web/96233-1-eng-GB/citrixonline_logo_web.gif')
Out[14]:
In [15]:
Image(url='http://www.theactivityexchange.com/images/logo_small.png')
Out[15]:

The Santa Barbara R Users Group:

http://www.meetup.com/Santa-Barbara-R-Users-Group/

What is Data Science?

[...] the “data scientist.” It’s a high-ranking professional with the training and curiosity to make discoveries in the world of big data. The title has been around for only a few years. (It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.)

  • Something that should be done by statisticians, but isn't

As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn’t be confined to just a few highly trained people.

OK, it's trendy and misused... but what is it, really?

In [16]:
# The Data Science Venn Diagram (1.0) — Drew Conway
Image(url='http://static.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=750w')
Out[16]:
In [17]:
# Venn Diagram 2.0, Steven Geringer
Image(url='http://2.bp.blogspot.com/-Qi-0utjhySM/UsteLrV6NyI/AAAAAAAACNQ/AdkizQfS8l8/s1600/moz-screenshot-3-729576.png')
Out[17]:

Data Science = ?

$$ \bigcup \text{skills} $$

or

$$ \bigcap \text{skills} $$
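One way to make the two readings concrete is with Python sets. This is only a sketch: the three skill areas are the circles of Conway's diagram, while the two people and their skill sets are invented for illustration.

```python
# The three circles of Conway's Venn diagram.
REQUIRED = {"hacking", "math & statistics", "substantive expertise"}

# Hypothetical people, invented for this example.
people = {
    "alice": {"hacking", "math & statistics", "substantive expertise"},
    "bob": {"hacking", "substantive expertise"},
}


def is_data_scientist_union(skills):
    """Union reading: having any one of the skills is enough."""
    return bool(skills & REQUIRED)


def is_data_scientist_intersection(skills):
    """Intersection reading: every one of the skills is required."""
    return REQUIRED <= skills


for name in sorted(people):
    print(name,
          is_data_scientist_union(people[name]),
          is_data_scientist_intersection(people[name]))
```

Under the union reading both people qualify; under the intersection reading only the person with all three skills does.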

Let's look at the data!

A Meetup group dissection

In [18]:
import urllib2
import json

# Download the public profiles of the group's members
# (a single page of up to 150 results, ordered by name).
members = json.loads(urllib2.urlopen(
    "http://api.meetup.com/2/members?order=name" +
    "&group_urlname=Santa-Barbara-Data-Science&offset=0&format=json&page=150" +  
    "&sig_id=66734052&sig=f7fc02b7069092e6775332b25f01b69e21346b92"
    ).read())

# Keep only each member's name and bio.
bios = [{'name': x['name'], 'bio': x['bio']} for x in members['results']]
bios[-2:]  # peek at the last two entries
Out[18]:
[{'bio': u'; . )', 'name': u'violet'},
 {'bio': u'Professor of Computer Science at Westmont; Consultant on data mining and prediction problems',
  'name': u'Wayne Iba'}]
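The cell above indexes `x['bio']` directly, which raises `KeyError` if a record has no such key. Whether the API ever omits the field is an assumption here, not something the output shows. The sketch below parses a made-up payload with the same shape the notebook relies on (a top-level `results` list whose items carry `name`, `bio` and `topics`) and reads the bio defensively.

```python
import json

# Hypothetical payload, invented for this example. Only the field names
# ("results", "name", "bio", "topics") come from the notebook's own code.
raw = """
{"results": [
  {"name": "Ada", "bio": "Statistician at Example Corp",
   "topics": [{"name": "Data Mining"}]},
  {"name": "Bob", "topics": []}
]}
"""

members_demo = json.loads(raw)

# dict.get supplies a default, so a member without a "bio" key
# yields an empty string instead of raising KeyError.
bios_demo = [{"name": m["name"], "bio": m.get("bio", "")}
             for m in members_demo["results"]]

print(bios_demo)
```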
In [19]:
import nltk
from collections import Counter

# Count proper nouns (POS tag 'NNP') as a rough proxy for named entities.
entity_counter = Counter()
for x in bios:
    text = nltk.wordpunct_tokenize(x['bio'])
    for name,tag in nltk.pos_tag(text):
        if tag == 'NNP':
            entity_counter.update([name.capitalize()])
entity_counter.most_common(5)
Out[19]:
[(u'Ucsb', 21), (u'Appfolio', 13), (u'Hi', 10), (u'Data', 10), (u'Phd', 10)]
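The spellings `Ucsb` and `Phd` in this output come from `str.capitalize()`, which upper-cases the first character and lower-cases the rest. The sketch below reproduces the counting step on made-up bios. To stay runnable without NLTK and its tagger models, it swaps the POS tagger for a much cruder check (does the token start with an upper-case letter?), so it is not equivalent to the cell above.

```python
import re
from collections import Counter

# Made-up bios, for illustration only.
demo_bios = [
    "PhD student at UCSB",
    "Engineer at AppFolio, UCSB alum",
]

demo_counter = Counter()
for bio in demo_bios:
    # Crude stand-in for the POS tagger: keep alphabetic tokens that start
    # with an upper-case letter. A real tagger also uses context.
    for token in re.findall(r"[A-Za-z]+", bio):
        if token[0].isupper():
            # capitalize() lower-cases everything after the first letter,
            # which merges "UCSB" and "Ucsb" but also mangles acronyms.
            demo_counter.update([token.capitalize()])

print(demo_counter.most_common(2))
```

Normalizing with `capitalize()` trades readable acronyms for merged counts; `token.upper()` or a hand-written alias table would be alternatives.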
In [38]:
import matplotlib.pyplot as plt
import numpy as np
# Bar chart of the most common proper nouns, minus greetings and fragments.
fig = plt.figure(figsize=(15, 9))
ax = fig.add_subplot(111)
ax.set_xticks([])
stop_words = ['Data', 'Hi', 'Santa', 'Barbara', 'D', 'Ph', 'My', 'Science']
names, counts = zip(*[(name, count) for name, count in entity_counter.most_common(20)
                      if name not in stop_words])
x_pos = np.arange(len(counts))
plt.bar(x_pos - .4, counts, color='#eeefff')
# Write each name above its bar, raised in proportion to the label length.
for x, y, label in zip(x_pos, counts, names):
    plt.annotate(label, (x + 0.1, y + len(label) / 2.2),
                 ha='center', rotation=70, size='xx-large')
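The plotting cell filters the counts against a stop-list and then uses `zip(*pairs)` to split the surviving pairs into one sequence of names and one of counts. Here is that step in isolation, applied to the five tokens shown in the `Out[19]` output; the variable names with a `_demo` suffix are introduced only for this example.

```python
# The five most common tokens, copied from the Out[19] output.
top_tokens = [('Ucsb', 21), ('Appfolio', 13), ('Hi', 10), ('Data', 10), ('Phd', 10)]

# The same stop-list the plotting cell uses.
stop_list = ['Data', 'Hi', 'Santa', 'Barbara', 'D', 'Ph', 'My', 'Science']

kept = [(name, count) for name, count in top_tokens if name not in stop_list]

# zip(*pairs) transposes a list of (name, count) pairs into
# a tuple of names and a tuple of counts.
names_demo, counts_demo = zip(*kept)

print(names_demo)   # ('Ucsb', 'Appfolio', 'Phd')
print(counts_demo)  # (21, 13, 10)
```

Note that the unpacking fails with a `ValueError` if the filter removes every pair, so a guard is needed when the stop-list might cover all of the top tokens.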
In [24]:
from itertools import product
# Build a topic co-occurrence graph: nodes are topics, and an edge (a, b)
# means that at least one member listed both a and b as topics.
nodes = Counter()
edges = Counter()

for x in members['results']:
    for y in x['topics']:
        nodes.update([y['name']])
    # The a > b test keeps each unordered pair once and drops self-pairs.
    edges.update([(a['name'],b['name'])
          for a,b in product(x['topics'],x['topics']) if a['name'] > b['name']])

nodes.most_common(10)
Out[24]:
[(u'Machine Learning', 50),
 (u'Data Mining', 49),
 (u'Programming', 45),
 (u'Big Data', 44),
 (u'Internet & Technology', 35),
 (u'New Technology', 33),
 (u'Web Development', 33),
 (u'Software Developers', 30),
 (u'Computer Science', 30),
 (u'Entrepreneurship', 29)]
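The edge counting can be checked on a toy input. The sketch below uses two made-up members in the same shape as `members['results']` and compares the `product` plus `a > b` filter with an alternative built on `itertools.combinations`. It assumes that no member lists the same topic twice; with duplicates the two approaches would differ.

```python
from collections import Counter
from itertools import combinations, product

# Made-up members, invented for this example.
demo_members = [
    {"topics": [{"name": "Big Data"}, {"name": "Machine Learning"},
                {"name": "Programming"}]},
    {"topics": [{"name": "Machine Learning"}, {"name": "Big Data"}]},
]

edges_product = Counter()
edges_combos = Counter()
for member in demo_members:
    topic_names = [t["name"] for t in member["topics"]]
    # Approach used in the notebook: all ordered pairs, keeping those with a > b.
    edges_product.update((a, b) for a, b in product(topic_names, topic_names)
                         if a > b)
    # Alternative: sort in descending order, then take each unordered pair once.
    edges_combos.update(combinations(sorted(topic_names, reverse=True), 2))

print(edges_product == edges_combos)
print(edges_combos[("Machine Learning", "Big Data")])
```

Both counters store each pair with the alphabetically larger name first, so a pair shared by two members is counted twice under a single key.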
In [25]:
import networkx as nx

g = nx.Graph()
node_names = set()

for name, count in nodes.most_common(20):
    g.add_node(name, count=count)
    node_names.add(name)

# Keep only the edges whose endpoints are both among the top 20 topics.
for edge, count in edges.items():
    if edge[0] in node_names and edge[1] in node_names:
        g.add_edge(edge[0], edge[1], weight = count)

# Put each word of a label on its own line so that it fits inside the node.
labels = dict([(name, name.replace(' ','\n')) for name,_ in nodes.most_common(20)])
In [30]:
fig = plt.figure(figsize=(17, 10))
ax = fig.add_subplot(111)
pos = nx.spring_layout(g, k=4.9, scale=1000.0)
# Node size grows with the square of the number of members listing the topic.
nx.draw_networkx_nodes(g, pos, node_size=[7 * (d['count'] ** 2)
        for _, d in g.nodes(data=True)], alpha=0.8, node_color='#eeefff')
nx.draw_networkx_labels(g, pos, labels, font_size=18)
# Edge width grows with the square of the number of members sharing both topics.
nx.draw_networkx_edges(g, pos, width=[(d['weight'] / 10.0) ** 2
        for _, _, d in g.edges(data=True)], alpha=0.5, edge_color='g')
ax.set_xticks([])
ax.set_yticks([]);
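Both sizing formulas in the drawing cell square their input, which exaggerates differences between topics. A quick check of the arithmetic, using two member counts from the `Out[24]` output and two hypothetical edge weights (the real weights are not shown in the notebook):

```python
# Member counts copied from the Out[24] output.
topic_counts = {"Machine Learning": 50, "Entrepreneurship": 29}

# Same formula as the node_size argument in the drawing cell.
node_sizes = {name: 7 * count ** 2 for name, count in topic_counts.items()}

# Hypothetical edge weights (number of members listing both topics).
edge_weights = [10, 25]
# Same formula as the width argument in the drawing cell.
edge_widths = [(w / 10.0) ** 2 for w in edge_weights]

print(node_sizes)
print(edge_widths)
```

With these numbers the most popular topic gets a node size of 17500 against 5887, roughly three times larger for a count that is only about 1.7 times higher. As far as I know, networkx passes `node_size` to matplotlib as a marker area, so the drawn diameter would scale roughly linearly with the count.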