This notebook will walk you through how to create and analyze networks using Twitter data.
To make a network in NetworkX using external data, we'll represent each connection as a tuple containing a pair of nodes. In this first section, we'll walk through some data preprocessing techniques together to get our data ready for analysis.
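As a small illustration of that format, here is a sketch that builds a graph from a hand-written list of tuples. The handles in `example_pairs` are made up for this example and are not part of the workshop data.

```python
import networkx as nx

# Hypothetical handles, used only to illustrate the edge-list format.
example_pairs = [('alice', 'bob'), ('bob', 'carol'), ('alice', 'carol')]

example_graph = nx.Graph()
# Each tuple becomes one edge; the nodes are created automatically.
example_graph.add_edges_from(example_pairs)

print(example_graph.number_of_nodes(), example_graph.number_of_edges())
```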
Let's take a look at the data we're working with.
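import json

# Load the friends list for the PyTennessee account.
with open('../materials/data/friends/list.PyTennessee.json') as f:
    data = json.load(f)

# Pair PyTennessee with each account in its friends list.
pairs = []
for user in data['users']:
    pairs.append(('PyTennessee', str(user['screen_name'])))

pairs[:10]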
If you run the section below, we'll end up with all of the friend and follower pairs across all of the files. You should end up with 1286 pairs.
# Because the relationship data is split across files, we need to
# walk through all of them to get the data.
import os

relationships_dir = '../materials/data/friend_relationships/'
for dir_path, dir_names, file_names in os.walk(relationships_dir):
    for file_name in file_names:
        with open(os.path.join(dir_path, file_name)) as p:
            pair_data = json.load(p)
        for k in pair_data.keys():
            # Each key holds two handles separated by whitespace.
            twitter_pair = k.split()
            source = pair_data[k]['relationship']['source']
            # Order each pair as (follower, followed).
            if source['following'] is True:
                pairs.append((str(twitter_pair[0]), str(twitter_pair[1])))
            elif source['followed_by'] is True:
                pairs.append((str(twitter_pair[1]), str(twitter_pair[0])))

len(pairs)
%matplotlib inline
import networkx as nx
# Build an undirected graph.
# Just from looking at it, is this network connected or unconnected?
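One possible way to approach this exercise is sketched below. It uses a small, made-up `sample_pairs` list in place of the real `pairs` list so that it runs on its own; in the notebook you would pass `pairs` instead.

```python
import networkx as nx

# Hypothetical stand-in for the `pairs` list built earlier.
sample_pairs = [('a', 'b'), ('b', 'c'), ('d', 'e')]

undirected = nx.Graph()
undirected.add_edges_from(sample_pairs)

# is_connected() is True only if every node can reach every other node.
connected = nx.is_connected(undirected)
components = list(nx.connected_components(undirected))
print(connected, len(components))
```

Drawing the graph with `nx.draw(undirected)` gives the visual check the question asks for, and `nx.is_connected()` confirms the answer.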
# Hint: if you want to sort a dictionary to easily
# find the highest and lowest values, use this function
# on the output of the centrality measures like degree_centrality():
import operator
def centrality_sort(centrality_dict):
    # Returns (node, score) tuples sorted from lowest to highest score.
    return sorted(centrality_dict.items(), key=operator.itemgetter(1))
# ex. degree_sorted = centrality_sort(degree_vals)
# Which nodes have the highest/lowest degree centrality?
# Which nodes have the highest/lowest betweenness centrality?
# Which nodes have the highest/lowest closeness centrality?
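A minimal sketch of how these three questions could be answered, assuming a small made-up graph named `sample` in place of the Twitter network. The sorting helper is repeated here so the example runs on its own.

```python
import operator
import networkx as nx

# Hypothetical graph: a hub with three neighbors and a short tail.
sample = nx.Graph([('hub', 'a'), ('hub', 'b'), ('hub', 'c'), ('c', 'd')])

def centrality_sort(centrality_dict):
    return sorted(centrality_dict.items(), key=operator.itemgetter(1))

degree_sorted = centrality_sort(nx.degree_centrality(sample))
betweenness_sorted = centrality_sort(nx.betweenness_centrality(sample))
closeness_sorted = centrality_sort(nx.closeness_centrality(sample))

# Sorted ascending: lowest score first, highest score last.
print(degree_sorted[-1], betweenness_sorted[-1], closeness_sorted[-1])
```

Because the lists are sorted in ascending order, slicing with `[:5]` gives the lowest scores and `[-5:]` gives the highest.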
# Let's look at subsections of the graph. We'll do this together.
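The notebook leaves this step open, so the following is only one possible way to pull out subsections, again using a made-up `sample` graph. `nx.ego_graph()` extracts a node's neighborhood, and `Graph.subgraph()` keeps a chosen set of nodes.

```python
import networkx as nx

# Hypothetical graph standing in for the Twitter network.
sample = nx.Graph([('hub', 'a'), ('hub', 'b'), ('hub', 'c'),
                   ('c', 'd'), ('d', 'e')])

# ego_graph keeps the chosen node plus everything within `radius` hops.
neighborhood = nx.ego_graph(sample, 'hub', radius=1)

# subgraph keeps only the listed nodes and the edges among them.
tail = sample.subgraph(['c', 'd', 'e'])

print(sorted(neighborhood.nodes()), sorted(tail.edges()))
```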
Let's add some direction to the graph. When we processed our data, we ordered the pairs so that the first handle in the pair is a follower of the second handle. We're not worrying about pairs that mutually follow each other right now.
# Build a directed graph.
# Run some degree centrality measures for directed graphs:
# in_degree_centrality(): incoming connections (people following you), normalized by the number of other nodes
# out_degree_centrality(): outgoing connections (people you follow), normalized by the number of other nodes
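A sketch of the directed version, assuming a made-up `sample_pairs` list ordered as (follower, followed). In the notebook, the real `pairs` list would take its place.

```python
import networkx as nx

# Hypothetical (follower, followed) pairs.
sample_pairs = [('a', 'hub'), ('b', 'hub'), ('c', 'hub'), ('hub', 'a')]

directed = nx.DiGraph()
# Each tuple becomes an edge pointing from the follower to the followed.
directed.add_edges_from(sample_pairs)

in_degree = nx.in_degree_centrality(directed)
out_degree = nx.out_degree_centrality(directed)
print(in_degree['hub'], out_degree['hub'])
```

Here `hub` is followed by all three other nodes, so its in-degree centrality is the maximum possible, while it follows only one node, so its out-degree centrality is lower.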
# Let's look at subsections of the graph, just like we did above.
# Top 20 highest in-degree centrality scores:
# Top 20 highest out-degree centrality scores:
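One way to produce these rankings is sketched below. The helper `top_n` is a hypothetical name introduced for this example, and the graph is made up; with the real network you would keep the default `n=20`.

```python
import operator
import networkx as nx

# Hypothetical (follower, followed) edges.
directed = nx.DiGraph([('a', 'hub'), ('b', 'hub'), ('c', 'hub'),
                       ('hub', 'a'), ('b', 'a')])

def top_n(centrality_dict, n=20):
    # Sort descending by score and keep the first n entries.
    ranked = sorted(centrality_dict.items(),
                    key=operator.itemgetter(1), reverse=True)
    return ranked[:n]

top_in = top_n(nx.in_degree_centrality(directed), n=2)
top_out = top_n(nx.out_degree_centrality(directed), n=2)
print(top_in, top_out)
```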
Does our network match any of the network models we discussed earlier?
# Analyze the models here.