Facebook hasn't been explored as much as Twitter, mostly due to the complexity (and poor documentation) of its API
No 140-character limit like Twitter's => fuller, more grammatically correct text for NLP analysis
Very little official support for Python (the official SDKs cover JS and PHP, among a couple of others) => great problem to crack
Finally... (I) have never done this before => should be interesting
Log in to your Facebook account and go to https://developers.facebook.com/tools/explorer/ to set permissions for, and obtain, an access token.
ACCESS_TOKEN = ''
SEARCH_LIMIT = 500 # facebook allows 500 max
import facebook # pip install facebook-sdk, not facebook
import os
import random
# Plotting
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
# Making request, pretty print stuff
import requests
import json
import simplejson as sj
from prettytable import PrettyTable
from collections import defaultdict, Counter
# NLP!
import string
import nltk
from nltk.corpus import stopwords
import tagger as tag
from nltk.metrics import edit_distance
from pattern.vector import Document, Model, TFIDF, LEMMA, KMEANS, HIERARCHICAL, COSINE
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()
table = string.maketrans("", "")
sw = stopwords.words('english')
def pp(o):
'''
A helper function to pretty-print Python objects as JSON
'''
print json.dumps(o, indent=2)
def to_ascii(unicode_text):
'''
Converts unicode text to ascii. Also removes newline \n \r characters
'''
return unicode_text.encode('ascii', 'ignore').\
replace('\n', ' ').replace('\r', '').strip()
def strip(s, punc=False):
'''
Strips punctuation and whitespace from a string
'''
if punc:
stripped = s.strip().translate(table, string.punctuation)
return ' '.join(stripped.split())
else:
return ' '.join(s.strip().split())
def lower(word):
'''
Lowercases a word
'''
return word.lower()
def lemmatize(word):
'''
Lemmatizes a word
'''
return lemmatizer.lemmatize(word)
def stem(word):
'''
Stems a word using the Porter Stemmer
'''
    return stemmer.stem(word)  # stem_word() is deprecated in newer NLTK releases
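Note that `string.maketrans("", "")` and the two-argument `str.translate` used above are Python 2 idioms. As an aside, a Python 3 equivalent of the punctuation-stripping helper could be sketched as follows (the name `strip3` is just illustrative):

```python
import string

# In Python 3, str.maketrans's third argument lists characters to delete
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def strip3(s, punc=False):
    '''Optionally strips punctuation, then collapses whitespace.'''
    if punc:
        s = s.translate(PUNC_TABLE)
    return ' '.join(s.split())

print(strip3('  Hello,   world!  ', punc=True))  # -> Hello world
```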
The spelling corrector below is from Peter Norvig's famous blog post http://norvig.com/spell-correct.html.
It was tested out, but the final pipeline below does not apply spelling correction, since it seemed to decrease the search result scores.
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
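The corrector above is trained on `big.txt`; the same functions work with any corpus, so here is a minimal self-contained sanity check using a toy corpus in place of `big.txt` (the tiny corpus is purely illustrative):

```python
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Start every count at 1 so unseen words are never zero-probability
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one deletion, transposition, replacement, or insertion away
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Toy corpus standing in for big.txt
NWORDS = train(words('the quick brown fox spelling bee spelling test'))

def known(ws): return set(w for w in ws if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('speling'))  # one insertion away from 'spelling'
print(correct('quik'))     # one insertion away from 'quick'
```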
g = facebook.GraphAPI(ACCESS_TOKEN)
To retrieve a group's feed, you first need to obtain the group's ID. To my knowledge, Facebook doesn't offer any easy way to do that: 'View Page Source' is one option, but third-party services such as http://wallflux.com/facebook_id/ are much easier to use.
The example below uses the Berkeley CS Group https://www.facebook.com/groups/berkeleycs/
# Only needs to make connection once
cal_cs_id = '266736903421190'
cal_cs_feed = g.get_connections(cal_cs_id, 'feed', limit=SEARCH_LIMIT)['data']
pp(cal_cs_feed[15])
{
  "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQALS2YIM9l4s9-y&w=154&h=154&url=http%3A%2F%2Flh3.googleusercontent.com%2Fuy3HQlPPel_qsIn6G8Za5SgyJBD6JIRx-uIEEFgF0XGbyHYeJQBPSSDpC4-wj-Ttdvg",
  "from": {
    "name": "Arjun Ghai",
    "id": "543681184"
  },
  "name": "CETSA Membership Form",
  "caption": "docs.google.com",
  "privacy": {
    "value": ""
  },
  "actions": [
    {
      "link": "https://www.facebook.com/266736903421190/posts/567231850038359",
      "name": "Comment"
    },
    {
      "link": "https://www.facebook.com/266736903421190/posts/567231850038359",
      "name": "Like"
    }
  ],
  "updated_time": "2013-12-08T22:48:10+0000",
  "to": {
    "data": [
      {
        "name": "Computer Science",
        "id": "266736903421190"
      }
    ]
  },
  "link": "https://docs.google.com/forms/d/1fb5lr77I0lLrtZuMmfuO_GWCMyQ-HxmZ7DMCf0FjLjo/viewform",
  "likes": {
    "paging": {
      "cursors": {
        "after": "NjI4NjQwMTMz",
        "before": "MTAwMDAwMjY2NjUyNTE5"
      }
    },
    "data": [
      {
        "id": "100000266652519",
        "name": "Sebastian Edward Shanus"
      },
      {
        "id": "628640133",
        "name": "Pavan Patel"
      }
    ]
  },
  "created_time": "2013-12-08T22:48:10+0000",
  "message": "Looking to put the hours you have spent understanding new material to good use and maybe make some money out of it. The Center for Entrepreneurship and Technology Student Association (CETSA) connects students to large corporations based here in the Bay Area. Fill out the membership form and see how CETSA can help grow your network and land you a job/internship. https://docs.google.com/forms/d/1fb5lr77I0lLrtZuMmfuO_GWCMyQ-HxmZ7DMCf0FjLjo/viewform",
  "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif",
  "type": "link",
  "id": "266736903421190_567231850038359",
  "description": "By signing up for CETSA, you will gain exclusive access to our internal network and tech entities such as Pandora, Yelp, Andreessen Horowitz, Spoon Rocket, and SkyDeck."
}
len(cal_cs_feed)
494
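A single `get_connections` call returns at most one page of results (note the count above is below the 500 limit); the Graph API exposes a `paging.next` URL for fetching more. Here is a hedged sketch of cursor-following; the `fetch` callable is a stand-in for something like `lambda url: requests.get(url).json()`, and is mocked below so no network access is needed:

```python
def fetch_all_pages(first_page, fetch):
    '''Accumulates 'data' across Graph API pages by following paging.next.

    first_page: a decoded JSON response dict.
    fetch: maps a URL to the next decoded response dict.
    '''
    data = list(first_page.get('data', []))
    page = first_page
    while 'paging' in page and 'next' in page['paging']:
        page = fetch(page['paging']['next'])
        data.extend(page.get('data', []))
    return data

# Mocked two-page response for illustration
pages = {'url2': {'data': [{'id': '3'}]}}  # last page has no paging.next
first = {'data': [{'id': '1'}, {'id': '2'}], 'paging': {'next': 'url2'}}
print(len(fetch_all_pages(first, pages.__getitem__)))  # -> 3
```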
def print_feed(feed):
'''
Prints out every post, along with its comments, in a feed
'''
for post in feed:
if 'message' in post:
msg = strip(to_ascii(post['message']))
print 'POST:', msg, '\n'
print 'COMMENTS:'
if 'comments' in post:
for comment in post['comments']['data']:
if 'message' in comment:
comment = strip(to_ascii(comment['message']))
if comment is not None and comment != '':
print '+', comment
print '-----------------------------------------\n'
print_feed(cal_cs_feed[10:12])
POST: Has anyone taken Stats 133? What is it like?

COMMENTS:
+ Nope.
+ I took it. Waste of time for a cs major. Prepare to learn nothing new
+ Hmmm thanks then haha
-----------------------------------------

POST: http://www.lrb.co.uk/v35/n03/rebecca-solnit/diary I think it's important to think about the social implications of science and technology, especially to those who are or will be involved in it (like us). In the attached article, Rebecca Solnit discusses the tech boom in Silicon Valley and what it means for those who benefit from it and those who don't. Might not be the shortest read, but it's well worth it.

COMMENTS:
-----------------------------------------
def find_link(post):
'''
Finds the permanent link to a given post
'''
if 'actions' in post:
actions = post['actions']
for action in actions:
if 'link' in action:
return action['link']
return ''
def save_feed(feed):
'''
Saves the input feed in a Python list for later processing
    Also strips whitespace and normalizes encoding along the way
'''
posts = []
for post in feed:
if 'message' in post and 'actions' in post:
msg = strip(to_ascii(post['message']))
link = strip(to_ascii(find_link(post)))
posts.append((msg, link))
if 'comments' in post:
for comment in post['comments']['data']:
if 'message' in comment and 'actions' in comment:
msg = strip(to_ascii(comment['message']))
link = strip(to_ascii(find_link(comment)))
if msg is not None and msg != '':
posts.append((msg, link))
return posts
feed = save_feed(cal_cs_feed)
feed[30:35]
[("Pinching and zooming isn't the greatest for mobile but why is it that very few responsive websites allow you to zoom or change the font size? Really this seems like a big oversight /soapbox Are there any mobile frameworks that focus on good accessibility?",
  'https://www.facebook.com/266736903421190/posts/565404913554386'),
 ("I have an ASUS S400CA laptop, I tried to upgrade the RAM but can't get the computer to recognize it, even though I have flashed my bios to the newest version for my computer (209). Can anyone enlighten me on this matter?",
  'https://www.facebook.com/266736903421190/posts/565004033594474'),
 ("easy upper div cs class that's not 188, 160, 161, or 169?",
  'https://www.facebook.com/266736903421190/posts/564782570283287'),
 ("Hey all, I'm a junior transfer entering my second semester here at Cal, LnS CS. I've completed all pre-requisites outside of CS61C (and EE20 if you want to count it). I'm enrolled in CS61C and Stat134 next semester. I'm waitlist ~#300 in CS170 and ~#170 in CS164. I really don't have any other classes to take outside of upper division courses, but it seems like they are all full. Should I stick it out with these courses, or should I just start looking for anything that could be open? Any recommendations?",
  'https://www.facebook.com/266736903421190/posts/564577966970414'),
 ('for the cs major planning worksheet, am I supposed to start filling in the classes that I am currently taking right now? or next semester?',
  'https://www.facebook.com/266736903421190/posts/565153826912828')]
def bag_of_words_tfidf(lst):
'''
Constructs a bag of words model, where each document is a Facebook post/comment
    Also applies TFIDF weighting and lemmatization, and filters out stopwords
'''
model = Model(documents=[], weight=TFIDF)
for msg, link in lst:
doc = Document(msg, stemmer=LEMMA, stopwords=True, name=msg, description=link)
model.append(doc)
return model
def cosine_similarity(model, term, num=10):
'''
Finds the cosine similarity between the input document and each document in
the corpus, and outputs the best 'num' results
'''
doc = Document(term, stemmer=LEMMA, stopwords=True, name=term)
return model.neighbors(doc, top=num)
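`pattern` handles the TFIDF weighting and neighbor search internally. To make the scoring concrete, here is a minimal stdlib sketch of TFIDF vectors and cosine similarity on a toy corpus (function names are illustrative, not pattern's API):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    '''Maps each tokenized document to a {term: tf*idf} dict.'''
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(float(n) / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    '''Cosine similarity between two sparse {term: weight} vectors.'''
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(u) == 0 or norm(v) == 0:
        return 0.0
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    return dot / (norm(u) * norm(v))

docs = [['declare', 'cs', 'major'],
        ['waitlist', 'cs170'],
        ['declare', 'major', 'early']]
vecs = tfidf_vectors(docs)
# The two posts about declaring the major score higher than the unrelated pair
print(cosine(vecs[0], vecs[2]) > cosine(vecs[0], vecs[1]))  # -> True
```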
def process_similarity(result):
'''
Processes the result in a nicely formatted table
result is a tuple of length 2, where the first item is the similarity score,
and the second item is the document itself
'''
pt = PrettyTable(field_names=['Post', 'Sim', 'Link'])
pt.align['Post'], pt.align['Sim'], pt.align['Link'] = 'l', 'l', 'l'
[ pt.add_row([res[1].name[:45] + '...', "{0:.2f}".format(res[0]),
res[1].description]) for res in result ]
return pt
# Constructs the bag of words model.
# We don't need to call this function more than once, unless the corpus changed
bag_of_words = bag_of_words_tfidf(feed)
QUERY = 'declaring major early'
NUM_SEARCH = 10
sim = cosine_similarity(bag_of_words, QUERY, NUM_SEARCH)
print process_similarity(sim)
+--------------------------------------------------+------+----------------------------------------------------------------+
| Post                                             | Sim  | Link                                                           |
+--------------------------------------------------+------+----------------------------------------------------------------+
| What are the requirements for declaring the C... | 0.82 | https://www.facebook.com/266736903421190/posts/531887326906145 |
| In terms of petitioning for the major, I had ... | 0.23 | https://www.facebook.com/266736903421190/posts/554112774683600 |
| Hey guys, I'm an intended LSCS major. The LS-... | 0.23 | https://www.facebook.com/266736903421190/posts/537636692997875 |
| Is it true that LnS CS majors are relatively ... | 0.23 | https://www.facebook.com/266736903421190/posts/534359746658903 |
| Just for clarification... to declare early (w... | 0.17 | https://www.facebook.com/266736903421190/posts/554103854684492 |
| Hello CS Majors, Here's an updated graphic of... | 0.17 | https://www.facebook.com/266736903421190/posts/557605601000984 |
| I am currently a freshman and was trying to m... | 0.16 | https://www.facebook.com/266736903421190/posts/538359636258914 |
| with the new cs major requirement...cs162 is ... | 0.16 | https://www.facebook.com/266736903421190/posts/554548241306720 |
| So as an undeclared CS major finishing my pre... | 0.12 | https://www.facebook.com/266736903421190/posts/544901308938080 |
| hey guys! so i am a cog sci major cs minor, d... | 0.12 | https://www.facebook.com/266736903421190/posts/535700736524804 |
+--------------------------------------------------+------+----------------------------------------------------------------+
The ranking above is not bad at all, I think. The top result from this search system differs from the top result returned by FB's built-in search for the same query.
# Adapted and modified from https://gist.github.com/alexbowe/879414
sentence_re = r'''(?x) # set flag to allow verbose regexps
([A-Z])(\.[A-Z])+\.? # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
    | [][.,;"'?():_`-]    # these are separate tokens; '-' placed last so it isn't a range
'''
# Noun phrase chunker
grammar = r"""
# Nouns and Adjectives, terminated with Nouns
NBAR:
{<NN.*|JJ>*<NN.*>}
# Above, connected with preposition or subordinating conjunction (in, of, etc...)
NP:
{<NBAR>}
{<NBAR><IN><NBAR>}"""
chunker = nltk.RegexpParser(grammar)
# POS tagger - see tagger.py
tagger = tag.tagger()
def leaves(tree):
'''
Finds NP (nounphrase) leaf nodes of a chunk tree
'''
    for subtree in tree.subtrees(filter = lambda t: t.node == 'NP'):  # use t.label() on NLTK 3+
yield subtree.leaves()
def normalize(word):
'''
Normalizes words to lowercase and stems/lemmatizes it
'''
word = word.lower()
#word = stem(word)
word = strip(lemmatize(word), True)
return word
def acceptable_word(word):
'''
Checks conditions for acceptable word: valid length and no stopwords
'''
accepted = bool(2 <= len(word) <= 40
and word.lower() not in sw)
return accepted
def get_terms(tree):
'''
    Gets all the acceptable noun_phrase terms from the syntax tree
'''
for leaf in leaves(tree):
term = [normalize(w) for w, t in leaf if acceptable_word(w)]
yield term
def extract_noun_phrases(text):
'''
Extracts all noun_phrases from a given text
'''
toks = nltk.regexp_tokenize(text, sentence_re)
postoks = tagger.tag(toks)
# Builds a POS tree
tree = chunker.parse(postoks)
terms = get_terms(tree)
# Extracts Noun Phrase
noun_phrases = []
    for term in terms:
        np = ' '.join(term)
        if np != '':
            noun_phrases.append(np)
return noun_phrases
Successfully loaded POS tagger
def extract_feed(feed):
'''
Extracts popular topics (noun phrases) from a feed, and builds a simple
counter to keep track of the popularity
'''
topics = defaultdict(int)
for post, link in feed:
noun_phrases = extract_noun_phrases(post)
for np in noun_phrases:
if np != '':
topics[np] += 1
return topics
topics = extract_feed(feed)
c = Counter(topics)
c.most_common(20)
[('http', 89), ('class', 65), ('www', 63), ('semester', 63), ('c', 52), ('facebook', 41), ('phase', 40), ('ve', 38), ('course', 33), ('student', 30), ('guy', 29), ('re', 27), ('thanks', 26), ('61b', 25), ('math', 24), ('ee20', 21), ('people', 20), ('cs61b', 20), ('telebears', 19), ('question', 18)]
from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS
from pytagcloud.colors import COLOR_SCHEMES
from operator import itemgetter
def get_tag_counts(counter):
'''
Get the noun phrase counts for word cloud by first converting the counter to a dict
'''
return sorted(dict(counter).iteritems(), key=itemgetter(1), reverse=True)
def create_cloud(counter, filename):
'''
Creates a word cloud from a counter
'''
tags = make_tags(get_tag_counts(counter)[:80], maxsize=120,
colors=COLOR_SCHEMES['goldfish'])
create_tag_image(tags, './img/' + filename + '.png',
size=(900, 600), background=(0, 0, 0, 255),
layout=LAYOUT_HORIZONTAL, fontname='Lobster')
create_cloud(c, 'cloud_large')
for word in ["http", "www", "facebook"]:
topics[word] = 0
c = Counter(topics)
c.most_common(20)
[('class', 65), ('semester', 63), ('c', 52), ('phase', 40), ('ve', 38), ('course', 33), ('student', 30), ('guy', 29), ('re', 27), ('thanks', 26), ('61b', 25), ('math', 24), ('ee20', 21), ('people', 20), ('cs61b', 20), ('telebears', 19), ('question', 18), ('year', 17), ('61c', 17), ('waitlist', 17)]
create_cloud(c, 'cloud_large_1')
Part I:
Tried out a few different queries with the simple TFIDF search, and the results seem to be on par with FB Search. The ranking, as seen above, is different, however. More details are discussed in the final report.
Revisit and try out different algorithms, including stemming, spelling correction, and variants of TFIDF (maximum TF normalization, sublinear scaling, etc.): http://nlp.stanford.edu/IR-book/html/htmledition/variant-tf-idf-functions-1.html
Part II:
There is quite a bit of noise ('http', 'www', etc.) as well as words that don't really tell you anything new ('class', 'course', etc.). Is there a better way to filter them out than hardcoding them?
Perhaps Parts I and II can be combined, i.e. filter out words that appear across many documents (high document frequency, hence low IDF score)?
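That document-frequency idea can be sketched directly: drop any topic occurring in more than some fraction of posts. The function name and the threshold below are illustrative, not part of the pipeline above:

```python
from collections import Counter

def filter_common_topics(post_topics, max_df=0.3):
    '''Drops topics whose document frequency exceeds max_df.

    post_topics: a list of sets, one set of topics per post.
    Returns a Counter over the surviving topics.
    '''
    n = float(len(post_topics))
    # Document frequency: the number of posts each topic appears in
    df = Counter(t for topics in post_topics for t in topics)
    keep = set(t for t in df if df[t] / n <= max_df)
    counts = Counter()
    for topics in post_topics:
        counts.update(t for t in topics if t in keep)
    return counts

# 'http' appears in every post, so it is dropped automatically
posts = [{'http', 'telebears'}, {'http', '61b'},
         {'http', 'telebears'}, {'http', 'waitlist'}]
filtered = filter_common_topics(posts, max_df=0.5)
print(filtered.most_common(1))  # -> [('telebears', 2)]
```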
The overall result seems consistent with what most CS students usually discuss on FB ('telebears', '61B', etc.)
Overall a great project to work on and extend in the future!