Facebook hasn't been explored as much as Twitter, mostly due to the complexity (and poor documentation) of its API
No 140-character limit like Twitter's => fuller, more grammatically correct text for NLP analysis
Very little official support for Python (the official SDKs cover JS and PHP, among a couple of others) => great problem to crack
Finally... (I) have never done this before => should be interesting
Log in to your Facebook account and go to https://developers.facebook.com/tools/explorer/ to set permissions for, and obtain, an access token.
ACCESS_TOKEN = ''
SEARCH_LIMIT = 500 # facebook allows 500 max
import facebook # pip install facebook-sdk, not facebook
import os
import random
# Plotting
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
# Making request, pretty print stuff
import requests
import json
import simplejson as sj
from prettytable import PrettyTable
from collections import defaultdict, Counter
# NLP!
import string
import nltk
from nltk.corpus import stopwords
import tagger as tag
from nltk.metrics import edit_distance
from pattern.vector import Document, Model, TFIDF, LEMMA, KMEANS, HIERARCHICAL, COSINE
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()
table = string.maketrans("", "")
sw = stopwords.words('english')
def pp(o):
'''
A helper function to pretty-print Python objects as JSON
'''
print json.dumps(o, indent=2)
def to_ascii(unicode_text):
'''
Converts unicode text to ascii. Also removes newline \n \r characters
'''
return unicode_text.encode('ascii', 'ignore').\
replace('\n', ' ').replace('\r', '').strip()
def strip(s, punc=False):
'''
Strips punctuation and whitespace from a string
'''
if punc:
stripped = s.strip().translate(table, string.punctuation)
return ' '.join(stripped.split())
else:
return ' '.join(s.strip().split())
def lower(word):
'''
Lowercases a word
'''
return word.lower()
def lemmatize(word):
'''
Lemmatizes a word
'''
return lemmatizer.lemmatize(word)
def stem(word):
'''
Stems a word using the Porter Stemmer
'''
    return stemmer.stem(word)  # stem_word() is deprecated in newer NLTK releases
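Note that `string.maketrans("", "")` and the two-argument `str.translate` used above are Python 2 idioms. As an aside, a Python 3 equivalent of the punctuation-stripping helper could be sketched as follows (the name `strip3` is just illustrative):

```python
import string

# In Python 3, str.maketrans's third argument lists characters to delete
PUNC_TABLE = str.maketrans('', '', string.punctuation)

def strip3(s, punc=False):
    '''Optionally strips punctuation, then collapses whitespace.'''
    if punc:
        s = s.translate(PUNC_TABLE)
    return ' '.join(s.split())

print(strip3('  Hello,   world!  ', punc=True))  # -> Hello world
```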
The spelling corrector below is from Peter Norvig's famous blog post http://norvig.com/spell-correct.html.
It was tested out, but the final pipeline below does not apply spelling correction, since it seemed to decrease the search result scores.
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
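The corrector above is trained on `big.txt`; the same functions work with any corpus, so here is a minimal self-contained sanity check using a toy corpus in place of `big.txt` (the tiny corpus is purely illustrative):

```python
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    # Start every count at 1 so unseen words are never zero-probability
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one deletion, transposition, replacement, or insertion away
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Toy corpus standing in for big.txt
NWORDS = train(words('the quick brown fox spelling bee spelling test'))

def known(ws): return set(w for w in ws if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=NWORDS.get)

print(correct('speling'))  # one insertion away from 'spelling'
print(correct('quik'))     # one insertion away from 'quick'
```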
g = facebook.GraphAPI(ACCESS_TOKEN)
To retrieve a group's feed, you first need to obtain the group's ID. To my knowledge, Facebook doesn't offer any easy way to do that: 'View Page Source' is one option, but third-party services such as http://wallflux.com/facebook_id/ are much easier to use.
The example below uses the Berkeley CS Group https://www.facebook.com/groups/berkeleycs/
# Only needs to make connection once
cal_cs_id = '266736903421190'
cal_cs_feed = g.get_connections(cal_cs_id, 'feed', limit=SEARCH_LIMIT)['data']
pp(cal_cs_feed[15])
{
  "picture": "https://fbexternal-a.akamaihd.net/safe_image.php?d=AQALS2YIM9l4s9-y&w=154&h=154&url=http%3A%2F%2Flh3.googleusercontent.com%2Fuy3HQlPPel_qsIn6G8Za5SgyJBD6JIRx-uIEEFgF0XGbyHYeJQBPSSDpC4-wj-Ttdvg",
  "from": {
    "name": "Arjun Ghai",
    "id": "543681184"
  },
  "name": "CETSA Membership Form",
  "caption": "docs.google.com",
  "privacy": {
    "value": ""
  },
  "actions": [
    {
      "link": "https://www.facebook.com/266736903421190/posts/567231850038359",
      "name": "Comment"
    },
    {
      "link": "https://www.facebook.com/266736903421190/posts/567231850038359",
      "name": "Like"
    }
  ],
  "updated_time": "2013-12-08T22:48:10+0000",
  "to": {
    "data": [
      {
        "name": "Computer Science",
        "id": "266736903421190"
      }
    ]
  },
  "link": "https://docs.google.com/forms/d/1fb5lr77I0lLrtZuMmfuO_GWCMyQ-HxmZ7DMCf0FjLjo/viewform",
  "likes": {
    "paging": {
      "cursors": {
        "after": "NjI4NjQwMTMz",
        "before": "MTAwMDAwMjY2NjUyNTE5"
      }
    },
    "data": [
      {
        "id": "100000266652519",
        "name": "Sebastian Edward Shanus"
      },
      {
        "id": "628640133",
        "name": "Pavan Patel"
      }
    ]
  },
  "created_time": "2013-12-08T22:48:10+0000",
  "message": "Looking to put the hours you have spent understanding new material to good use and maybe make some money out of it. The Center for Entrepreneurship and Technology Student Association (CETSA) connects students to large corporations based here in the Bay Area. Fill out the membership form and see how CETSA can help grow your network and land you a job/internship. https://docs.google.com/forms/d/1fb5lr77I0lLrtZuMmfuO_GWCMyQ-HxmZ7DMCf0FjLjo/viewform",
  "icon": "https://fbstatic-a.akamaihd.net/rsrc.php/v2/yD/r/aS8ecmYRys0.gif",
  "type": "link",
  "id": "266736903421190_567231850038359",
  "description": "By signing up for CETSA, you will gain exclusive access to our internal network and tech entities such as Pandora, Yelp, Andreessen Horowitz, Spoon Rocket, and SkyDeck."
}
len(cal_cs_feed)
494
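A single `get_connections` call returns at most one page of results (note the count above is below the 500 limit); the Graph API exposes a `paging.next` URL for fetching more. Here is a hedged sketch of cursor-following; the `fetch` callable is a stand-in for something like `lambda url: requests.get(url).json()`, and is mocked below so no network access is needed:

```python
def fetch_all_pages(first_page, fetch):
    '''Accumulates 'data' across Graph API pages by following paging.next.

    first_page: a decoded JSON response dict.
    fetch: maps a URL to the next decoded response dict.
    '''
    data = list(first_page.get('data', []))
    page = first_page
    while 'paging' in page and 'next' in page['paging']:
        page = fetch(page['paging']['next'])
        data.extend(page.get('data', []))
    return data

# Mocked two-page response for illustration
pages = {'url2': {'data': [{'id': '3'}]}}  # last page has no paging.next
first = {'data': [{'id': '1'}, {'id': '2'}], 'paging': {'next': 'url2'}}
print(len(fetch_all_pages(first, pages.__getitem__)))  # -> 3
```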
def print_feed(feed):
'''
Prints out every post, along with its comments, in a feed
'''
for post in feed:
if 'message' in post:
msg = strip(to_ascii(post['message']))
print 'POST:', msg, '\n'
print 'COMMENTS:'
if 'comments' in post:
for comment in post['comments']['data']:
if 'message' in comment:
comment = strip(to_ascii(comment['message']))
if comment is not None and comment != '':
print '+', comment
print '-----------------------------------------\n'
print_feed(cal_cs_feed[10:12])
POST: Has anyone taken Stats 133? What is it like?

COMMENTS:
+ Nope.
+ I took it. Waste of time for a cs major. Prepare to learn nothing new
+ Hmmm thanks then haha
-----------------------------------------

POST: http://www.lrb.co.uk/v35/n03/rebecca-solnit/diary I think it's important to think about the social implications of science and technology, especially to those who are or will be involved in it (like us). In the attached article, Rebecca Solnit discusses the tech boom in Silicon Valley and what it means for those who benefit from it and those who don't. Might not be the shortest read, but it's well worth it.

COMMENTS:
-----------------------------------------
def find_link(post):
'''
Finds the permanent link to a given post
'''
if 'actions' in post:
actions = post['actions']
for action in actions:
if 'link' in action:
return action['link']
return ''
def save_feed(feed):
'''
Saves the input feed in a Python list for later processing
    Also strips whitespace and normalizes encoding along the way
'''
posts = []
for post in feed:
if 'message' in post and 'actions' in post:
msg = strip(to_ascii(post['message']))
link = strip(to_ascii(find_link(post)))
posts.append((msg, link))
if 'comments' in post:
for comment in post['comments']['data']:
if 'message' in comment and 'actions' in comment:
msg = strip(to_ascii(comment['message']))
link = strip(to_ascii(find_link(comment)))
if msg is not None and msg != '':
posts.append((msg, link))
return posts
feed = save_feed(cal_cs_feed)
feed[30:35]
[("Pinching and zooming isn't the greatest for mobile but why is it that very few responsive websites allow you to zoom or change the font size? Really this seems like a big oversight /soapbox Are there any mobile frameworks that focus on good accessibility?",
  'https://www.facebook.com/266736903421190/posts/565404913554386'),
 ("I have an ASUS S400CA laptop, I tried to upgrade the RAM but can't get the computer to recognize it, even though I have flashed my bios to the newest version for my computer (209). Can anyone enlighten me on this matter?",
  'https://www.facebook.com/266736903421190/posts/565004033594474'),
 ("easy upper div cs class that's not 188, 160, 161, or 169?",
  'https://www.facebook.com/266736903421190/posts/564782570283287'),
 ("Hey all, I'm a junior transfer entering my second semester here at Cal, LnS CS. I've completed all pre-requisites outside of CS61C (and EE20 if you want to count it). I'm enrolled in CS61C and Stat134 next semester. I'm waitlist ~#300 in CS170 and ~#170 in CS164. I really don't have any other classes to take outside of upper division courses, but it seems like they are all full. Should I stick it out with these courses, or should I just start looking for anything that could be open? Any recommendations?",
  'https://www.facebook.com/266736903421190/posts/564577966970414'),
 ('for the cs major planning worksheet, am I supposed to start filling in the classes that I am currently taking right now? or next semester?',
  'https://www.facebook.com/266736903421190/posts/565153826912828')]
def bag_of_words_tfidf(lst):
'''
Constructs a bag of words model, where each document is a Facebook post/comment
    Also applies TFIDF weighting and lemmatization, and filters out stopwords
'''
model = Model(documents=[], weight=TFIDF)
for msg, link in lst:
doc = Document(msg, stemmer=LEMMA, stopwords=True, name=msg, description=link)
model.append(doc)
return model
def cosine_similarity(model, term, num=10):
'''
Finds the cosine similarity between the input document and each document in
the corpus, and outputs the best 'num' results
'''
doc = Document(term, stemmer=LEMMA, stopwords=True, name=term)
return model.neighbors(doc, top=num)
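`pattern` handles the TFIDF weighting and neighbor search internally. To make the scoring concrete, here is a minimal stdlib sketch of TFIDF vectors and cosine similarity on a toy corpus (function names are illustrative, not pattern's API):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    '''Maps each tokenized document to a {term: tf*idf} dict.'''
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(float(n) / df[t]) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v):
    '''Cosine similarity between two sparse {term: weight} vectors.'''
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(u) == 0 or norm(v) == 0:
        return 0.0
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    return dot / (norm(u) * norm(v))

docs = [['declare', 'cs', 'major'],
        ['waitlist', 'cs170'],
        ['declare', 'major', 'early']]
vecs = tfidf_vectors(docs)
# The two posts about declaring the major score higher than the unrelated pair
print(cosine(vecs[0], vecs[2]) > cosine(vecs[0], vecs[1]))  # -> True
```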
def process_similarity(result):
'''
Processes the result in a nicely formatted table
result is a tuple of length 2, where the first item is the similarity score,
and the second item is the document itself
'''
pt = PrettyTable(field_names=['Post', 'Sim', 'Link'])
pt.align['Post'], pt.align['Sim'], pt.align['Link'] = 'l', 'l', 'l'
[ pt.add_row([res[1].name[:45] + '...', "{0:.2f}".format(res[0]),
res[1].description]) for res in result ]
return pt
# Constructs the bag of words model.
# We don't need to call this function more than once, unless the corpus changed
bag_of_words = bag_of_words_tfidf(feed)
QUERY = 'declaring major early'
NUM_SEARCH = 10
sim = cosine_similarity(bag_of_words, QUERY, NUM_SEARCH)
print process_similarity(sim)
+--------------------------------------------------+------+----------------------------------------------------------------+
| Post                                             | Sim  | Link                                                           |
+--------------------------------------------------+------+----------------------------------------------------------------+
| What are the requirements for declaring the C... | 0.82 | https://www.facebook.com/266736903421190/posts/531887326906145 |
| In terms of petitioning for the major, I had ... | 0.23 | https://www.facebook.com/266736903421190/posts/554112774683600 |
| Hey guys, I'm an intended LSCS major. The LS-... | 0.23 | https://www.facebook.com/266736903421190/posts/537636692997875 |
| Is it true that LnS CS majors are relatively ... | 0.23 | https://www.facebook.com/266736903421190/posts/534359746658903 |
| Just for clarification... to declare early (w... | 0.17 | https://www.facebook.com/266736903421190/posts/554103854684492 |
| Hello CS Majors, Here's an updated graphic of... | 0.17 | https://www.facebook.com/266736903421190/posts/557605601000984 |
| I am currently a freshman and was trying to m... | 0.16 | https://www.facebook.com/266736903421190/posts/538359636258914 |
| with the new cs major requirement...cs162 is ... | 0.16 | https://www.facebook.com/266736903421190/posts/554548241306720 |
| So as an undeclared CS major finishing my pre... | 0.12 | https://www.facebook.com/266736903421190/posts/544901308938080 |
| hey guys! so i am a cog sci major cs minor, d... | 0.12 | https://www.facebook.com/266736903421190/posts/535700736524804 |
+--------------------------------------------------+------+----------------------------------------------------------------+
The ranking above is not bad at all, I think. The top result from this search system differs from the top result returned by FB's built-in search for the same query.
# Adapted and modified from https://gist.github.com/alexbowe/879414
sentence_re = r'''(?x) # set flag to allow verbose regexps
([A-Z])(\.[A-Z])+\.? # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
    | [][.,;"'?():_`-]    # these are separate tokens; '-' placed last so it isn't a range
'''
# Noun phrase chunker
grammar = r"""
# Nouns and Adjectives, terminated with Nouns
NBAR:
{<NN.*|JJ>*<NN.*>}
# Above, connected with preposition or subordinating conjunction (in, of, etc...)
NP:
{<NBAR>}
{<NBAR><IN><NBAR>}"""
chunker = nltk.RegexpParser(grammar)
# POS tagger - see tagger.py
tagger = tag.tagger()
def leaves(tree):
'''
Finds NP (nounphrase) leaf nodes of a chunk tree
'''
    for subtree in tree.subtrees(filter = lambda t: t.node == 'NP'):  # use t.label() on NLTK 3+
yield subtree.leaves()
def normalize(word):
'''
Normalizes words to lowercase and stems/lemmatizes it
'''
word = word.lower()
#word = stem(word)
word = strip(lemmatize(word), True)
return word
def acceptable_word(word):
'''
Checks conditions for acceptable word: valid length and no stopwords
'''
accepted = bool(2 <= len(word) <= 40
and word.lower() not in sw)
return accepted
def get_terms(tree):
'''
    Gets all the acceptable noun_phrase terms from the syntax tree
'''
for leaf in leaves(tree):
term = [normalize(w) for w, t in leaf if acceptable_word(w)]
yield term
def extract_noun_phrases(text):
'''
Extracts all noun_phrases from a given text
'''
toks = nltk.regexp_tokenize(text, sentence_re)
postoks = tagger.tag(toks)
# Builds a POS tree
tree = chunker.parse(postoks)
terms = get_terms(tree)
# Extracts Noun Phrase
noun_phrases = []
    for term in terms:
        np = ' '.join(term)
        if np != '':
            noun_phrases.append(np)
return noun_phrases
Successfully loaded POS tagger
def extract_feed(feed):
'''
Extracts popular topics (noun phrases) from a feed, and builds a simple
counter to keep track of the popularity
'''
topics = defaultdict(int)
for post, link in feed:
noun_phrases = extract_noun_phrases(post)
for np in noun_phrases:
if np != '':
topics[np] += 1
return topics
topics = extract_feed(feed)
c = Counter(topics)
c.most_common(20)
[('http', 89), ('class', 65), ('www', 63), ('semester', 63), ('c', 52), ('facebook', 41), ('phase', 40), ('ve', 38), ('course', 33), ('student', 30), ('guy', 29), ('re', 27), ('thanks', 26), ('61b', 25), ('math', 24), ('ee20', 21), ('people', 20), ('cs61b', 20), ('telebears', 19), ('question', 18)]
from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS
from pytagcloud.colors import COLOR_SCHEMES
from operator import itemgetter
def get_tag_counts(counter):
'''
Get the noun phrase counts for word cloud by first converting the counter to a dict
'''
return sorted(dict(counter).iteritems(), key=itemgetter(1), reverse=True)
def create_cloud(counter, filename):
'''
Creates a word cloud from a counter
'''
tags = make_tags(get_tag_counts(counter)[:80], maxsize=120,
colors=COLOR_SCHEMES['goldfish'])
create_tag_image(tags, './img/' + filename + '.png',
size=(900, 600), background=(0, 0, 0, 255),
layout=LAYOUT_HORIZONTAL, fontname='Lobster')
create_cloud(c, 'cloud_large')
for word in ["http", "www", "facebook"]:
topics[word] = 0
c = Counter(topics)
c.most_common(20)
[('class', 65), ('semester', 63), ('c', 52), ('phase', 40), ('ve', 38), ('course', 33), ('student', 30), ('guy', 29), ('re', 27), ('thanks', 26), ('61b', 25), ('math', 24), ('ee20', 21), ('people', 20), ('cs61b', 20), ('telebears', 19), ('question', 18), ('year', 17), ('61c', 17), ('waitlist', 17)]
create_cloud(c, 'cloud_large_1')
Part I:
Tried out a few different queries with the simple TFIDF search, and the results seem to be on par with FB Search. The ranking, as seen above, is different, however. More details are discussed in the final report.
Revisit and try out different algorithms, including stemming, spelling correction, and variants of TFIDF (maximum TF normalization, sublinear scaling, etc.): http://nlp.stanford.edu/IR-book/html/htmledition/variant-tf-idf-functions-1.html
Part II:
There is quite a bit of noise ('http', 'www', etc.) as well as words that don't really tell you anything new ('class', 'course', etc.). Is there a better way to filter them out than hardcoding them?
Perhaps Parts I and II can be combined, i.e. filter out words that appear across many documents (high document frequency, hence low IDF score)?
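That document-frequency idea can be sketched directly: drop any topic occurring in more than some fraction of posts. The function name and the threshold below are illustrative, not part of the pipeline above:

```python
from collections import Counter

def filter_common_topics(post_topics, max_df=0.3):
    '''Drops topics whose document frequency exceeds max_df.

    post_topics: a list of sets, one set of topics per post.
    Returns a Counter over the surviving topics.
    '''
    n = float(len(post_topics))
    # Document frequency: the number of posts each topic appears in
    df = Counter(t for topics in post_topics for t in topics)
    keep = set(t for t in df if df[t] / n <= max_df)
    counts = Counter()
    for topics in post_topics:
        counts.update(t for t in topics if t in keep)
    return counts

# 'http' appears in every post, so it is dropped automatically
posts = [{'http', 'telebears'}, {'http', '61b'},
         {'http', 'telebears'}, {'http', 'waitlist'}]
filtered = filter_common_topics(posts, max_df=0.5)
print(filtered.most_common(1))  # -> [('telebears', 2)]
```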
The overall result seems consistent with what most CS students usually discuss on FB ('telebears', '61B', etc.)
Overall a great project to work on and extend in the future!