Preddict it!: A Data Science Final Project

By Jay Sayre, Serguei Balanovich, Sebastian Gehrmann

Please visit our website for more information

Thursday, December 12, 11:59pm


In [1]:
%matplotlib inline

import json

import numpy as np
import copy
import pandas as pd
import networkx as nx
import requests
import scipy
from pattern import web
import matplotlib.pyplot as plt
import matplotlib.pylab as plt2
from scipy.stats import pearsonr
from datetime import datetime
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity
from myalchemy import MyAlchemy



from sklearn import svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.utils.extmath import density
from sklearn import metrics

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'

def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

Project Outline

"P/r/eddict It!" is the final project by Jay Sayre, Serguei Balanovich, and Sebastian Gehrmann created for the course CS109 - Data Science at Harvard University (cs109.org).

The course covered many interesting topics in data science, from scraping, cleaning, and visualizing data to constructing statistical models that predict outcomes or make valuable suggestions. One lesson briefly discussed the algorithm that generates the front page of the social media website Reddit. This caught our attention: Reddit lets users submit links, discuss them, and vote posts "up" or "down", moving overall scores in a way that allows posts to go viral quickly. Because of this, Reddit seemed like a very interesting place from which to construct a large dataset and analyze it for trends in posts, scores, and comments. To generate the front page and bring the most viral posts to the top, Reddit employs a ranking algorithm that combines posting time and score to determine the order in which content is shown.
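
To make this concrete, here is a minimal sketch of that ranking, based on the "hot" formula from the version of Reddit's ranking code that was open source at the time; the exact constants (the 2005 epoch offset and the 45000-second divisor) should be treated as assumptions rather than a statement of how Reddit ranks posts today.

from datetime import datetime
from math import log10

def hot(ups, downs, date):
    """Sketch of the published 'hot' ranking: the log of the score plus a
    bonus that grows with posting time, so newer posts need a smaller score
    to outrank older ones. Constants taken from the formerly open-source
    ranking code; treat the exact values as assumptions."""
    score = ups - downs
    order = log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    seconds = (date - datetime(1970, 1, 1)).total_seconds() - 1134028003
    return round(sign * order + seconds / 45000.0, 7)

print hot(100, 10, datetime(2013, 12, 1))  # a recent, well-received post
print hot(1000, 10, datetime(2013, 6, 1))  # an older post needs a much larger score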

An example of a successful post that gained national attention was actually one of the visualizations from Homework 5, thanks to its unusually high score in the DataIsBeautiful subreddit. Due to this attention, almost all of the big media websites, Yahoo among them, covered the visualization, which shows the bipartisanship in the US Senate.

We found it really interesting that a single post could generate such huge media attention and wondered whether it was possible to predict or approximate that kind of success. Since Reddit is split up into a variety of different topic-based forums known as subreddits, each of which has its own culture and community, we needed to take a subset of them and investigate on an individual basis what made a given post successful, both compared to other posts in the same subreddit and compared to posts in general. There are two types of posts on Reddit: self posts, which consist solely of text, and link posts, which have only a title and link out to some website or image. Given the nature of this project and our interest in the textual analysis of Reddit posts, we chose to concentrate on textual posts and to see whether we could predict success using more advanced classification techniques than those used throughout the course.

This process book shows only the most important code we needed to build our models and understand the data. There are 7 other IPython notebooks containing our experimental code, data scraping that takes far too long to rerun (sometimes upwards of 30 hours for a single cell), or code that was useful but extraneous. To keep the process book concise and readable, we have separated all of that out, but we have placed markers throughout this file to indicate where the additional notebooks are worth a look. Finally, as a word of caution, we advise against running this notebook with less than 1 GB of free RAM, as it will take up much of the computer's resources.

Finding subreddits

We begin by constructing a mental model of Reddit and considering the most important elements we will need to collect before proceeding with statistical analysis. There are more than 278,000 subreddits on Reddit, and each has its own rules and community, which means that each requires a different approach to predicting a post's success. Depending on the size of the community, a different score or number of comments will count as success. For instance, a subreddit dedicated to jokes will have much different indicators of success than a subreddit dedicated to science.

We wanted to compile a list of roughly 10 largely text-based subreddits that have large communities and cover diverse topics. We found most of the subreddits we will look at here.

This is our list:

  • explainlikeimfive, a subreddit where people ask questions and get answers accessible to laypeople
  • AskReddit, a subreddit where people ask questions about anything. This is probably the largest and most active subreddit on Reddit
  • TalesFromTechsupport, a subreddit where people working in tech support tell stories from their job
  • talesFromRetail, which originated from talesfromtechsupport. People there work in retail and tell stories from their job.
  • pettyrevenge, where people tell stories about petty revenge they have taken
  • askhistorians, where you can ask historians about anything related to history
  • askscience, where you can ask scientists about anything related to science
  • tifu, which is short for "today i f***ed up", where people tell stories about how they did exactly that
  • nosleep, where people tell scary stories
  • jokes, a subreddit dedicated to jokes
  • atheism, where people who are atheists discuss religious topics
  • politics, a subreddit about political news and discussions

Getting data

Reddit offers an API to access the site, which can be found here. We used it to download all the data we could for our list of subreddits. Reddit allows access to the top 1,000 posts for each of the following listings: top (all / week / day), new, and hot.

At first we used praw, a wrapper for the API, to download our data. We found it very slow, which was problematic given the sheer amount of data we required, so we instead wrote our own functions to access the Reddit API. Ultimately, our scraping code downloaded a total of roughly 44,000 submissions across all subreddits. Reddit's API has some constraints we ran into: it only serves 25-100 posts per call, and each request must be spaced at least 2 seconds apart.
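
The actual scraping code lives in redditscraping.ipynb; the sketch below only illustrates the kind of paginated request loop involved, using Reddit's public JSON listings with a 100-post page size and the required 2-second pause between calls. The User-Agent string and page count are placeholders.

import time
import requests

def fetch_listing(subreddit, listing='top', t='all', pages=10):
    """Sketch of a paginated call against the public Reddit JSON API:
    up to 100 posts per request, following the 'after' cursor, with a
    2-second wait between requests."""
    posts, after = [], None
    headers = {'User-Agent': 'cs109-preddict-example'}  # placeholder
    for _ in range(pages):
        params = {'limit': 100, 't': t}
        if after:
            params['after'] = after
        url = 'http://www.reddit.com/r/%s/%s.json' % (subreddit, listing)
        resp = requests.get(url, params=params, headers=headers).json()
        posts.extend(child['data'] for child in resp['data']['children'])
        after = resp['data']['after']
        if after is None:  # no further pages available
            break
        time.sleep(2)      # respect the 2-second spacing requirement
    return posts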

The Big Concern

We were hoping initially that we could take all of the entries for a given subreddit in the "top all" section, i.e. the top-scoring posts (where score = upvotes - downvotes) in the subreddit for all time. However, we discovered that Reddit's server will only deliver the top 1,000 results for any section, which introduced some problems. Moreover, Reddit will not serve more than this limit on the website either, so manually scraping the data was not an option. To work around this, we scraped all the available posts in each section provided by the API: "top day", "top week", "top all", "hot", and "new". The algorithm used to rank hot posts was discussed in class; "new" simply gives us the newest 1,000 (or fewer) posts in the subreddit. We scraped from these different sections in order to see both high- and low-scoring posts. This method is admittedly imperfect: the "correct" way to do this would be to follow new posts over a long period (a few months or more) and build a time series of how each post's score changes. Given our limited time, we settled for this method. Ultimately, there was a lot of overlap between the sections, giving us fewer unique entries than we had thought - about 26,000. Fortunately, though, posts usually come into popularity within 72 hours or so, so it is unlikely that we misclassified posts that might become popular later on. It is always possible, but it does not seem to be a huge, overarching concern.
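
Because the same submission can appear in several of these listings, we de-duplicate on the post id when combining them. A toy illustration of the idea (the real merge runs over the scraped CSV files in redditscraping.ipynb):

import pandas as pd

# Toy frame: the submission with id 'abc12' was scraped from both the
# "top all" and "hot" listings, so it appears twice.
scraped = pd.DataFrame({
    'id':    ['abc12', 'def34', 'abc12', 'ghi56'],
    'type':  ['top_all', 'top_all', 'hot', 'new'],
    'score': [2400, 310, 2400, 5]})

unique_posts = scraped.drop_duplicates(subset='id')  # older pandas: cols='id'
print "rows scraped:", len(scraped), "| unique submissions:", len(unique_posts)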

Cleaning the data

After spending much time trying to perform some rudimentary data analysis and visualization, we noticed that we were getting numerous errors for reasons we could not immediately explain. After trying to clean the offending elements out of our dataset by hand, we ultimately decided that it would be best to clean all of the data systematically. Although the Reddit API worked fine most of the time, titles were sometimes stored as data types other than strings, and scores were not always stored as ints or floats. We had to drop the offending posts from our dataset and convert the incorrect types to the appropriate ones before proceeding.

Additionally, we did not want to look at moderator posts, because they are not community-driven and do not behave the same way as normal posts; they stand out among the others and are more successful by nature, and there was no reason to let them influence our model. We also removed posts with media in them, since we wished to examine only the effect of text upon posts, not links or images, unlike what these Stanford researchers did. If you would like to see all of the small changes we made to clean the dataset, please refer to the code in the notebook redditscraping.ipynb.
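
A minimal sketch of this cleaning pass, assuming the scraped table uses the Reddit API field names ('title', 'score', 'distinguished' for moderator posts, 'media' for embedded media); whether every dump carries all of these columns is an assumption, and the full version lives in redditscraping.ipynb.

def clean_posts(df):
    """Sketch only: drop rows with non-string titles or non-numeric scores,
    then remove moderator ('distinguished') posts and posts with media."""
    # keep only rows whose title is actually a string
    df = df[df['title'].apply(lambda t: isinstance(t, basestring))]
    # keep only rows whose score is numeric, then store it as a float
    df = df[df['score'].apply(lambda s: isinstance(s, (int, float)))]
    df['score'] = df['score'].astype(float)
    # drop moderator posts and posts that embed media
    df = df[df['distinguished'].isnull()]
    df = df[df['media'].isnull()]
    return df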

Adding to our data

Even after we had downloaded all of the posts we wanted to look at, much important information was still missing. To compile a single comprehensive dataset, we had to go through the following three steps.

Part 1 - merging files

Since we downloaded data from different subreddits and only got 1,000 entries at a time, we ended up with a large number of .csv files. The first step was to merge these by opening them sequentially and saving them into a single large table (and ultimately a single CSV file).
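
A sketch of this merge step, assuming the per-subreddit dumps are collected under a single directory (the path and file pattern here are hypothetical; Data/full.csv is the combined file used later in this notebook):

import glob
import pandas as pd

files = glob.glob('Data/scraped/*.csv')               # hypothetical location
frames = [pd.read_csv(f, encoding='utf-8') for f in files]
big_table = pd.concat(frames, ignore_index=True)      # one large table
big_table.to_csv('Data/full.csv', encoding='utf-8', index=False)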

Part 2 - downloading extra information

Although many different fields were collected in the original download, we were still missing some information. For instance, we had to make separate API calls for the karma of each user to check whether there was some valuable correlation there that we could later use. For this reason, it was necessary to expand the table and download this information into it.
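
A sketch of the per-author lookup, using Reddit's public about.json endpoint; the author list and User-Agent below are placeholders, and the 2-second pause again respects the API's rate limit.

import time
import requests

def fetch_karma(username):
    """Return an author's (comment karma, link karma) from the public
    about.json endpoint."""
    url = 'http://www.reddit.com/user/%s/about.json' % username
    about = requests.get(url, headers={'User-Agent': 'cs109-preddict-example'}).json()
    return about['data']['comment_karma'], about['data']['link_karma']

karma = {}
for author in ['example_author']:   # placeholder author list
    karma[author] = fetch_karma(author)
    time.sleep(2)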

Part 3 - downloading comments

For each of our 44,000 submissions we also wanted to look at and predict the scores of comments. The text of the comments is often related to the text of the post, and we wanted to add the top-ranked comments to our text analysis to ensure that we got as much data about every post as possible at the scraping stage. The Reddit API allowed us to download the top 200 comments for each post; we did this and merged all the comments of each subreddit into one file. If you would like to see the code for this portion of the project, please refer to the notebook datapreparation.ipynb.
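
The comment download follows the same pattern; below is a sketch built on the public comments endpoint (the real code, including the merge into per-subreddit files, is in datapreparation.ipynb).

import requests

def fetch_top_comments(post_id, limit=200):
    """Sketch: the comments endpoint returns a two-element listing, the
    second of which holds the comment tree; keep (score, body) pairs for
    actual comments (kind 't1') and skip 'more comments' stubs."""
    url = 'http://www.reddit.com/comments/%s.json' % post_id
    resp = requests.get(url, params={'sort': 'top', 'limit': limit},
                        headers={'User-Agent': 'cs109-preddict-example'}).json()
    comments = []
    for child in resp[1]['data']['children']:
        if child['kind'] == 't1':
            comments.append((child['data']['score'], child['data']['body']))
    return comments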

Alchemy

We found an API for a natural language processing service called Alchemy. Alchemy offers a wide range of features for text analysis including sentiment analysis and keyword extraction.

We wanted to use Alchemy's results to help with our analysis of the texts and titles of posts. The advantage is that we can not only look at the raw text but actually analyze it without much effort, using either the "concepts" or the "keywords" of posts for further text analysis that would have been impossible with just the raw titles. Though this service did nothing for us at first, by the end of the project we found a great application for the Alchemy keywords.

Alchemy has a class designed for use in Python, which unfortunately was also quite slow. Due to the size of our data set we needed something with better performance, so we wrote our own class for it - the code for this class is in the python file myalchemy.py.
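
We do not reproduce myalchemy.py here, but a simplified sketch of the kind of request it wraps looks like the following; the endpoint name and parameters are based on AlchemyAPI's public documentation and should be treated as assumptions rather than a copy of our class.

import requests

def get_keywords(text, api_key):
    """Sketch of a keyword-extraction call against AlchemyAPI; returns the
    extracted keyword strings."""
    url = 'http://access.alchemyapi.com/calls/text/TextGetRankedKeywords'
    resp = requests.post(url, data={'apikey': api_key, 'text': text,
                                    'outputMode': 'json'}).json()
    return [kw['text'] for kw in resp.get('keywords', [])]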

Data Exploration

After gathering and cleaning all the data and running some basic experiments with Alchemy, we had all the necessary data and a basic enough understanding of its shape, sparsity, and nature to take our first in-depth look at it. The hypothesis we set out to test was that it is possible to predict a post's success given the history of previous posts and their properties. Thus, we initially set out to find correlations between the different attributes of posts to see if anything was obviously useful.

We began by looking at simple metrics. For instance, how many distinct authors are there in our data set?

In [2]:
big_table = pd.read_csv('Data/full.csv', encoding='utf-8')
big_table = big_table[big_table['author'] != "deleted"]
print "Number of posts: ", len(big_table)
print "Number of distinct authors: ", len(big_table .groupby('author'))
Number of posts:  43749
Number of distinct authors:  20044

We have roughly 44,000 posts by 20,000 authors, which means the average successful user has approximately two submissions in the lists we investigate. This suggests that there are probably users with many more successful posts; as in most communities, there will likely be power users with numerous successful submissions. To investigate this we look at the 10 most successful users and plot a histogram of the number of posts per individual user.

In [3]:
def get_author_stats():
    author_table = big_table.groupby('author')
    author_count = author_table['author'].count()
    author_count.sort()
    return author_count
author_count = get_author_stats()
author_count[-10:]
Out[3]:
author
Vladith            69
maxwellhill        74
UserName           75
mepper             78
shadowbanmeplz     81
AdelleChattre      83
FredFltStn         87
wattmeter          91
BurtonDesque      178
drewiepoodle      276
dtype: int64
In [4]:
plt.hist(author_count, bins = 20, log=True)
plt.title("Distribution of number of submissions")
remove_border()

We found similar results when plotting the number of comments and the scores of posts: only a very small fraction of posts and users are actually successful, but there are some outstanding individual outliers. You can find the other visualizations in the datavisualization.ipynb notebook.

Now, we try to measure the rate of successful users with more than one post. We can look at this ratio for every type of data that we have and see whether there are differences in the activity of people.

In [5]:
types = list(big_table['type'].unique())

'''
returns:
- the number of active users with more than one post
- the number of distinct authors
- the ratio of active/distinct users
for a subreddit
'''
def get_sub_stats(subreddit):
    author_table = subreddit.groupby('author')
    dist_authors = len(subreddit.groupby('author'))
    #print "Number of distinct authors: ", dist_authors
    successful_authors = subreddit[author_table.author.transform(lambda x: x.count() > 1).astype('bool')]
    authorset = set()
    for a in successful_authors.index:
        authorset.add(successful_authors.ix[a]['author'])
    active_users = len(authorset)
    #print "number of authors with more than 1 submission in the top 1000: ", active_users
    if dist_authors >0:
        succ_ratio = float(active_users) / dist_authors
    else:
        succ_ratio = 0
    return active_users, dist_authors, succ_ratio
    
#get the values for all types of data
authorstats = {}
for ctype in types:
    curr_df = big_table[big_table['type'] == ctype]
    authorstats[ctype] = get_sub_stats(curr_df)
del curr_df #reduce memory

'''
plots a scatterplot for a list of subreddit stats calculated before
X-Axis: Number of distinct users
Y-Axis: Success ratio
'''
def plot_author_success(successlist):
    xvals = [value[0] for key, value in successlist.iteritems()]
    yvals = [value[2] for key, value in successlist.iteritems()]
    labellist = [key for key, value in successlist.iteritems()]
    
    fig, ax = plt.subplots()
    ax.scatter(xvals, yvals)
    
    for i, txt in enumerate(labellist):
        ax.annotate(txt, (xvals[i],yvals[i]))
    plt.title("Active Users with their success rate")
    plt.xlabel("No. distinct users")
    plt.ylabel("fraction of users with multiple posts")
    remove_border()

plot_author_success(authorstats)

It is evident from this plot that the listings with the most active users are actually top_day and top_week. If post scores were random, one would expect top_day to have a success rate similar to new. However, there are some users who consistently provide higher-quality posts, which suggests that authors may have properties indicative of the success of their posts - as long as we manage to find these authors in the model later!

Now, we consider whether all of the subreddits we have looked at have the same fraction of successful users, or whether some subreddits have a more successful userbase. Perhaps some subreddits have smaller user bases, or perhaps power users or experienced users know how to get more points in their respective forums. An interesting point to explore indeed - we proceed to plot this for all of the subreddits we are studying.

In [6]:
subreddits = list(big_table['subreddit'].unique())
sr_stats = {}
for ctype in subreddits:
    curr_df = big_table[big_table['subreddit'] == ctype]
    sr_stats[ctype] = get_sub_stats(curr_df)
del curr_df #reduce memory
plot_author_success(sr_stats)
del sr_stats #reduce memory

It is clear from this plot that the story subreddits have more repeatedly successful users. We assume that this is due to a very active userbase whose members post frequently. The question subreddits seem to be far more random in which posts are successful - if we begin predicting subreddits separately, we should take this into account and probably start our focus on the less random subreddits.

The next step in the analysis of the data was to look at the combination of the two most important measurements of success - the number of comments and the score of a post. We begin by plotting this relationship for all the data in our dataset.

In [7]:
#regression line
m_fit,b_fit = plt2.polyfit(big_table.comments, big_table.score, 1) 
plt2.plot(big_table.comments, big_table.score, 'yo', big_table.comments, m_fit*big_table.comments+b_fit, color='purple', alpha=0.3) 
plt.title("Comments versus Score")
plt.xlabel("Comments")
plt.ylabel("Score")
plt.xlim(-10, max(big_table.comments) * 1.05)
plt.ylim(-10, max(big_table.score) * 1.05 )
remove_border()

It is evident that there exists a small linear correlation between the number of comments and the score. We can also see that the line of best fit does not really fit most posts; this is due to the high number of unsuccessful posts in the bottom-left part of the plot. There also seems to be a magical border around a score of 2000 - 2500 that posts rarely cross, no matter how many comments a post has or which subreddit it is in. We suspect this is because only a small fraction of the Reddit community actually participates in up- and downvoting and votes on every post; the majority rarely vote on anything except really great content, which is what would push a post past the 2500-score border. We might want to investigate this further later - what causes people who generally do not vote to vote on content.

This visualization, however, does not help us understand the correlation between score and comments, since the data we are looking at is very sparse and is missing all of the older posts with scores below 2,000. We cannot access those posts with the API, so we need to look at smaller slices of the data.

In addition to this graph, we plotted this relationship for all subreddits. These plots can be found in the notebook datavisualization.ipynb.

The next thing we will investigate is whether there is a correlation between the comments and the score of a post when both are very low, since much of our data consists of exactly this kind of post.

In [8]:
big_table_filtered = big_table[big_table['comments'] < 50] #only look at posts with <50 comments
big_table_filtered = big_table_filtered[big_table_filtered['score'] < 100] # and less than 100 score

plt.scatter(big_table_filtered.comments, big_table_filtered.score, alpha=0.2)
plt.title("Comments versus Score")
plt.xlabel("Comments")
plt.ylabel("Score")
plt.xlim(-1, max(big_table_filtered.comments) * 1.05)
plt.ylim(-1, max(big_table_filtered.score) * 1.05 )
remove_border()
del big_table_filtered

When visualizing the filtered data, with the successful posts removed, you can see that there seems to be no correlation between comments and score: almost every combination of the two exists in this range. Since we want to predict a post's success while it is still in the lower-left part of the chart, the ratio might not be an optimal indicator of whether a post can and will be successful.

The penultimate test we attempted in this domain, before moving on to our "bag-of-words" classifier work, was investigating whether it makes a difference if an author includes a descriptive text in the post. This so-called "self" text is shown when a reader opens the post to comment on it.

In [9]:
def split_selftext_DataFrame(df):
    '''
    returns a list with a 0 if a post has no selftext and a 1 if it does
    '''
    is_string_list = []
    for idx, record in df['selftext'].iteritems():
        if type(record) == float: # missing selftext is read in as NaN, which is a float
            is_string_list.append(0)
        else:
            is_string_list.append(1)
        
    return is_string_list

big_table['islink'] = split_selftext_DataFrame(big_table)

big_table_link = big_table[big_table['islink'] == 0]
big_table_self = big_table[big_table['islink'] == 1]

def plot_link_vs_self(table_link, table_self):
    '''
    plots a scatterplot of scores and comments for two different datasets
    '''
    p1 = plt.scatter(table_link.comments, table_link.score, color='red', alpha = 0.2)
    p2 = plt.scatter(table_self.comments, table_self.score, color='blue', alpha = 0.2)
    
    plt.legend([p1, p2], ["no self text", "self texts"])
    plt.title("Comments versus Score ")
    plt.xlabel("Comments")
    plt.ylabel("Score")
    plt.ylim(-10, 5000)
    plt.xlim(-10, 30000)
    remove_border()
    
plot_link_vs_self(big_table_link, big_table_self)
del big_table_link
del big_table_self

This visualization clearly shows that, within our set of subreddits, link posts achieve not only higher scores but also higher comment counts than the self posts - the ones we want to focus on. It does not, however, show any correlation between score and comments for either of them.

Some further visualizations that helped us understand our data on an even deeper level can be found in the notebook datavisualization.ipynb

Let's now begin some exploratory statistical analysis of the data. We found a paper by Stanford researchers that also investigated the predictability of scores. We will look later into everything they tried and what is possible with our dataset.

First we tried some simple linear regression models on the data, beginning with the impact of a poster's karma and link karma upon the score of their posts. One would expect these measures to be somewhat correlated with score, but let's see just how much. We will plot this using log scales for both values.

In [10]:
logkrm = np.log(big_table['karma'])
loglinkkrm = np.log(big_table['link_karma'])
logscore = np.log(big_table['score'])

plt.scatter(logkrm, logscore, c='g')
plt.title("Karma versus Score - Both on a Logarithimic Scale")
plt.xlabel("Karma (Log)")
plt.ylabel("Score (Log)")
plt.xlim(-0.5, 16)
plt.ylim(-0.5, 10)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['karma'], big_table['score'])
print "Pearson coefficient is " + str(r_row) + " with a p-value of " + str(p_value)

plt.scatter(loglinkkrm, logscore, c='g')
plt.title("Link Karma versus Score - Both on a Logarithimic Scale")
plt.xlabel("Link Karma (Log)")
plt.ylabel("Score (Log)")
plt.xlim(-0.5, 16)
plt.ylim(-0.5, 10)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['link_karma'], big_table['score'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)

del logkrm, loglinkkrm, logscore
Pearson coefficient is 0.0868347116634 with a p-value of 5.50788323065e-74

Pearson r coefficient is 0.159307162395 with a p-value of 1.51416044271e-246

Based on these results, we were skeptical of the effect that either karma score has upon the score of a post. As we found out earlier, there seem to be a few "power users" that have a lot of posts with high scores, but they also have quite a few low-scoring posts. One thing to keep in mind is that this is for all subreddits, and the result could be larger or smaller for any individual subreddit. Another possibility is that the "power users" are deleted, null, or lost because of some other complication. It is quite clear, however, that there does not seem to be a failsafe way to be a consistently successful user - even if most of the posts you create become successful, the tendency for some posts to flop will always outweigh this, and the almost square-shaped cloud of points above reaffirms this point. Another thing to check, if we ultimately want to build a multiple regression model, is whether karma and link karma are at all correlated. Let's check this.

In [11]:
r_row, p_value = pearsonr(big_table['karma'], big_table['link_karma'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
Pearson r coefficient is 0.0810603052473 with a p-value of 1.11139557717e-64

Interestingly, this is not as highly correlated as one might expect, and the predictive value of both measurements upon the score of one's post seems quite low. Therefore, we do not think that there will be any evident gains from including these statistics in our final model.

We still haven't considered the title length in our data exploration. Maybe this will give us more information.

In [12]:
# Compute each post's title length as a new column
big_table['length'] = big_table['title'].apply(lambda t: len(str(t)))
    
plt.scatter(big_table['length'], big_table['score'], c='g')
plt.title("Post Title Length versus Post Score")
plt.xlabel("Title Length")
plt.ylabel("Score")
plt.xlim(0, 300)
plt.ylim(0, 9000)
remove_border()
plt.show()
r_row, p_value = pearsonr(big_table['length'], big_table['score'])
print "Pearson r coefficient is " + str(r_row) + " with a p-value of " + str(p_value)
    
Pearson r coefficient is 0.194747106686 with a p-value of 0.0

From the plot, it seems as though the most popular posts tend to have slightly shorter titles, even though the overall Pearson correlation between title length and score is weakly positive; either way, it is safe to say that title length has only limited explanatory power.

Now, let's do a first pass at modeling the effect of time upon a post's score, which to begin we will model with a simple linear regression. Reddit stores the creation time as a UNIX timestamp in Coordinated Universal Time. We can convert this to a usable format, find the most recent post in the data set, and create a new variable that measures how many days before that last post each submission was made.

We would like to credit this Stack Exchange post for help with the conversion and this post for checking the time between two dates.

In [13]:
#p =datetime.utcfromtimestamp(messwith)#.strftime('%Y-%m-%d %H:%M:%S') #Year, Month, Day, Hour, Minute, Second format

dates = list(big_table['time_created'])

#Function to return the time between dates
def convertdate(dates, which):
    dts = []
    for date in dates:
        dts.append(datetime.utcfromtimestamp(date))
    currenttime = datetime.now()
    until = max(dts)
    days = []
    hrs = []
    for date in dts:
        days.append((until-date).days)
        hrs.append((until-date).total_seconds()/3600.0)
    #print "Last post in the data set has a date/time of", until.strftime('%Y-%m-%d %H:%M:%S')
    if which == 'days':
        return days
    elif which == 'hours':
        return hrs
    else:
        print 'Enter days or hours'

big_table['daysfrom'] = convertdate(dates, 'days')
big_table['hoursfrom'] = convertdate(dates, 'hours')

# Color each scatter plot point according to subreddit type
df = big_table

#Set the colors of each category for a nicer looking graph
colors = ['c', 'g', 'y', 'b', 'r', 'm', 'k', 'w']

talldf = df[df['type'] == types[0]]
talldf['color'] = colors[0]
tallcol = list(talldf['color'])
newdf = df[df['type'] == types[1]]
newdf['color'] = colors[1]
newcol= list(newdf['color'])
hotdf = df[df['type'] == types[2]]
hotdf['color'] = colors[2]
hotcol= list(hotdf['color'])
tweekdf = df[df['type'] == types[3]]
tweekdf['color'] = colors[3]
tweekcol= list(tweekdf['color'])
tdaydf = df[df['type'] == types[4]]
tdaydf['color'] = colors[4]
tdaycol= list(tdaydf['color'])

#Plot time vs. score

tall = plt.scatter(talldf['daysfrom'], talldf['score'], c=tallcol)
new = plt.scatter(newdf['daysfrom'], newdf['score'], c=newcol)
hot = plt.scatter(hotdf['daysfrom'], hotdf['score'], c=hotcol)
tweek = plt.scatter(tweekdf['daysfrom'], tweekdf['score'], c=tweekcol)
tday = plt.scatter(tdaydf['daysfrom'], tdaydf['score'], c=tdaycol)
plt.title("Post Date (in Days posted before November, 11 2013) versus Post Score")
plt.xlabel("Number of Days posted before last post date")
plt.ylabel("Score (Upvotes-Downvotes)")
plt.xlim(0, 2100)
plt.ylim(0, 9000)
plt.legend((tall, new, hot, tweek, tday),
           ('Top all', 'New', 'Hot', 'Top Weekly', 'Top Day'),
           loc='upper right')
remove_border()
plt.show()
r_row, p_value = pearsonr(talldf['length'], talldf['score'])
print "Pearson r coefficient for top all is " + str(r_row) + " with a p-value of " + str(p_value)