import nltk
import numpy as np
import pandas as pd
from csv import QUOTE_ALL
from datetime import datetime
from io import BytesIO
from random import choice, random, randrange, sample, randint
from urllib2 import urlopen
from zipfile import ZipFile
The purpose of this notebook is to show you how to generate a table with data that looks like comments on a blogpost (or posts to a forum if you're old like me). Although we could have generated purely random strings, we wanted the data to look as real as possible, so we make use of data published by the United States Census Bureau to simulate the entries based on their real probability of occurrence.
The generated data will take the form of a Pandas DataFrame with the following columns:
Name | Description |
---|---|
id | An autogenerated sequence number |
timestamp | Timestamp of the date and time when the post was made |
email | The e-mail of the user |
first_name | First name of the user |
last_name | Last name of the user |
place | Place of residence of the user |
text | The actual text of the post |
In order to generate meaningful text, we'll make use of the NLTK library to split a public-domain text into sentences and pick from them at random.
The modules imported at the top of this notebook are required to run it.
As we mentioned in the introduction, we'll use data from the US Census Bureau. For the place of residence of the user we'll take random entries from the 2013 Gazetteer files. Although this is not statistically appropriate, as the population of Lost Springs, Wyoming is quite different from that of New York City, we wanted to keep the post as simple as possible. In the next section we'll use a proper method for generating names that can also be adapted to places (you need to get the population estimates data sets for that to work).
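As an aside, population-weighted sampling (the statistically proper approach alluded to above) is easy to do with numpy, since numpy.random.choice accepts a probability vector. A minimal sketch with made-up population figures (these numbers are illustrative, not census data):

```python
import numpy as np

# Hypothetical places and populations (illustrative figures, not census data)
places = ['Lost Springs', 'New York City', 'Springfield']
populations = np.array([4.0, 8400000.0, 60000.0])

# Turn the populations into a probability vector that sums to 1
probabilities = populations / populations.sum()

# Draw places in proportion to their population; the big city dominates
draws = np.random.choice(places, size=1000, p=probabilities)
```

With real data, the probability vector would come from the population estimates data sets instead of the hard-coded numbers above.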
with ZipFile(BytesIO(urlopen('http://www2.census.gov/geo/gazetteer/2013_Gazetteer/2013_Gaz_place_national.zip').read())) as zip_file:
    gaz_place_national_2013_df = pd.read_csv(zip_file.open('2013_Gaz_place_national.txt'), sep='\t')
gaz_place_national_2013_df.head()
 | USPS | GEOID | ANSICODE | NAME | LSAD | FUNCSTAT | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AL | 100100 | 2582661 | Abanda CDP | 57 | S | 7764034 | 34284 | 2.998 | 0.013 | 33.091627 | -85.527029 |
1 | AL | 100124 | 2403054 | Abbeville city | 25 | A | 40255362 | 107642 | 15.543 | 0.042 | 31.564689 | -85.259124 |
2 | AL | 100460 | 2403063 | Adamsville city | 25 | A | 65064187 | 29719 | 25.121 | 0.011 | 33.605750 | -86.974650 |
3 | AL | 100484 | 2405123 | Addison town | 43 | A | 9753292 | 83417 | 3.766 | 0.032 | 34.202681 | -87.178004 |
4 | AL | 100676 | 2405125 | Akron town | 43 | A | 1776164 | 13849 | 0.686 | 0.005 | 32.879495 | -87.741679 |
Notice that the state of each place appears abbreviated. If we want the full name, we can make use of the ANSI State Codes provided by the US Census Bureau.
state_df = pd.read_csv(urlopen('http://www.census.gov/geo/reference/docs/state.txt'), sep='|', dtype={'STATE': 'str'})
state_df.head()
 | STATE | STUSAB | STATE_NAME | STATENS
---|---|---|---|---
0 | 01 | AL | Alabama | 1779775 |
1 | 02 | AK | Alaska | 1785533 |
2 | 04 | AZ | Arizona | 1779777 |
3 | 05 | AR | Arkansas | 68085 |
4 | 06 | CA | California | 1779778 |
places_df = pd.merge(gaz_place_national_2013_df, state_df[['STATE_NAME', 'STUSAB']], left_on='USPS', right_on='STUSAB')[['USPS', 'NAME', 'STATE_NAME']]
places_df.head()
 | USPS | NAME | STATE_NAME
---|---|---|---
0 | AL | Abanda CDP | Alabama |
1 | AL | Abbeville city | Alabama |
2 | AL | Adamsville city | Alabama |
3 | AL | Addison town | Alabama |
4 | AL | Akron town | Alabama |
For the names of the people we'll use the 1990 (for given names) and 2000 (for last names) census data. This time we won't be choosing entries willy-nilly: we want the first and last name frequencies to mimic what happens in real life. To do that, we'll build a cumulative frequency distribution for each data set. To save some memory, we'll only take the top 50 first and last names.
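The sampling mechanism behind the cumulative frequency distribution is simple: draw a uniform random number and pick the first entry whose cumulative frequency exceeds it, so common names are picked proportionally more often. A minimal sketch with a made-up three-name distribution (the names and weights here are illustrative, not census figures):

```python
import random

# Toy frequency table: names with made-up relative weights
names = ['SMITH', 'JOHNSON', 'WILLIAMS']
weights = [5.0, 3.0, 2.0]

# Build the cumulative frequency distribution, normalized so it ends at 1.0
total = sum(weights)
cfd = []
acc = 0.0
for w in weights:
    acc += w / total
    cfd.append(acc)

def sample_name():
    r = random.random()
    # Pick the first name whose cumulative frequency exceeds r
    for name, bound in zip(names, cfd):
        if r < bound:
            return name
    return names[-1]  # guard against floating-point edge cases

# Over many draws the sample frequencies approach the weights
counts = {name: 0 for name in names}
for _ in range(10000):
    counts[sample_name()] += 1
```

This is exactly what the boolean-mask lookups against the `cfd` column do later on, just spelled out with plain Python lists.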
with ZipFile(BytesIO(urlopen('https://www.census.gov/genealogy/www/data/2000surnames/names.zip').read())) as zip_file:
    app_c_df = pd.read_csv(zip_file.open('app_c.csv'))
app_c_df_50 = app_c_df[:50][['name', 'count']]
app_c_df_50['prop'] = app_c_df_50['count'] / float(app_c_df_50['count'].sum())
app_c_df_50['cfd'] = app_c_df_50['prop'].cumsum()
dist_female_first_df = pd.read_fwf(
    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.female.first'),
    colspecs=((0, 15), (15, 20), (21, 27), (28, 35)), header=None,
    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')
)
dist_female_first_df_50 = dist_female_first_df[:50][['name', 'freq_in_percent']]
dist_female_first_df_50['prop'] = dist_female_first_df_50['freq_in_percent'] / dist_female_first_df_50['freq_in_percent'].sum()
dist_female_first_df_50['cfd'] = dist_female_first_df_50['prop'].cumsum()
dist_female_first_df_50.head()
 | name | freq_in_percent | prop | cfd
---|---|---|---|---
0 | MARY | 2.629 | 0.088032 | 0.088032 |
1 | PATRICIA | 1.073 | 0.035930 | 0.123962 |
2 | LINDA | 1.035 | 0.034657 | 0.158619 |
3 | BARBARA | 0.980 | 0.032815 | 0.191435 |
4 | ELIZABETH | 0.937 | 0.031376 | 0.222810 |
dist_male_first_df = pd.read_fwf(
    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'),
    colspecs=((0, 15), (15, 20), (21, 27), (28, 35)), header=None,
    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')
)
dist_male_first_df_50 = dist_male_first_df[:50][['name', 'freq_in_percent']]
dist_male_first_df_50['prop'] = dist_male_first_df_50['freq_in_percent'] / dist_male_first_df_50['freq_in_percent'].sum()
dist_male_first_df_50['cfd'] = dist_male_first_df_50['prop'].cumsum()
dist_male_first_df_50.head()
 | name | freq_in_percent | prop | cfd
---|---|---|---|---
0 | JAMES | 3.318 | 0.070376 | 0.070376 |
1 | JOHN | 3.271 | 0.069379 | 0.139754 |
2 | ROBERT | 3.143 | 0.066664 | 0.206418 |
3 | MICHAEL | 2.629 | 0.055762 | 0.262180 |
4 | WILLIAM | 2.451 | 0.051986 | 0.314166 |
In order to mimic what real blogs or sites usually have, we'll generate a user table and then choose one of its entries as the author of each simulated post. First we'll generate a list of first names (with an equal distribution of males and females) and last names (yes, I know that first and last names are correlated if we take ethnic origin into account, but we'll ignore that fact).
users = []
emails = set()
email_domains = ('@gmail.com', '@yahoo.com', '@hotmail.com', '@outlook.com', '@mail.com', '@inbox.com', '@yandex.com')
for i in range(500):
    user = dict()
    # Name
    random_lastname = random()
    user['last_name'] = app_c_df_50[random_lastname < app_c_df_50.cfd].iloc[0]['name'].capitalize()
    random_gender = random()
    random_name = random()
    if random_gender < 0.5:
        user['first_name'] = dist_female_first_df_50[random_name < dist_female_first_df_50.cfd].iloc[0]['name'].capitalize()
    else:
        user['first_name'] = dist_male_first_df_50[random_name < dist_male_first_df_50.cfd].iloc[0]['name'].capitalize()
    # E-mail
    email_domain = choice(email_domains)
    email = '{0}.{1}{2}'.format(user['first_name'].lower(), user['last_name'].lower(), email_domain)
    if email not in emails:
        user['email'] = email
    else:
        # Disambiguate duplicates with a zero-padded random hex suffix
        user['email'] = '{0}.{1}_{2:04x}{3}'.format(
            user['first_name'].lower(), user['last_name'].lower(), randrange(16**4), email_domain
        )
    emails.add(user['email'])
    # Place
    place = places_df.ix[np.random.choice(places_df.index.values)]
    user['place'] = '{0}, {1}'.format(place['NAME'], place['STATE_NAME']).title()
    users.append(user)
users_df = pd.DataFrame(users)
users_df.head()
 | email | first_name | last_name | place
---|---|---|---|---
0 | anna.davis@yahoo.com | Anna | Davis | Climax Springs Village, Missouri |
1 | richard.davis@inbox.com | Richard | Davis | Moapa Valley Cdp, Nevada |
2 | amanda.nelson@mail.com | Amanda | Nelson | County Center Cdp, Virginia |
3 | david.jackson@gmail.com | David | Jackson | Decatur City, Texas |
4 | carol.miller@yandex.com | Carol | Miller | Marshall City, Michigan |
In order to generate the text of the comments, we'll take random sentences from Mary Shelley's Frankenstein; Or, The Modern Prometheus, split into sentences with NLTK's sent_tokenize function. For the timestamps of the messages, we'll just pick a random point in time between September 4th, 1994 and January 4th, 1995 (right around the time [that other Frankenstein](http://www.imdb.com/title/tt0109836/) was released).
frankenstein_sentences = nltk.sent_tokenize(urlopen('http://www.gutenberg.org/ebooks/84.txt.utf-8').read().replace('\r\n', ' '))
start_datetime = datetime(year=1994,month=9,day=4).toordinal()
end_datetime = datetime(year=1995,month=1,day=4).toordinal()
comments = []
for i in range(1000):
    comment = dict()
    comment['timestamp'] = randrange(start_datetime, end_datetime)
    comment['text'] = ' '.join(sample(frankenstein_sentences, randint(1, 5)))
    user = users_df.ix[np.random.choice(users_df.index.values)]
    comment['email'] = user['email']
    comment['first_name'] = user['first_name']
    comment['last_name'] = user['last_name']
    comment['place'] = user['place']
    comments.append(comment)
comments_df = pd.DataFrame(sorted(comments, key=lambda p: p['timestamp']))
comments_df.index.name = 'id'
comments_df.head()
id | email | first_name | last_name | place | text | timestamp
---|---|---|---|---|---|---
0 | sharon.anderson@yandex.com | Sharon | Anderson | Springville Village, New York | "Felix had accidentally been present at the tr... | 728175 |
1 | barbara.nelson@inbox.com | Barbara | Nelson | Prince Cdp, West Virginia | She welcomed me with the greatest affection. T... | 728175 |
2 | timothy.wright@yahoo.com | Timothy | Wright | East Burke Cdp, Vermont | When it became noon, and the sun rose higher, ... | 728175 |
3 | carol.jackson@mail.com | Carol | Jackson | Montrose City, South Dakota | Nay, these are virtuous and immaculate beings!... | 728175 |
4 | laura.jackson@hotmail.com | Laura | Jackson | Smithfield Borough, Pennsylvania | "But my toils now drew near a close, and in tw... | 728175 |
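Note that toordinal() yields day-granularity ordinals, which is why the timestamp column above shows integers like 728175 rather than date strings. If you want human-readable dates, the ordinals can be mapped back with datetime.fromordinal. A quick sketch of the round trip:

```python
from datetime import datetime
from random import randrange

# Same date range used for the comments above
start_ordinal = datetime(year=1994, month=9, day=4).toordinal()
end_ordinal = datetime(year=1995, month=1, day=4).toordinal()

# Draw a random day ordinal and convert it back to a datetime
random_ordinal = randrange(start_ordinal, end_ordinal)
random_date = datetime.fromordinal(random_ordinal)
```

A finer-grained alternative would be to draw a random number of seconds between the two endpoints instead of a random day.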
We've generated a Pandas DataFrame with data that looks like comments on a blogpost. There are tons of ways to improve the quality of the data. For instance, we could have used bigger first and last name tables, generated the text using Markov chains (ideally trained on real comments), or distributed the posts unevenly across users. The last thing we need to do is save our work in CSV format:
comments_df.to_csv('comments_df.csv', quoting=QUOTE_ALL)
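As a taste of the Markov-chain improvement mentioned above, here is a minimal word-level bigram generator. The training text is a short stand-in sentence invented for this sketch; in practice you would train the model on a large body of real comments:

```python
import random
from collections import defaultdict

# Stand-in training text; real usage would feed actual comment text
corpus = "the monster walked and the monster spoke and the scientist ran".split()

# Bigram model: each word maps to the list of words observed after it,
# so repeated successors are naturally weighted by frequency
model = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    model[current_word].append(next_word)

def generate(seed, length=8):
    words = [seed]
    for _ in range(length - 1):
        followers = model.get(words[-1])
        if not followers:
            break  # dead end: no observed successor for the last word
        words.append(random.choice(followers))
    return ' '.join(words)

sentence = generate('the')
```

Each generated word pair is one that actually occurred in the training text, so with a real corpus the output reads far more naturally than sentences sampled independently.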