import nltk
import numpy as np
import pandas as pd
from csv import QUOTE_ALL
from datetime import datetime
from io import BytesIO
from random import choice, random, randrange, sample, randint
from urllib2 import urlopen
from zipfile import ZipFile
The purpose of this notebook is to show you how to generate a table with data that looks like comments on a blogpost (or posts to a forum if you're old like me). Although we could have generated purely random strings, we wanted the data to look as real as possible, so we make use of data published by the United States Census Bureau to simulate the entries based on their real probability of occurrence.
The generated data will take the form of a Pandas DataFrame with the following columns:
Name | Description |
---|---|
id | An autogenerated sequence number |
timestamp | Timestamp of the date and time when the post was made |
email | The e-mail of the user |
first_name | First name of the user |
last_name | Last name of the user |
place | Place of residence of the user |
text | The actual text of the post |
In order to generate meaningful text, we'll make use of the NLTK library to split a public-domain text into sentences and pick from them at random.
The modules imported at the top of this notebook are required to run it.
As we mentioned in the introduction, we'll use data from the US Census Bureau. For the place of residence of the user we'll take random entries from the 2013 Gazetteer files. Although this is not statistically appropriate, as the population of Lost Springs, Wyoming is quite different from that of New York City, we wanted to keep the post as simple as possible. In the next section we'll use a proper method for generating names that can also be adapted to places (you need to get the population estimates data sets for that to work).
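As an aside, population-weighted sampling (the statistically proper approach alluded to above) is easy to do with numpy, since numpy.random.choice accepts a probability vector. A minimal sketch with made-up population figures (these numbers are illustrative, not census data):

```python
import numpy as np

# Hypothetical places and populations (illustrative figures, not census data)
places = ['Lost Springs', 'New York City', 'Springfield']
populations = np.array([4.0, 8400000.0, 60000.0])

# Turn the populations into a probability vector that sums to 1
probabilities = populations / populations.sum()

# Draw places in proportion to their population; the big city dominates
draws = np.random.choice(places, size=1000, p=probabilities)
```

With real data, the probability vector would come from the population estimates data sets instead of the hard-coded numbers above.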
with ZipFile(BytesIO(urlopen('http://www2.census.gov/geo/gazetteer/2013_Gazetteer/2013_Gaz_place_national.zip').read())) as zip_file:
    gaz_place_national_2013_df = pd.read_csv(zip_file.open('2013_Gaz_place_national.txt'), sep='\t')
gaz_place_national_2013_df.head()
 | USPS | GEOID | ANSICODE | NAME | LSAD | FUNCSTAT | ALAND | AWATER | ALAND_SQMI | AWATER_SQMI | INTPTLAT | INTPTLONG
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | AL | 100100 | 2582661 | Abanda CDP | 57 | S | 7764034 | 34284 | 2.998 | 0.013 | 33.091627 | -85.527029 |
1 | AL | 100124 | 2403054 | Abbeville city | 25 | A | 40255362 | 107642 | 15.543 | 0.042 | 31.564689 | -85.259124 |
2 | AL | 100460 | 2403063 | Adamsville city | 25 | A | 65064187 | 29719 | 25.121 | 0.011 | 33.605750 | -86.974650 |
3 | AL | 100484 | 2405123 | Addison town | 43 | A | 9753292 | 83417 | 3.766 | 0.032 | 34.202681 | -87.178004 |
4 | AL | 100676 | 2405125 | Akron town | 43 | A | 1776164 | 13849 | 0.686 | 0.005 | 32.879495 | -87.741679 |
Notice that the state of each place appears abbreviated. If we want the full name, we can make use of the ANSI State Codes provided by the US Census Bureau.
state_df = pd.read_csv(urlopen('http://www.census.gov/geo/reference/docs/state.txt'), sep='|', dtype={'STATE': 'str'})
state_df.head()
 | STATE | STUSAB | STATE_NAME | STATENS
---|---|---|---|---
0 | 01 | AL | Alabama | 1779775 |
1 | 02 | AK | Alaska | 1785533 |
2 | 04 | AZ | Arizona | 1779777 |
3 | 05 | AR | Arkansas | 68085 |
4 | 06 | CA | California | 1779778 |
places_df = pd.merge(gaz_place_national_2013_df, state_df[['STATE_NAME', 'STUSAB']], left_on='USPS', right_on='STUSAB')[['USPS', 'NAME', 'STATE_NAME']]
places_df.head()
 | USPS | NAME | STATE_NAME
---|---|---|---
0 | AL | Abanda CDP | Alabama |
1 | AL | Abbeville city | Alabama |
2 | AL | Adamsville city | Alabama |
3 | AL | Addison town | Alabama |
4 | AL | Akron town | Alabama |
For the names of the people we'll use the 1990 (for given names) and 2000 (for last names) census data. This time we won't be choosing entries willy-nilly: we want the first and last name frequencies to mimic what happens in real life. To do that, we'll build a cumulative frequency distribution for each data set. To save some memory, we'll only take the top 50 first and last names.
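The sampling mechanism behind the cumulative frequency distribution is simple: draw a uniform random number and pick the first entry whose cumulative frequency exceeds it, so common names are picked proportionally more often. A minimal sketch with a made-up three-name distribution (the names and weights here are illustrative, not census figures):

```python
import random

# Toy frequency table: names with made-up relative weights
names = ['SMITH', 'JOHNSON', 'WILLIAMS']
weights = [5.0, 3.0, 2.0]

# Build the cumulative frequency distribution, normalized so it ends at 1.0
total = sum(weights)
cfd = []
acc = 0.0
for w in weights:
    acc += w / total
    cfd.append(acc)

def sample_name():
    r = random.random()
    # Pick the first name whose cumulative frequency exceeds r
    for name, bound in zip(names, cfd):
        if r < bound:
            return name
    return names[-1]  # guard against floating-point edge cases

# Over many draws the sample frequencies approach the weights
counts = {name: 0 for name in names}
for _ in range(10000):
    counts[sample_name()] += 1
```

This is exactly what the boolean-mask lookups against the `cfd` column do later on, just spelled out with plain Python lists.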
with ZipFile(BytesIO(urlopen('https://www.census.gov/genealogy/www/data/2000surnames/names.zip').read())) as zip_file:
    app_c_df = pd.read_csv(zip_file.open('app_c.csv'))
app_c_df_50 = app_c_df[:50][['name', 'count']]
app_c_df_50['prop'] = app_c_df_50['count'] / float(app_c_df_50['count'].sum())
app_c_df_50['cfd'] = app_c_df_50['prop'].cumsum()
dist_female_first_df = pd.read_fwf(
    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.female.first'),
    colspecs=((0, 15), (15, 20), (21, 27), (28, 35)), header=None,
    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')
)
dist_female_first_df_50 = dist_female_first_df[:50][['name', 'freq_in_percent']]
dist_female_first_df_50['prop'] = dist_female_first_df_50['freq_in_percent'] / dist_female_first_df_50['freq_in_percent'].sum()
dist_female_first_df_50['cfd'] = dist_female_first_df_50['prop'].cumsum()
dist_female_first_df_50.head()
 | name | freq_in_percent | prop | cfd
---|---|---|---|---
0 | MARY | 2.629 | 0.088032 | 0.088032 |
1 | PATRICIA | 1.073 | 0.035930 | 0.123962 |
2 | LINDA | 1.035 | 0.034657 | 0.158619 |
3 | BARBARA | 0.980 | 0.032815 | 0.191435 |
4 | ELIZABETH | 0.937 | 0.031376 | 0.222810 |
dist_male_first_df = pd.read_fwf(
    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'),
    colspecs=((0, 15), (15, 20), (21, 27), (28, 35)), header=None,
    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')
)
dist_male_first_df_50 = dist_male_first_df[:50][['name', 'freq_in_percent']]
dist_male_first_df_50['prop'] = dist_male_first_df_50['freq_in_percent'] / dist_male_first_df_50['freq_in_percent'].sum()
dist_male_first_df_50['cfd'] = dist_male_first_df_50['prop'].cumsum()
dist_male_first_df_50.head()
 | name | freq_in_percent | prop | cfd
---|---|---|---|---
0 | JAMES | 3.318 | 0.070376 | 0.070376 |
1 | JOHN | 3.271 | 0.069379 | 0.139754 |
2 | ROBERT | 3.143 | 0.066664 | 0.206418 |
3 | MICHAEL | 2.629 | 0.055762 | 0.262180 |
4 | WILLIAM | 2.451 | 0.051986 | 0.314166 |
In order to mimic what real blogs or sites usually have, we'll generate a user table and then choose one of its entries as the author of each simulated post. First we'll generate a list of first names (with an equal distribution of males and females) and last names (yes, I know that first and last names are correlated if we take ethnic origin into account, but we'll ignore that fact).
users = []
emails = set()
email_domains = ('@gmail.com', '@yahoo.com', '@hotmail.com', '@outlook.com', '@mail.com', '@inbox.com', '@yandex.com')
for i in range(500):
    user = dict()
    # Name
    random_lastname = random()
    user['last_name'] = app_c_df_50[random_lastname < app_c_df_50.cfd].iloc[0]['name'].capitalize()
    random_gender = random()
    random_name = random()
    if random_gender < 0.5:
        user['first_name'] = dist_female_first_df_50[random_name < dist_female_first_df_50.cfd].iloc[0]['name'].capitalize()
    else:
        user['first_name'] = dist_male_first_df_50[random_name < dist_male_first_df_50.cfd].iloc[0]['name'].capitalize()
    # E-mail
    email_domain = choice(email_domains)
    email = '{0}.{1}{2}'.format(user['first_name'].lower(), user['last_name'].lower(), email_domain)
    if email not in emails:
        user['email'] = email
    else:
        # Disambiguate duplicates with a zero-padded random hex suffix
        user['email'] = '{0}.{1}_{2:04x}{3}'.format(
            user['first_name'].lower(), user['last_name'].lower(), randrange(16**4), email_domain
        )
    emails.add(user['email'])
    # Place
    place = places_df.ix[np.random.choice(places_df.index.values)]
    user['place'] = '{0}, {1}'.format(place['NAME'], place['STATE_NAME']).title()
    users.append(user)
users_df = pd.DataFrame(users)
users_df.head()
 | email | first_name | last_name | place
---|---|---|---|---
0 | anna.davis@yahoo.com | Anna | Davis | Climax Springs Village, Missouri |
1 | richard.davis@inbox.com | Richard | Davis | Moapa Valley Cdp, Nevada |
2 | amanda.nelson@mail.com | Amanda | Nelson | County Center Cdp, Virginia |
3 | david.jackson@gmail.com | David | Jackson | Decatur City, Texas |
4 | carol.miller@yandex.com | Carol | Miller | Marshall City, Michigan |
In order to generate the text of the comments, we'll take random sentences from Mary Shelley's Frankenstein; Or, The Modern Prometheus, split into sentences with NLTK's sent_tokenize function. For the timestamps of the messages, we'll just pick a random point in time between September 4th, 1994 and January 4th, 1995 (right around the time [that other Frankenstein](http://www.imdb.com/title/tt0109836/) was released).
frankenstein_sentences = nltk.sent_tokenize(urlopen('http://www.gutenberg.org/ebooks/84.txt.utf-8').read().replace('\r\n', ' '))
start_datetime = datetime(year=1994,month=9,day=4).toordinal()
end_datetime = datetime(year=1995,month=1,day=4).toordinal()
comments = []
for i in range(1000):
    comment = dict()
    comment['timestamp'] = randrange(start_datetime, end_datetime)
    comment['text'] = ' '.join(sample(frankenstein_sentences, randint(1, 5)))
    user = users_df.ix[np.random.choice(users_df.index.values)]
    comment['email'] = user['email']
    comment['first_name'] = user['first_name']
    comment['last_name'] = user['last_name']
    comment['place'] = user['place']
    comments.append(comment)
comments_df = pd.DataFrame(sorted(comments, key=lambda p: p['timestamp']))
comments_df.index.name = 'id'
comments_df.head()
id | email | first_name | last_name | place | text | timestamp
---|---|---|---|---|---|---
0 | sharon.anderson@yandex.com | Sharon | Anderson | Springville Village, New York | "Felix had accidentally been present at the tr... | 728175 |
1 | barbara.nelson@inbox.com | Barbara | Nelson | Prince Cdp, West Virginia | She welcomed me with the greatest affection. T... | 728175 |
2 | timothy.wright@yahoo.com | Timothy | Wright | East Burke Cdp, Vermont | When it became noon, and the sun rose higher, ... | 728175 |
3 | carol.jackson@mail.com | Carol | Jackson | Montrose City, South Dakota | Nay, these are virtuous and immaculate beings!... | 728175 |
4 | laura.jackson@hotmail.com | Laura | Jackson | Smithfield Borough, Pennsylvania | "But my toils now drew near a close, and in tw... | 728175 |
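Note that toordinal() yields day-granularity ordinals, which is why the timestamp column above shows integers like 728175 rather than date strings. If you want human-readable dates, the ordinals can be mapped back with datetime.fromordinal. A quick sketch of the round trip:

```python
from datetime import datetime
from random import randrange

# Same date range used for the comments above
start_ordinal = datetime(year=1994, month=9, day=4).toordinal()
end_ordinal = datetime(year=1995, month=1, day=4).toordinal()

# Draw a random day ordinal and convert it back to a datetime
random_ordinal = randrange(start_ordinal, end_ordinal)
random_date = datetime.fromordinal(random_ordinal)
```

A finer-grained alternative would be to draw a random number of seconds between the two endpoints instead of a random day.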
We've generated a Pandas DataFrame with data that looks like comments on a blogpost. There are tons of ways to improve the quality of the data. For instance, we could have used bigger first and last name tables, generated the text using Markov chains (ideally trained on real comments), or distributed the posts unevenly across users. The last thing we need to do is save our work in CSV format:
comments_df.to_csv('comments_df.csv', quoting=QUOTE_ALL)
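As a taste of the Markov-chain improvement mentioned above, here is a minimal word-level bigram generator. The training text is a short stand-in sentence invented for this sketch; in practice you would train the model on a large body of real comments:

```python
import random
from collections import defaultdict

# Stand-in training text; real usage would feed actual comment text
corpus = "the monster walked and the monster spoke and the scientist ran".split()

# Bigram model: each word maps to the list of words observed after it,
# so repeated successors are naturally weighted by frequency
model = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    model[current_word].append(next_word)

def generate(seed, length=8):
    words = [seed]
    for _ in range(length - 1):
        followers = model.get(words[-1])
        if not followers:
            break  # dead end: no observed successor for the last word
        words.append(random.choice(followers))
    return ' '.join(words)

sentence = generate('the')
```

Each generated word pair is one that actually occurred in the training text, so with a real corpus the output reads far more naturally than sentences sampled independently.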