This analysis attempts to measure the progress of users of Powerpoetry, an online poetry platform, along several dimensions, including rhyming, n-grams, and word frequencies such as positive/negative sentiment and the use of abstract and political words, among others. The data analyzed contains 128,000 poems spanning Aug 2012 to Feb 2014, during which the platform experienced exponential growth.
We see some evidence that poets rhyme more as they progress, as evidenced by the beta distribution of slant rhyme frequency. There is also some evidence to suggest that poets make fewer spelling mistakes, use more abstract words, and utilize more words connoting overconfidence. Fewer spelling errors are certainly desirable. An increase in the frequency of abstract words might indicate growing sophistication in poetic skill, and an increase in the overconfidence metric might suggest that poets become more confident about their feelings and express them more explicitly as they become more proficient. Because the analysis targets return users, a small subset of the entire user base, the findings might not generalize well to the larger population.
Secondly, we ask whether poems posted from richer neighborhoods show stronger literacy characteristics than those from poorer neighborhoods. We utilize Powerpoetry data that includes user location at the ZIP code level and enrich it with Census demographic data at the tract level to approximate income for each user. The analysis suggests that higher income levels might be associated with better literacy skills. Two caveats apply. First, ZIP-code / tract-level location might not be granular enough and comes with high within-region income variance, so the median income might not represent the income of a given poet well. Secondly, trigram frequency might not be the best metric of language sophistication; in future iterations we might need to design a more refined measure of literacy given a poem.
import json
import pandas as pd
from itertools import *
from matplotlib import cm
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
%matplotlib inline
poetry_features = pd.read_csv('data/poetry_features.csv')
#Dropping the missing data
poetry_features = poetry_features.dropna(axis=0)
poetry_features = poetry_features.drop(['punctFreq','alliterFreq','unigramFreq'],axis=1)
Powerpoetry experienced a substantial jump in traffic, with 10,000 poems on average added to the website monthly in the latter part of 2013.
from datetime import datetime
dt = [datetime.utcfromtimestamp(ts) for ts in poetry_features.created]
pi = pd.PeriodIndex([pd.Period(d,'D') for d in dt])
monthly_ = pd.DataFrame(np.ones(len(pi)),index=pi).resample('M',how='sum')
ax = monthly_.plot(kind='bar',color=["#7A68A6"],label='test',alpha=0.7)
ax.set_ylabel('Number of Poems')
ax.set_xlabel('Month')
ax.set_title('Number of Poems posted on the Powerpoetry website')
Here is a brief description for each variable in our feature space. The data is available upon request.
Rhyming:
'perfectRhymeFreq': Frequency of perfect rhymes in the poem.
'slantRhymeFreq': Frequency of slant rhymes in the poem.
'alliterFreq': Frequency of alliteration in the poem.
Word Frequencies:
Please refer to the Harvard General Inquirer for more details.
'ABS': Frequency of abstract words in the poem.
'EnlTot': Frequency of words referring to knowledge, insight, and information concerning personal and cultural relations.
'Female': Frequency of words referring to women and social roles associated with women.
'Male': Frequency of words referring to men and social roles associated with men.
'Object': References to objects.
'Polit': Frequency of words having a clear political character, including political roles, collectivities, acts, ideas, ideologies, and symbols.
'Race': Frequency of words referring to racial or ethnic characteristics.
'Relig': Words pertaining to religious, metaphysical, supernatural, or relevant philosophical matters.
'St': Words connoting overstated/understated expressions.
'WlbPhycs': Words connoting the physical aspects of well-being, including its absence.
'WlbPsyc': Words connoting the psychological aspects of well-being, including its absence.
'PosNeg': Words with a positive or negative outlook.
Ngrams:
The COCA corpus was used in the analysis.
'unigramFreq': Average occurrence of the poem's words in English, in percentage terms.
'bigramFreq': Log average occurrence count of the poem's two-word combinations in English.
'trigramFreq': Log average occurrence count of the poem's three-word combinations in English.
'misspeltWord': Frequency of misspelt words in the poem.
'sentence_count': Sentence count of the poem.
'wordCount': Log word count of the poem.
'punctFreq': Punctuation frequency in the poem.
One of our takeaways was the prevalence of dumping behavior among Powerpoetry users, defined as posting multiple poems in a short time frame. We think users post many poems at once to increase their chances in competitions that occasionally take place on the website. Because progress is measured over the sequence of poems posted, we control for this behavior by taking the mean of multiple entries posted on the same day. Although this is not a perfect correction, we think it still allows us to measure progress.
The code below achieves this correction.
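Before the full pipeline, the same-day averaging can be illustrated on a hypothetical toy frame (the `uid`, `day`, and `ABS` values below are invented): two poems posted by the same user on the same day collapse into a single averaged row.

```python
import pandas as pd

# Invented example: user 7 posts twice on 2013-05-01 (dumping), once on 2013-05-02.
toy = pd.DataFrame({
    'uid': [7, 7, 7, 9],
    'day': ['2013-05-01', '2013-05-01', '2013-05-02', '2013-05-01'],
    'ABS': [0.2, 0.4, 0.6, 0.1],
})
# Group by user and day and take the mean, as in the correction below.
corrected = toy.groupby(['uid', 'day'], as_index=False).mean()
print(corrected)  # user 7's two same-day poems become one row with ABS = 0.3
```

The real data uses a daily `PeriodIndex` instead of a string date, but the grouping logic is the same.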
#Daily Period Index
poetry_features['periodindex'] = pd.PeriodIndex([pd.Period(d,'D') for d in dt])
#Group By user id and PeriodIndex. Take the Mean to take into account dumping behavior.
grouped = poetry_features.groupby(['uid','periodindex']).aggregate(np.mean).sort_index()
ut = grouped.index
#Reindex to count from date
count_ = pd.DataFrame(grouped.index.tolist())[0].value_counts().sort_index()
count = [i+1 for user,x in count_.iteritems() for i in range(x)]
#Create indexes with user id, period and poem number to be used later in the analysis>
nix = [(u[0],u[1],c)for c,u in zip(count,ut)]
nth = [x[2] for x in nix]
uids = [x_[0] for x_ in nix]
grouped.index = uids #Change the index to user IDs.
#Clean nonrequired columns and admin uids.
def clean(df):
#Clean. #Drop User == 1 and 0 and 4. These users represent admin.
try:
df = df.drop(['created','nid'],axis=1)
df = df.drop([(0,),(1,),(4,)])
except:
pass
return df
grouped = clean(grouped)
Next, we normalize the dataset.
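As a quick sanity check, the z-score transform applied below can be verified on a toy frame (invented numbers): after the transform every column should have mean approximately 0 and standard deviation approximately 1.

```python
import numpy as np
import pandas as pd

# Invented toy data with two features.
df = pd.DataFrame({'trigramFreq': [1.0, 2.0, 3.0, 4.0],
                   'wordCount':   [10.0, 30.0, 20.0, 40.0]})
# Same z-score normalization used on the full feature set.
normed = (df - df.mean()) / df.std()
print(normed.mean().round(10).tolist())  # ~[0.0, 0.0]
```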
grouped = (grouped - grouped.mean()) / grouped.std()
grouped.describe()
 | ABS | EnlTot | Female | Male | Object | Polit | PosNeg | Race | Relig | St | WlbPhycs | WlbPsyc | bigramFreq | misspeltWord | perfectRhymeFreq | sentence_count | slantRhymeFreq | trigramFreq
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 | 8.754000e+04 |
mean | 8.949534e-14 | -6.799932e-14 | -9.954595e-16 | 6.218333e-15 | 9.712513e-14 | 7.529647e-14 | 7.894048e-16 | 5.525278e-15 | -9.099533e-15 | 1.999176e-14 | 7.724482e-14 | 3.370139e-14 | 6.606466e-15 | -9.997251e-15 | -1.274752e-14 | -7.508537e-15 | -5.560997e-14 | -7.939848e-14 |
std | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 |
min | -1.462259e+00 | -1.859203e+00 | -4.063208e-01 | -4.393735e-01 | -1.589036e+00 | -9.381900e-01 | -1.855670e+01 | -2.724788e-01 | -5.004543e-01 | -1.093147e+01 | -1.090928e+00 | -7.766160e-01 | -1.347624e+01 | -5.741156e-01 | -5.568874e-01 | -7.019004e-01 | -1.092530e+00 | -7.043686e+00 |
25% | -6.164054e-01 | -6.261912e-01 | -4.063208e-01 | -4.393735e-01 | -6.322019e-01 | -6.156999e-01 | -5.352067e-01 | -2.724788e-01 | -5.004543e-01 | -5.839927e-01 | -6.394884e-01 | -7.766160e-01 | -3.804559e-01 | -5.741156e-01 | -5.568874e-01 | -6.153846e-01 | -8.211124e-01 | -3.470040e-01 |
50% | -1.367221e-01 | -1.277747e-01 | -4.063208e-01 | -4.393735e-01 | -1.286048e-01 | -2.082836e-01 | -2.952130e-02 | -2.724788e-01 | -5.004543e-01 | -3.132642e-02 | -1.762670e-01 | -2.101063e-01 | 4.435616e-02 | -2.316264e-01 | -5.568874e-01 | -2.693216e-01 | -1.421871e-01 | 1.299220e-01 |
75% | 4.252693e-01 | 4.688730e-01 | -1.070770e-01 | -2.010037e-02 | 4.663858e-01 | 3.200626e-01 | 5.113170e-01 | -2.724788e-01 | 1.919454e-01 | 5.243673e-01 | 3.933500e-01 | 3.317364e-01 | 4.898397e-01 | 1.867022e-01 | 2.179439e-01 | 2.497730e-01 | 5.366295e-01 | 5.435602e-01 |
max | 3.681263e+01 | 1.372365e+01 | 1.994226e+01 | 1.393905e+01 | 3.012319e+01 | 2.185111e+01 | 1.805389e+01 | 3.205956e+01 | 2.601911e+01 | 2.241015e+01 | 1.944956e+01 | 3.773863e+01 | 4.076357e+00 | 3.022013e+01 | 1.462618e+01 | 9.636878e+01 | 1.031159e+01 | 3.823251e+00 |
Number of poems used in the analysis before/after correcting for dumping behavior:
print 'Number of poems {0} on the website'.format(len(poetry_features))
print 'Number of poems {0} on the website after dumping behavior is taken into account'.format(len(grouped))
Number of poems 127023 on the website
Number of poems 87540 on the website after dumping behavior is taken into account
#Number of poets posted on the website,
print 'Number of poets {0} on the website'.format(len(set(poetry_features.uid)))
Number of poets 69752 on the website
Let's start with some exploratory analysis. The first chart shows the number of return poems posted on the site each month. Since the goal is to measure user progress, the user profile sits at the heart of our analysis. There appear to be approximately 1,500 return users on a monthly basis.
#Count the number of returns
return_traffic = pd.DataFrame(map((lambda x: (x[1], int(x[2] > 1))),nix)).groupby(0).sum().resample('M',how='sum')
#Count the number of first timers.
#first_time_traffic = pd.DataFrame(map((lambda x: (x[1], int(x[2] == 1))),nix)).groupby(0).sum().resample('M')
#Merge
traffic = pd.concat([return_traffic],axis=1)
traffic.columns = ['return']
ax = traffic.plot(kind='bar',color=["#7A68A6"],label='test',alpha=0.7)
ax.set_ylabel('Number of Visitors')
ax.set_xlabel('Month')
ax.set_title('Breakdown of return visitors to Powerpoetry website')
Compare this to the number of first-time poems, which hovers around 5,000.
first_time_traffic = pd.DataFrame(map((lambda x: (x[1], int(x[2] == 1))),nix)).groupby(0).sum().resample('M',how='sum')
traffic = pd.concat([first_time_traffic],axis=1)
traffic.columns = ['first time']
ax = traffic.plot(kind='bar',color=["#7A68A6"],label='test',alpha=0.7)
ax.set_ylabel('Number of Visitors')
ax.set_xlabel('Month')
ax.set_title('Breakdown of first time visitors to Powerpoetry website')
Let's also look at the distribution of the number of poems submitted per user. The distribution exhibits log-log characteristics. One user has single-handedly posted 377 poems on the website. The chart also makes clear that the overwhelming majority of poets posted only one poem.
nthpoem = np.array([i[2] for i in nix])
bin_counts_cumulative = np.bincount(nthpoem)[1:][::-1] #Reverse Order
bin_counts = [bin_count - bin_counts_cumulative[i-1] for i, bin_count in enumerate(bin_counts_cumulative) if i > 0] #Skip the first number
bin_counts = [bin_counts_cumulative[0]] + bin_counts #Append the first entry back
bin_counts.reverse() #Reverse
ind = range(1,1 + len(bin_counts)) #the x Locations for the groups
for i, n in enumerate(bin_counts):
    if i < 20:
        print 'number of poets who posted {0} poem(s) : {1}'.format(i+1,n)
#Chart the distribution
fig, ax = plt.subplots()
ax.bar(ind,bin_counts,color=["#7A68A6"],alpha=0.7)
ax.set_ylabel('Number of poets')
ax.set_xlabel('Number of poems')
ax.set_title('Distribution of Activity of Poets')
ax.annotate('Hello most active user !', xy=(ind[-1], 1), xytext=(300, 20000),
arrowprops=dict(facecolor='black', shrink=0.05))
number of poets who posted 1 poem(s) : 59987
number of poets who posted 2 poem(s) : 6723
number of poets who posted 3 poem(s) : 1611
number of poets who posted 4 poem(s) : 675
number of poets who posted 5 poem(s) : 282
number of poets who posted 6 poem(s) : 159
number of poets who posted 7 poem(s) : 79
number of poets who posted 8 poem(s) : 52
number of poets who posted 9 poem(s) : 30
number of poets who posted 10 poem(s) : 21
number of poets who posted 11 poem(s) : 27
number of poets who posted 12 poem(s) : 22
number of poets who posted 13 poem(s) : 13
number of poets who posted 14 poem(s) : 10
number of poets who posted 15 poem(s) : 7
number of poets who posted 16 poem(s) : 5
number of poets who posted 17 poem(s) : 6
number of poets who posted 18 poem(s) : 2
number of poets who posted 19 poem(s) : 10
number of poets who posted 20 poem(s) : 2
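The cumulative differencing used above can be verified on a toy example: `np.bincount` over the poem sequence numbers counts poets with *at least* k poems, and differencing those counts recovers poets with *exactly* k poems. The data here is invented.

```python
import numpy as np

# Poet A posted 3 poems (sequence numbers 1, 2, 3); poet B posted 1 poem.
nthpoem = np.array([1, 2, 3, 1])
# bincount[k] (skipping index 0) = number of poets with at least k poems.
at_least = np.bincount(nthpoem)[1:]              # [2, 1, 1]
# Differencing the "at least" counts gives the exact counts.
exactly = at_least - np.append(at_least[1:], 0)  # [1, 0, 1]
print(exactly)  # one poet with exactly 1 poem, one with exactly 3
```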
Below we measure language progress in three ways. First, we calculate the progress slope (beta) for each user in each feature and look at the distribution of betas. A distribution whose mean is statistically significantly different from zero might imply progress on average for that feature. Secondly, we take the median score for each poem in the sequence (1st poem, 2nd poem, etc.) and fit a linear model; a significant non-zero slope might be a sign that users improve on average. Third, we employ an A/B test to discern differences between the first set of poems and later ones.
For the analysis we only consider poets who posted at least 10 poems. We then measure the beta for each user in each feature (trigram, bigram, PosNeg etc) and analyze the distribution for each feature.
There are 154 poets with at least 10 poems. We will later see that the results differ with different cutoff points.
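Before the full loop below, here is a minimal sketch of the per-user slope estimate on invented scores. For a single regressor, `scipy.stats.linregress` yields the same slope and p-value as the simple `sm.OLS` fit used in `distribution_betas`.

```python
import numpy as np
from scipy import stats

# Invented scores for one feature over one user's 10 poems, trending upward.
scores = np.array([0.1, 0.15, 0.3, 0.28, 0.4, 0.45, 0.5, 0.55, 0.6, 0.7])
x = np.arange(len(scores)) + 1  # poem number 1..10

# Slope of the linear fit = the "progress beta" for this user/feature.
slope, intercept, r, p, se = stats.linregress(x, scores)
print(slope, p)  # positive slope -> apparent improvement on this feature
```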
def seasoned(cut,nix,grouped):
ix = filter((lambda x: x[2] >=cut),nix) #Seasoned poets: those with at least `cut` poems posted
unique_uids = np.unique([i[0] for i in ix])
unique_nids = [ix_[2] for ix_ in nix if ix_[0] in unique_uids] #All the nids for seasoned
features_ = grouped.loc[unique_uids] #Feature set for seasoned poets.
#print len(unique_uids)
return(features_, unique_nids, unique_uids)
def distribution_betas(seasoned_poets,unique_uids):
'Looping through the dataframe to retrieve the scores for each user at a time and calculate the progress'
sig = 0
betas_user = np.zeros((len(unique_uids),seasoned_poets.shape[1]))
for x,uid in enumerate(unique_uids):
scores = seasoned_poets.loc[uid]
#Iterate over the score
for z,(feature,score) in enumerate(scores.iteritems()):
#Regression coefficients
y = score.values
X = np.arange(len(score))+1
X = sm.add_constant(X)
res = sm.OLS(y,X).fit() #OLS
#if x < 1: #Show for the first user
# print feature, uid
# print res.summary()
#
betas_user[x,z] = res.params[1]
if res.f_pvalue<0.05: #Take the beta with a p-value threshold!
sig += 1
print '{0} significant results out of {1} regressions'.format(sig,len(unique_uids)*seasoned_poets.shape[1])
#Chart Distribution of Betas
figsize(25, 10)
fig = plt.figure()
#Subplots within a single chart
betas_, columns_ = betas_user.T, seasoned_poets.columns
for k, (beta, column) in enumerate(zip(betas_,columns_)):
sx = plt.subplot(int(betas_.shape[0]/2+1), 2, k+1)
plt.rc('axes', color_cycle=['r', 'g', 'b', 'y'])
#plt.xlabel(seasoned_poets.columns[k])
#Label
plt.text(beta.min(), 40, 'average beta is {0}'.format(round(beta.mean(),2)), fontsize=15)
plt.setp(sx.get_yticklabels(), visible=False)
plt.hist(beta,color=cm.jet(1.*k/len(betas_)), alpha=0.4, bins= np.linspace(beta.min(), beta.max(), 20)) #beta is the all the betas for all the users for each matric.
plt.ylim(0,50)
plt.legend([column])
plt.vlines(0, 0, 500, color="k", linestyles="--", lw=1)
#Significance test
p_val = stats.ttest_1samp(beta, 0)[1]
if p_val < 0.10: plt.text(beta.min(), 30, 'statistically significant at {0}'.format(round(p_val,3)), fontsize=15)
#plt.autoscale(tight=True)
cut = 10
seasoned_poets, unique_nids,unique_uids = seasoned(cut,nix,grouped) #Returns poems from poets with more than n poems. UID is the index
distribution_betas(seasoned_poets,unique_uids)
166 significant results out of 2601 regressions
We see that around 6 percent of the betas are significant with a minimum of 10 poems! A few things to note. First, keep in mind that the features are normalized, so the betas are somewhat comparable. Secondly, while for some measures a positive beta translates to improvement, for others the opposite is true. For example, we associate a decrease in trigram frequency with improvement: the assumption is that a user can come up with more unexpected combinations of words as their language skills improve.
We perform a simple t-test to see whether the progress beta is statistically significantly different from zero. Although the mean betas are mostly close to zero, we see that poets rhyme their poems more as they progress, as evidenced by the beta distribution of slant rhyme frequency. Secondly, the frequency of abstract words increases as poets post more; this is also significant, but only at the 10% level.
There is no compelling reason to require exactly 10 poems. If we extend the analysis to different cutoff points (11, 12, ...), we see that St and the frequency of misspelt words also show improvement. That is to say, poets make fewer spelling mistakes and utilize more words connoting overconfidence. Fewer spelling errors are certainly desirable, and an increase in the St metric might suggest that poets become more confident about their feelings and express them more explicitly as they become more proficient.
cut = 12
seasoned_poets, unique_nids,unique_uids = seasoned(cut,nix,grouped) #Returns poems from poets with more than n poems. UID is the index
distribution_betas(seasoned_poets,unique_uids)
112 significant results out of 1785 regressions
Next, we move to the estimation of the progression slope for each feature on average. This is done by taking the median score for nth poem across all poets before applying a linear fit.
def abbreviate(seasoned_poets,cut):
seasoned_poets_abbreviated = seasoned_poets.copy()
seasoned_poets_abbreviated.index = unique_nids
seasoned_poets_abbreviated['nid'] = unique_nids
#Filter by cutoff point. Take only nth poems up to cut
filter_ = seasoned_poets_abbreviated['nid']<=cut
seasoned_poets_abbreviated = seasoned_poets_abbreviated[filter_] #Take only the first 'cut' poems
seasoned_poets_abbreviated = seasoned_poets_abbreviated.drop('nid',axis=1)
return seasoned_poets_abbreviated
def average_progress_chart(seasoned_poets_abbreviated,unique_nids):
figsize(25, 10)
#print seasoned_poets_abbreviated, ' poets in total'
betas_median_feature = np.zeros(seasoned_poets_abbreviated.shape[1])
fig = plt.figure()
plt.subplots_adjust(hspace=0.,wspace=0.)
for k,(column,feature) in enumerate(seasoned_poets_abbreviated.iteritems()):
#Plot params
sx = plt.subplot(int(betas_median_feature.shape[0]/2+1), 2, k+1)
plt.rc('axes', color_cycle=['r', 'g', 'b', 'y'])
plt.setp(sx.get_yticklabels(), visible=False)
#Median for each poem number first
#print column
median_feature = feature.groupby(feature.index).median()
y = median_feature.values
X = median_feature.index
X = sm.add_constant(X)
res = sm.OLS(y,X).fit()
betas_median_feature[k] = res.params[1]
#Plot the Progress
plt.setp(sx.get_yticklabels(), visible=False)
plt.text(1, y.max(), 'beta of the slope is {:10.4f} with p-value of {:10.4f}'.format(res.params[1],res.f_pvalue), fontsize=12)
#Scatter
sx = plt.scatter(X[:,1],y, alpha=0.9,marker=(6,0))
#Line
x_ = np.unique(X[:,1])
y_ = (res.params[0] + res.params[1]*np.unique(X[:,1]))
#Fit the Line of the regression
col = cm.jet(1.*k/len(betas_median_feature))
plt.plot(x_,y_,color=col)
plt.legend([column])
plt.xlim(0,cut)
plt.autoscale(tight=True)
Again, the choice of cutoff point is somewhat arbitrary. We face a trade-off: the higher the cutoff, the further we can extend the progress curve (to the 12th poem, 13th poem, etc.), but the fewer poets have posted the required minimum and qualify. Since the distribution of poem counts is highly skewed, a large cutoff increases the risk that the results will not generalize to the entire population. The graph below shows the trade-off.
npoems = []
cutoff = []
for i in range(8,30):
seasoned_poets, unique_nids,unique_uids = seasoned(i,nix,grouped)
seasoned_poets_abbreviated = abbreviate(seasoned_poets,i)
cutoff.append(i)
npoems.append(seasoned_poets_abbreviated.shape[0]/i)
fig, ax = plt.subplots()
ax.plot(cutoff,npoems)
ax.set_ylabel('Number of poets with at least minimum number of poems')
ax.set_xlabel('Number of poems minimum')
Again somewhat arbitrarily, setting the minimum number of poems at 12 looks plausible: there are still about 100 poets who posted at least that many, and an inference can be made from 12 points in a linear setting.
i = 12
seasoned_poets, unique_nids,unique_uids = seasoned(i,nix,grouped)
seasoned_poets_abbreviated = abbreviate(seasoned_poets,i)
average_progress_chart(seasoned_poets_abbreviated,unique_nids)
At lower cutoff points (i = 12), we see that users take strides in expressing themselves, as the St feature (frequency of overstated/understated expressions) increases. We also find that users seem to use more abstract words as they post more poems.
With a larger required minimum (and fewer poets), sentence count and the frequency of political words show an increase. This might indicate that users have more to say as they write, and become more political in doing so. Additionally, using the mean score instead of the median reveals that the positivity of the poems improves on average with poem count.
On the other hand, we see a deterioration in trigram and bigram frequencies in general, the opposite of what we expected. It could be that our COCA dictionary fails to pick up combinations of less frequent words. In fact, we might need to rerun the numbers with the latest version of COCA, which is much more comprehensive than the one we used.
i = 20
seasoned_poets, unique_nids,unique_uids = seasoned(i,nix,grouped)
seasoned_poets_abbreviated = abbreviate(seasoned_poets,i)
average_progress_chart(seasoned_poets_abbreviated,unique_nids)
Another approach is to split each user's sequence into two (or more) subsets and test whether the later sample differs significantly from the earlier one. If the means of the two samples differ significantly, we can argue that poets display different characteristics in their later poems.
We use the non-parametric Kolmogorov–Smirnov (KS) test because the feature distributions are often not normal. As above, the first step is to take poets who have posted at least a certain number of poems; we then compare the first subset against the second. We keep the two sample sizes equal because the KS test is sensitive to differences in sample size.
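As a quick illustration of how `ks_2samp` behaves, here is a synthetic sketch (all data invented): a sample shifted by one standard deviation is flagged clearly, while a second sample drawn from the same distribution yields a much smaller KS statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
a = rng.normal(0.0, 1.0, 500)
b_same = rng.normal(0.0, 1.0, 500)   # drawn from the same distribution as a
b_shift = rng.normal(1.0, 1.0, 500)  # mean shifted by one standard deviation

stat_same, p_same = stats.ks_2samp(a, b_same)
stat_shift, p_shift = stats.ks_2samp(a, b_shift)
print(stat_same, stat_shift)  # the shifted sample produces a much larger KS statistic
```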
The example below compares the first 6 poems against the second 6. Both visually and statistically, it is hard to discern a difference between the two samples.
def ab_test(cut,nix,grouped):
seasoned_poets, unique_nids, unique_uids = seasoned(cut,nix,grouped)
seasoned_poets.index = unique_nids
print 'Total number of {0} poets'.format(len(unique_uids))
figsize(25, 10)
fig = plt.figure()
n = cut//2 #Take the first set of poems.
for k,(column,feature) in enumerate(seasoned_poets.iteritems()):
a = feature[seasoned_poets.index<=n]
b = feature[seasoned_poets.index>n]
b = b[b.index<=n*2] #Keep the sample sizes equal
sx = plt.subplot(int(seasoned_poets.shape[1]/2+1), 2, k+1)
plt.setp(sx.get_yticklabels(), visible=False)
lp = np.linspace(min(min(a),min(b)),max(max(a),max(b)),30)
plt.hist(a,bins=lp,alpha=0.5)
plt.hist(b,bins=lp,alpha=0.5)
plt.autoscale(tight=True)
plt.legend([column])
#two_sample = stats.ttest_ind(a, b)
#test_stat = stats.ranksums(a, b)
test_stat = stats.ks_2samp(a, b)
print 'For the feature {2}: the KS statistic is {0} and the p-value is {1}.'.format(test_stat[0],test_stat[1],column)
#Analyze 12 poems - First 6 vs second 6.
ab_test(12,nix,grouped)
Total number of 106 poets
For the feature ABS: the KS statistic is 0.062893081761 and the p-value is 0.155863525145.
For the feature EnlTot: the KS statistic is 0.0440251572327 and the p-value is 0.559451012376.
For the feature Female: the KS statistic is 0.0251572327044 and the p-value is 0.986840984667.
For the feature Male: the KS statistic is 0.0424528301887 and the p-value is 0.60633902974.
For the feature Object: the KS statistic is 0.0408805031447 and the p-value is 0.653867326542.
For the feature Polit: the KS statistic is 0.0377358490566 and the p-value is 0.747985464114.
For the feature PosNeg: the KS statistic is 0.0251572327044 and the p-value is 0.986840984667.
For the feature Race: the KS statistic is 0.0172955974843 and the p-value is 0.99997744142.
For the feature Relig: the KS statistic is 0.0503144654088 and the p-value is 0.387801857974.
For the feature St: the KS statistic is 0.0534591194969 and the p-value is 0.31529513865.
For the feature WlbPhycs: the KS statistic is 0.0581761006289 and the p-value is 0.225064703162.
For the feature WlbPsyc: the KS statistic is 0.0487421383648 and the p-value is 0.427654148908.
For the feature bigramFreq: the KS statistic is 0.0455974842767 and the p-value is 0.513764264484.
For the feature misspeltWord: the KS statistic is 0.0251572327044 and the p-value is 0.986840984667.
For the feature perfectRhymeFreq: the KS statistic is 0.0298742138365 and the p-value is 0.935720416706.
For the feature sentence_count: the KS statistic is 0.0393081761006 and the p-value is 0.701351051074.
For the feature slantRhymeFreq: the KS statistic is 0.0518867924528 and the p-value is 0.350319744302.
For the feature trigramFreq: the KS statistic is 0.059748427673 and the p-value is 0.199780810088.
It is also possible to vary the number of poems compared, because users might achieve progress at different points for different features. For example, a poet might start writing longer poems from the 5th poem on, but not make permanent gains in trigram frequency until much later.
Running the analysis with different cutoff points does not provide further insight, however. From an A/B test perspective, we cannot say that the two samples look different for any given feature.
ab_test(6,nix,grouped)
Total number of 474 poets
For the feature ABS: the KS statistic is 0.0295358649789 and the p-value is 0.558478974107.
For the feature EnlTot: the KS statistic is 0.0464135021097 and the p-value is 0.0908247320514.
For the feature Female: the KS statistic is 0.0253164556962 and the p-value is 0.747159977819.
For the feature Male: the KS statistic is 0.0260196905767 and the p-value is 0.716167130478.
For the feature Object: the KS statistic is 0.0344585091421 and the p-value is 0.361637420212.
For the feature Polit: the KS statistic is 0.0253164556962 and the p-value is 0.747159977819.
For the feature PosNeg: the KS statistic is 0.021800281294 and the p-value is 0.88473120683.
For the feature Race: the KS statistic is 0.0203938115331 and the p-value is 0.926460813379.
For the feature Relig: the KS statistic is 0.0274261603376 and the p-value is 0.652939937831.
For the feature St: the KS statistic is 0.0330520393812 and the p-value is 0.413171158957.
For the feature WlbPhycs: the KS statistic is 0.0232067510549 and the p-value is 0.834343257756.
For the feature WlbPsyc: the KS statistic is 0.0330520393812 and the p-value is 0.413171158957.
For the feature bigramFreq: the KS statistic is 0.0344585091421 and the p-value is 0.361637420212.
For the feature misspeltWord: the KS statistic is 0.0189873417722 and the p-value is 0.958143834265.
For the feature perfectRhymeFreq: the KS statistic is 0.0210970464135 and the p-value is 0.906787165354.
For the feature sentence_count: the KS statistic is 0.0182841068917 and the p-value is 0.970092145711.
For the feature slantRhymeFreq: the KS statistic is 0.0309423347398 and the p-value is 0.497907989944.
For the feature trigramFreq: the KS statistic is 0.0253164556962 and the p-value is 0.747159977819.
The second part of this write-up analyzes literacy skills across demographics, particularly income level versus trigram frequency. We take trigram frequency as a proxy for language mastery: advanced language skills should enable a poet to string together unexpected combinations of words, resulting in a lower trigram frequency.
For each geographical region, we calculate the median value of each feature, requiring a minimum number of poems per tract to avoid small-sample bias. We then run a linear regression to find the relationship between income and normalized trigram frequency. In the chart below, the x-axis is (log) income and the y-axis is the normalized trigram frequency for the region.
We notice some downward trend in trigram frequency, which is what we were hoping for.
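A toy version of the tract-level pipeline can illustrate the steps on synthetic data. The tract codes and incomes below are invented, and the negative income effect is built in by assumption; the point is only the mechanics: group poems by tract, take the median trigram frequency, and regress it on log income.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.RandomState(1)
incomes = {'T1': 30000, 'T2': 60000, 'T3': 120000}  # hypothetical tracts
rows = []
for tract, inc in incomes.items():
    # By construction, richer tracts get lower trigram frequencies, plus noise.
    freqs = -0.5 * np.log(inc) + rng.normal(0, 0.1, 30)
    rows += [{'TractCode': tract, 'income': inc, 'trigramFreq': f} for f in freqs]
poems = pd.DataFrame(rows)

# Median per tract, then a linear fit of median trigram frequency on log income.
by_tract = poems.groupby('TractCode').median()
slope, intercept, r, p, se = stats.linregress(np.log(by_tract['income']),
                                              by_tract['trigramFreq'])
print(slope)  # negative, recovering the built-in effect
```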
features = pd.read_csv('data/poetry_features.csv')
features = features.drop(['punctFreq','alliterFreq','unigramFreq','Race'],axis=1)
features.index = features['nid']
features = features.drop(['nid','created','uid'],axis=1)
tracts = pd.read_csv('data/location_tracts.csv')
loc = tracts[['nid','AcsHouseholdIncomeMedian','TractCode']]
loc.index = loc['nid']
loc = loc.drop('nid',axis=1)
features_with_loc = loc.join(features)
features_with_loc = features_with_loc.dropna(axis=0)
features_with_loc['trigramFreq'] = (features_with_loc['trigramFreq'] - features_with_loc['trigramFreq'].mean()) / float(features_with_loc['trigramFreq'].std())
#Group by tract. Keep only tracts with more than 20 poems.
tract_size = features_with_loc.groupby('TractCode').size()
tract_list = tract_size[tract_size>20]
#Subset of features: keep only poems in qualifying tracts
features_with_loc['take'] = features_with_loc['TractCode'].map(lambda x: x in tract_list)
features_with_loc_subset = features_with_loc[features_with_loc['take']==True]
#Group By TractCode. Take the median. 139 Tracts in general
features_by_tract = features_with_loc_subset.groupby('TractCode').aggregate('median')
features_by_tract = features_by_tract.drop(['take'],axis=1)
features_by_tract['AcsHouseholdIncomeMedian'] = features_by_tract['AcsHouseholdIncomeMedian'].map(lambda x: int(x))
trigram_by_tract = pd.concat([features_by_tract['trigramFreq'],np.log(features_by_tract['AcsHouseholdIncomeMedian'])],axis=1)
print 'Number of regions', len(trigram_by_tract)
Number of regions 136
def average_progress_chart(data):
figsize(20, 15)
fig = plt.figure()
sx = plt.subplot(1,1,1)
plt.setp(sx.get_yticklabels(), visible=False)
#Median for each poem number first
#print feature
y = data.values[:,0]
X = data['AcsHouseholdIncomeMedian'].values
X = sm.add_constant(X)
res = sm.OLS(y,X).fit()
betas_median = res.params[1]
#Plot the Progress
#Scatter
plt.scatter(X[:,1],y, alpha=0.9,marker=(6,0))
#Line
x_ = np.unique(X[:,1])
y_ = (res.params[0] + res.params[1]*x_)
#Fit the Line of the regression
plt.plot(x_,y_,color='red')
#plt.legend([column])
#xlim(0,cut)
plt.autoscale(tight=True)
#Linear Fit
plt.text(x_.min(), y.max(), 'beta of the slope is {:10.4f} with p-value of {:10.4f}'.format(res.params[1],res.f_pvalue), fontsize=20)
#Labels
sx.set_ylabel('Trigram Frequencies')
sx.set_xlabel('Income Levels')
sx.set_title('Trigram Frequencies by Income Levels')
print res.summary()
However, the explanatory power of the slope is not very high, with an R-squared of around 5%. Nonetheless, the downward trend in trigram frequency with higher income levels is visually observable.
average_progress_chart(trigram_by_tract)
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.053
Model:                            OLS   Adj. R-squared:                  0.046
Method:                 Least Squares   F-statistic:                     7.501
Date:                Thu, 24 Apr 2014   Prob (F-statistic):            0.00701
Time:                        12:24:40   Log-Likelihood:                 24.357
No. Observations:                 136   AIC:                            -44.71
Df Residuals:                     134   BIC:                            -38.89
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.5249      0.513      2.971      0.004         0.510     2.540
x1            -0.1303      0.048     -2.739      0.007        -0.224    -0.036
==============================================================================
Omnibus:                       31.445   Durbin-Watson:                   1.962
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               51.949
Skew:                          -1.102   Prob(JB):                     5.24e-12
Kurtosis:                       5.075   Cond. No.                         320.
==============================================================================
The last step is to test whether literacy skills of poets from high-income and low-income areas differ statistically; we again use the two-sample KS test. We use the median income level to split the data into two equal-sized groups.
def ab_test_income(data):
figsize(25, 10)
fig = plt.figure()
median_income = trigram_by_tract.AcsHouseholdIncomeMedian.median()
a = trigram_by_tract['trigramFreq'][trigram_by_tract.AcsHouseholdIncomeMedian<=median_income]
b = trigram_by_tract['trigramFreq'][trigram_by_tract.AcsHouseholdIncomeMedian>median_income]
lp = np.linspace(min(min(a),min(b)),max(max(a),max(b)),10)
ax = plt.subplot(211)
plt.hist(a,bins=lp,alpha=0.5,color="#467821")
plt.legend(['Lower Income'])
ax = plt.subplot(212)
plt.hist(b,bins=lp,alpha=0.5,color="#7A68A6")
plt.autoscale(tight=True)
plt.legend(['Higher Income'])
#two_sample = stats.ttest_ind(a, b)
#test_stat = stats.ranksums(a, b)
test_stat = stats.ks_2samp(a, b)
print 'The mean trigram frequency for poets from low-income neighborhoods is {0}'.format(a.mean())
print 'The mean trigram frequency for poets from high-income neighborhoods is {0}'.format(b.mean())
print 'The KS statistic is {0} and the p-value is {1}.'.format(test_stat[0],test_stat[1])
The chart below shows the trigram frequency distribution for each subgroup; a lower frequency indicates higher sophistication. We find weak evidence that poets from high-income areas have superior language skills as measured by trigram frequencies. There are two possible problems with this finding. First, ZIP-code / tract-level location might not be granular enough and comes with high within-region income variance, so the median income might not represent the income of a given poet well. Secondly, trigram frequency might not be the best metric of language sophistication; in future iterations we might need to design a more refined measure of literacy given a poem.
ab_test_income(trigram_by_tract)
The mean trigram frequency for poets from low-income neighborhoods is 0.149314053871
The mean trigram frequency for poets from high-income neighborhoods is 0.0903278625066
The KS statistic is 0.205882352941 and the p-value is 0.0974667982579.