#!/usr/bin/env python # coding: utf-8 # This is the seventh in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. I assume you have already downloaded the data and have completed the steps taken in Chapter 1, Chapter 2, Chapter 3, Chapter 4, Chapter 5, and Chapter 6. In this notebook I will take you through several basic analyses of audience reaction to the companies' tweets. Specifically, we'll look at the number of favorites and retweets received by each message. # # Chapter 7: Analyzing Audience Reaction # Before diving into the code, it might help to provide some background on what we are trying to do. Briefly put, social media platforms are superb technology -- from the researcher's standpoint -- precisely because they enable us to link organizations' actions (their messages) to audience reactions. Richard Waters and I make this argument in a recent article on nonprofit organizations' use of Facebook: # # - Saxton, G. D., & Waters, R. D. (2014). What do stakeholders ‘like’ on Facebook? Examining public reactions to nonprofit organizations’ informational, promotional, and community-building messages. Journal of Public Relations Research, 26, 280-299. # # We specifically argue (p. 281) that: # # - ...what has not been previously examined is the public’s online response to organizational relationship-building communication efforts. As a result, although the literature strongly implies that stakeholders want, for instance, more interactive, dialogic communication, such assertions are largely untested outside the lab, primarily because web sites, which have been the primary data source in new media studies, do not allow for the easy gathering of stakeholder-response data....The rise of social media applications, however, presents a tremendous research opportunity for observing how the public responds to organizational engagement efforts. Social media grants scholars and practitioners alike the ability to examine both organizations’ dynamic communicative actions — particularly the sending of discrete messages — and the public’s reactions to those messages. Social media platforms provide the ability to observe the near real-time relationship between organizational actions and public reactions directly. This observation provides access to standardized data on organizational relationship-building actions and offers the ability to examine the effectiveness of organizations’ online stakeholder communications. # # In our Facebook study we looked at the number of `likes`, `comments`, and `shares` of each message. Twitter has three analogous measures: `favorites`, `replies`, and `retweets`. Gathering data on the number of replies is possible, but it requires additional steps. Consequently, in this tutorial I'll concentrate on the two easier measures: `retweets` and `favorites`. Here's an example from one of my tweets. # In[25]: from IPython.display import Image Image(width=600, filename='infographic.png') # As we can see, the tweet received a moderate amount of audience reaction: two `retweets` and three `favorites`. Those `retweets` and `favorites` are solid indicators of audience reaction and an excellent vehicle for examining the impact of each message.
Fortunately, using the code shown in other tutorials, these data have already been downloaded and included in our tweet-level dataset in the columns `retweet_count` and `favorite_count`. Our attention will thus be on these two variables. # #
# ## Import packages and set viewing options # As usual, we will first import several necessary Python packages and set some options for viewing the data. As with prior chapters, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations. # In[26]: import numpy as np import pandas as pd from pandas import DataFrame from pandas import Series # In[27]: #Set PANDAS to show all columns in DataFrame pd.set_option('display.max_columns', None) # I'm using version 0.16.2 of PANDAS # In[28]: pd.__version__ # #### Import graphing packages # We'll be producing some figures in this tutorial, so we need to import various graphing capabilities. The default Matplotlib library is solid. # In[29]: import matplotlib print matplotlib.__version__ # In[30]: import matplotlib.pyplot as plt # In[31]: #NECESSARY FOR XTICKS OPTION, ETC. from pylab import * # One of the great innovations of the IPython Notebook is the ability to see output and graphics "inline," that is, on the same page and immediately below each line of code. To enable this feature for graphics we run the following line. # In[32]: get_ipython().run_line_magic('matplotlib', 'inline') # We will be using Seaborn to help pretty up the default Matplotlib graphics. Seaborn does not come installed with Anaconda Python, so you will have to open up a terminal and run `pip install seaborn`. # In[33]: import seaborn as sns print sns.__version__ #
The following line sets a larger default plot size. # In[34]: plt.rcParams['figure.figsize'] = (15, 5) # ## Read in data # In Chapter 4 we created a version of the dataframe that omitted all tweets that were retweets, allowing us to focus only on original messages sent by the 41 Twitter accounts. In Chapter 5 we then added 6 new variables to this dataset. Let's now open this saved file. As we can see in the operations below, this dataframe contains 60 variables for 26,257 tweets. # In[184]: df = pd.read_pickle('Original 2013 CSR Tweets with 3 binary variables.pkl') print "# of variables in dataframe:", len(df.columns) print "# of tweets in dataframe:", len(df) df.head(2) # ## Describing Audience Reaction # Let's first describe the data for our two columns of interest. I like to suppress scientific notation in my numbers, so if you'd rather see "0.48" than "4.800000e-01", run the following line. Note that this does not change the actual values. # In[36]: pd.set_option('display.float_format', lambda x: '%.2f' % x) # In[38]: df[['retweet_count','favorite_count']].describe().T #np.round(df[['retweet_count','favorite_count']].describe(), 2).T ##ALTERNATIVE CODE IF NOT SETTING THE FLOAT FORMAT OPTION ABOVE #
We see that, on average, the `26,257` original tweets sent by the 41 accounts in 2013 received 3.83 retweets and 1.52 favorites. # ## Correlation between Retweets and Favorites # It is worth examining the extent to which retweeting and favoriting activity are related. If the relationship is strong enough, then we might choose to simplify subsequent statistical analyses and use only one of the variables as a proxy for audience engagement. # #### Pearson Correlation # First, let's run a Pearson correlation. # In[62]: df['retweet_count'].corr(df['favorite_count']) # This indicates a very strong positive correlation between `retweet_count` and `favorite_count`. But let's explore a few other ways of looking at the relationship while I show you a few other PANDAS tools. (Given how skewed both counts are, a rank-based check is also sketched after the scatter plot below.) # #### Cross-tabs # We can also run a cross-tabulation between the binary `(0,1)` variables `favorites_binary` and `RTs_binary`, which we created in Chapter 6. # In[63]: pd.crosstab(df['favorites_binary'], df['RTs_binary']) # We see that the large majority of tweets that are never retweeted are also never favorited, while most of those that are favorited are also retweeted. The converse does not hold, however: many retweeted tweets are never favorited, partly because favoriting is, on average, less frequent than retweeting. Overall, it is far from a perfect relationship. # To make it easier to see proportions, we can work with percentages instead. First we can show row percentages. # In[66]: from __future__ import division #To make sure PANDAS always returns a float pd.crosstab(df['favorites_binary'], df['RTs_binary']).apply(lambda r: r/r.sum(), axis=1) # We can also show column percentages. # In[58]: pd.crosstab(df['favorites_binary'], df['RTs_binary']).apply(lambda r: r/r.sum(), axis=0) # And even total percentages. # In[50]: pd.crosstab(df['favorites_binary'], df['RTs_binary']).apply(lambda r: r/len(df), axis=1) # #### Visualizing the relationship # We can also examine the relationship between the two variables visually. Here is a scatter plot. It certainly looks like a positive relationship. # In[67]: df.plot(kind='scatter', x='favorite_count', y='retweet_count') #
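# Because both counts are so heavily skewed, a rank-based correlation makes for a reasonable extra check. The same PANDAS `corr` method can compute a Spearman correlation, which is less sensitive than Pearson's r to the handful of heavily retweeted (and favorited) outliers. This is an optional sketch rather than a required step.

# In[ ]:

#OPTIONAL CHECK: Spearman rank correlation between retweets and favorites
df['retweet_count'].corr(df['favorite_count'], method='spearman')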
Recall that we are using the `Seaborn` package along with `matplotlib` to improve our graphics. As described in more detail here, Seaborn has some great tools for quickly visualizing the relationship between two variables. I'll show you one of them, `regplot`. This will run a scatter plot between the two variables, same as above, while also adding a regression model fit line with confidence intervals. # In[95]: sns.regplot(x="favorite_count", y="retweet_count", data=df) #
As usual, run the following for help. # In[68]: get_ipython().run_line_magic('pinfo', 'sns.lmplot') # We want to fix the x-axis and y-axis limits so there are no negative numbers. # In[96]: sns.regplot(x="favorite_count", y="retweet_count", data=df).set(xlim=(0, 1400), ylim=(0, 5000)) #LONG WAY OF WRITING ABOVE IF YOU WANT TO BREAK UP CODE #g = sns.regplot(x="favorite_count", y="retweet_count", data=df) #g.set(xlim=(0, 1400), ylim=(0, 5000)) #
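# Because both counts are heavily right-skewed, it can also help to view the relationship on a log scale. Here is a quick, optional sketch that uses NumPy's `log1p` function (the log of 1 + x), so the many tweets with zero retweets or favorites are kept in the plot; the column names `log_favorites` and `log_retweets` are just illustrative labels.

# In[ ]:

#OPTIONAL SKETCH: scatter plot with regression fit on log-transformed counts
#np.log1p(x) computes log(1 + x), keeping zero-count tweets in the plot
logged = pd.DataFrame({'log_favorites': np.log1p(df['favorite_count']),
                       'log_retweets': np.log1p(df['retweet_count'])})
sns.regplot(x='log_favorites', y='log_retweets', data=logged)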
What we see again is a solid positive relationship between `retweet_count` and `favorite_count`. Taking all the evidence into account, it is a judgment call whether you'd want to use only one of these variables as your proxy for overall audience engagement. It would depend on the context of your research question and your particular data. For ease of presentation, let's assume you are interested only in what factors are associated with the diffusion of the organizations' messages. In this case, you'd be on solid ground limiting your investigation to retweets. Accordingly, for the remaining analyses I'll focus on exploring `retweet_count` in greater depth. # ## Digging Deeper into Retweet Patterns # Let's take a look at the frequencies for `retweet_count`. # In[142]: df.retweet_count.value_counts() #
Above we see that `11,923` tweets receive `zero` retweets. Let's run the frequencies again (showing only the first 10 values), this time with percentages. # In[110]: df.retweet_count.value_counts(normalize=True)[:10] #
45.41% of original tweets do not receive a single RT. If you want to run this manually you can do the following: # In[143]: 11923/len(df['retweet_count']) # or # In[144]: len(df[df['retweet_count'] < 1])/len(df['retweet_count']) # or # In[145]: sum(df['retweet_count'] < 1)/len(df['retweet_count']) #
Another `5,583` tweets receive a single retweet. # In[148]: len(df[df['retweet_count'] == 1]) #sum(df['retweet_count'] == 1) #SLIGHTLY SHORTER WAY # Thus almost exactly two-thirds of the tweets receive one retweet or fewer. # In[157]: len(df[df['retweet_count']<2])/len(df['retweet_count']) # Recall that in Chapter 4 we found evidence that hashtag use did not follow a `normal` distribution but rather a `power law` distribution. A quick look at the frequency tables above suggests that the distribution of retweets also likely approximates a power law distribution. There will thus be many tweets that are retweeted infrequently and a relatively small number that are retweeted a lot. Before plotting the distribution, let's dig a bit deeper into the more successful tweets. # The average # of RTs for tweets that get retweeted at least once is `7.02`. # In[155]: df[df['retweet_count'] > 0].retweet_count.describe() #
The average # of RTs for original tweets with a greater than average # of RTs is `20.48` # In[245]: df[df['retweet_count'] > df['retweet_count'].mean()].retweet_count.describe() #
Only `4,099` tweets (or `15.6%`) receive more retweets than average. Definitely not a `normal distribution`. # In[172]: print sum(df['retweet_count'] > df['retweet_count'].mean()) print sum(df['retweet_count'] > df['retweet_count'].mean())/len(df['retweet_count']) #
Only `2,322` tweets (or `8.8%`) receive more than 5 retweets. # In[174]: print sum(df['retweet_count'] > 5) print sum(df['retweet_count'] > 5)/len(df['retweet_count']) #
Only `981` tweets (or `3.7%`) receive more than 10 retweets. # In[175]: print sum(df['retweet_count'] > 10) print sum(df['retweet_count'] > 10)/len(df['retweet_count']) #
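# If you'd like to compute several of these threshold shares in one pass, a small loop such as the following will do it. The cutoff values below are arbitrary illustrations; substitute whichever thresholds matter for your research question.

# In[ ]:

#OPTIONAL SKETCH: number and share of tweets exceeding several retweet thresholds
#The threshold values are arbitrary choices for illustration
for threshold in [0, 1, 5, 10, 25, 50, 100]:
    n = sum(df['retweet_count'] > threshold)
    print threshold, n, round(n / float(len(df)), 3)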
# ## Graphing `retweet_count` # First, let's see what the default plot gives us. # In[185]: df['retweet_count'].plot(kind='line') #
That plot is somewhat helpful, but not very. Let's sort the dataframe. For help, use the question mark (`?`). # In[186]: get_ipython().run_line_magic('pinfo', 'DataFrame.sort') # In[187]: df = df.sort(['retweet_count'], ascending=False) # In[188]: df[['retweet_count']].head() #
Now that it's sorted, let's plot it again. We'll omit the tick labels because adding `26,257` labels would be unwieldy. :) # In[182]: bar_plot = df['retweet_count'].plot(kind='bar') bar_plot.set_xticklabels('') bar_plot.set_xlabel('') yticks(fontsize = 8) #
Again, this plot is helpful -- more helpful than the prior version -- but it is still far from ideal. The problem is that the distribution is so skewed and covers such a broad range (from `0` to `3,719` retweets) that everything looks squished. So let's split the data up in order to see the distribution more clearly. I've chosen to separate the tweets with more than 50 retweets (`n=188`) from those with 50 or fewer retweets (`n=26,069`). # In[194]: df50 = df[df['retweet_count'] < 51] df50 = df50.sort(columns='retweet_count', axis=0, ascending=False) print len(df50) # In[195]: df50plus = df[df['retweet_count'] > 50] df50plus = df50plus.sort(columns='retweet_count', axis=0, ascending=False) print len(df50plus) #
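# As a quick, optional sanity check, the two subsets should partition the full dataframe, since every tweet has either 50 or fewer retweets or more than 50.

# In[ ]:

#OPTIONAL SANITY CHECK: df50 and df50plus together should account for every tweet
print len(df50) + len(df50plus) == len(df)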
Now let's plot the two respective bar graphs. # In[197]: bar_plot = df50['retweet_count'].plot(kind='bar') bar_plot.set_xticklabels('') bar_plot.set_xlabel('') yticks(fontsize = 8) # In[198]: bar_plot = df50plus['retweet_count'].plot(kind='bar') bar_plot.set_xticklabels('') bar_plot.set_xlabel('') yticks(fontsize = 8) #
Neither bar graph is very pretty, but we only wanted to get a quick look at the distribution. Both plots display the same general shape, with a quick drop-off from the 'winners' on the left-hand side to the relative 'losers' on the right-hand side. To get a clear look at the distribution, though, we should turn to histograms. # In[204]: pd.set_option('display.mpl_style', 'default') # MPL style plt.rcParams['figure.figsize'] = (15, 5) # In[214]: density_plot = df50.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 26069 least RT-ed-mpl.png', bbox_inches='tight', dpi=300, format='png') # In[208]: density_plot = df50plus.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 188 most RT-ed-mpl.png', bbox_inches='tight', dpi=300, format='png') #
Both histograms show the markedly non-normal nature of the distribution of the data. Whether we look at the `188` most heavily retweeted tweets (those with more than 50 retweets) or the `26,069` less heavily retweeted messages, the pattern is the same. If you want more evidence, let's split the data at 100 retweets instead and re-plot. # In[229]: df100 = df[df['retweet_count'] < 101] df100 = df100.sort(columns='retweet_count', axis=0, ascending=False) print len(df100) # In[230]: df100plus = df[df['retweet_count'] > 100] df100plus = df100plus.sort(columns='retweet_count', ascending=False, axis=0) print len(df100plus) #
Let's tweak the plots by adding a line indicating the mean value within each group. # In[231]: density_plot = df100.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) plt.axvline(df100['retweet_count'].mean(), color='r', linestyle='dashed', linewidth=1) savefig('histogram - 26138 least RT-ed-mpl.png', bbox_inches='tight', dpi=300, format='png') # In[232]: density_plot = df100plus.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) plt.axvline(df100plus['retweet_count'].mean(), color='r', linestyle='dashed', linewidth=1) savefig('histogram - 119 most RT-ed-mpl.png', bbox_inches='tight', dpi=300, format='png') # ### Versions of Histograms with Different Matplotlib Styles # Let's end by running our histograms in a few other styles. First we import `matplotlib` under the `mpl` alias used below and list the available styles. # In[243]: import matplotlib as mpl mpl.style.available # #### ggplot style # In[239]: mpl.style.use('ggplot') density_plot = df50.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 26069 least RT-ed - ggplot style.png', bbox_inches='tight', dpi=300, format='png') # In[240]: density_plot = df50plus.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 188 most RT-ed - ggplot style.png', bbox_inches='tight', dpi=300, format='png') # #### 538 style # In[241]: mpl.style.use('fivethirtyeight') density_plot = df50.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 26069 least RT-ed - 538.png', bbox_inches='tight', dpi=300, format='png') # In[242]: density_plot = df50plus.retweet_count.hist(bins=50) density_plot.set_ylabel('Count', labelpad=15) density_plot.set_xlabel('Number of Retweets', labelpad=15) savefig('histogram - 188 most RT-ed - 538.png', bbox_inches='tight', dpi=300, format='png') #
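# As one final, optional check on the power-law-like shape noted earlier, we can plot how many tweets received each retweet count on log-log axes. A roughly straight, downward-sloping band of points is consistent with -- though not proof of -- a power-law distribution; this is a rough visual sketch, not a formal test.

# In[ ]:

#OPTIONAL SKETCH: frequency of each retweet count on log-log axes
rt_freq = df['retweet_count'].value_counts().sort_index()
rt_freq = rt_freq[rt_freq.index > 0]   #zero retweets cannot be shown on a log scale
loglog_plot = rt_freq.plot(loglog=True, style='o')
loglog_plot.set_xlabel('Number of Retweets', labelpad=15)
loglog_plot.set_ylabel('Number of Tweets', labelpad=15)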
In this tutorial we have taken an in-depth look at two measures of audience reaction -- `retweet_count` and `favorite_count`. Such variables are excellent choices for the dependent variable in your statistical analyses. In the next tutorial I'll provide an overview of how to test your hypotheses using logistic regression. # # For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton