#!/usr/bin/env python
# coding: utf-8

# This is the sixth in a series of notebooks designed to show you how to analyze social media data. For demonstration purposes we are looking at tweets sent by CSR-related Twitter accounts -- accounts related to ethics, equality, the environment, etc. -- of Fortune 200 firms in 2013. I assume you have already downloaded the data and have completed the steps taken in Chapter 1, Chapter 2, Chapter 3, Chapter 4, and Chapter 5. In this notebook I will show you how to run your descriptive statistics and then save them for output. The desired end product is a CSV table of key summary statistics -- count, mean, std. dev., min., and max. -- for the variables in your dataset.
#
# Also known as descriptive statistics, summary statistics are crucial for helping readers understand the nature of your data, especially for helping convey the range and dispersion of your data. A summary statistics table is mandatory in some journals and some disciplines whenever one is presenting statistical analyses of quantitative data. Below is an example of a summary statistics table from an article Chao Guo and I published last year on 150 nonprofit advocacy organizations' use of Twitter:
#
# - Guo, C., & Saxton, G. D. (2014). Tweeting social change: How social media are changing nonprofit advocacy. Nonprofit & Voluntary Sector Quarterly, 43, 57-79.
#
# As you can see, the table shows the count, mean, std. dev., and minimum and maximum values for each quantitative variable in our analyses.

# In[59]:

from IPython.display import Image
Image(width=750, filename='Descriptive Statistics Table.png')

#
# # Chapter 6: Producing a Summary Statistics Table

# As usual, we will first import several necessary Python packages and set some options for viewing the data. As with prior chapters, we will be using the Python Data Analysis Library, or PANDAS, extensively for our data manipulations.

# ### Import packages and set viewing options

# In[2]:

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series


# In[3]:

#Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_columns', None)


# I'm using version 0.16.2 of PANDAS

# In[5]:

pd.__version__

#
# I like suppressing scientific notation in my numbers. So, if you'd rather see "0.48" than "4.800000e-01", then run the following line. Note that this changes only how numbers are displayed, not the actual values. For outputting to CSV we'll have to run some additional code later on.

# In[17]:

pd.set_option('display.float_format', lambda x: '%.2f' % x)


# ### Read in dataframe

# In Chapter 4 we created a version of the dataframe that omitted all tweets that were retweets, allowing us to focus only on original messages sent by the 41 Twitter accounts. In Chapter 5 we then added 6 new variables to this dataset. Let's now open this saved file. As we can see in the operations below, this dataframe contains 60 variables for 26,257 tweets.

# In[32]:

df = pd.read_pickle('Original 2013 CSR Tweets with 3 binary variables.pkl')
print "# of variables in dataframe:", len(df.columns)
print "# of tweets in dataframe:", len(df)
df.head(2)

#
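# As an aside on the display option set before reading in the data, here is a minimal sketch showing that `display.float_format` affects only how numbers are printed. The values in `toy` are made up for illustration, and I use `pd.option_context` so the setting applies only inside the `with` block.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CSR tweet dataset)
toy = pd.Series([0.481234, 12345.6789])

# option_context applies the display setting only inside the `with` block
with pd.option_context('display.float_format', lambda x: '%.2f' % x):
    shown = repr(toy)

print(shown)               # numbers are displayed with two decimals

# The stored value keeps its full precision
stored = toy.iloc[0]
print(stored == 0.481234)
```

# The printed series shows two decimals, while the comparison on the stored value still holds at full precision.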
#
# List all the columns in the DataFrame

# In[8]:

df.columns


# ### Create Sub-Set of DataFrame with only Desired Variables

# You might not want to include all of your variables in the summary statistics table. When you're dealing with a dataset with a lot of columns, I find the easiest way is to output the column names to a list, copy and paste the output into another cell, then delete the columns you don't want.

# In[33]:

print df.columns.tolist()

#
# I've copied and pasted the above output into the cell below and kept only a subset of the columns. Note that the output above is an ordinary Python list, in single square brackets, while the line below uses double square brackets. In PANDAS, the outer pair of brackets indexes the dataframe and the inner pair is a list of column names; passing a list of names returns a dataframe containing just those columns. In the following line I am thus saying I want my dataframe `df` to be limited to the columns listed on the right-hand side of the assignment.

# In[34]:

df = df[['content','from_user_screen_name','from_user_followers_count','from_user_listed_count',
         'from_user_statuses_count','retweet_count','favorite_count','entities_urls_count',
         'entities_hashtags_count','entities_mentions_count', 'num_characters','Company',
         'English','RTs_binary','favorites_binary','hashtags_binary','mentions_binary',
         'URLs_binary']]
print "# of variables in dataframe:", len(df.columns)
print "# of tweets in dataframe:", len(df)
df.head(2)

#
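# To make the bracket distinction concrete, here is a small sketch on a toy dataframe. The column names are borrowed from our dataset, but the values are made up for illustration.

```python
import pandas as pd

# Toy dataframe (hypothetical values, not the CSR tweet data)
toy = pd.DataFrame({'retweet_count': [0, 3, 1],
                    'favorite_count': [1, 0, 2],
                    'Company': ['A', 'B', 'C']})

# A single column label inside the brackets returns a Series
one_col = toy['retweet_count']

# A list of labels inside the brackets returns a DataFrame
sub_df = toy[['retweet_count', 'favorite_count']]

print(type(one_col).__name__)
print(type(sub_df).__name__)
print(sub_df.columns.tolist())
```

# Even a one-item list, as in `toy[['retweet_count']]`, gives back a dataframe rather than a series.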
#
# As you can see above, we now have a dataframe with only 18 variables.

# ### Generate Summary Statistics

# The `describe` method is the basic way to produce summary statistics for all the variables in your dataframe.

# In[35]:

df.describe()

#
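#
# If you'd like to see the help for the describe function, use the question mark. In the notebook that is `DataFrame.describe?`; the line below is how that command appears once the notebook is exported as a script.

# In[49]:

get_ipython().run_line_magic('pinfo', 'DataFrame.describe')

#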
#
# Use the dir function to get an alphabetical listing of valid names (attributes) in an object.

# In[38]:

print dir(df.describe())

#
#
# CHANGE TO TWO DECIMALS (n.b. - This step is not necessary if you have run the display.float_format command earlier)

# In[39]:

np.round(df.describe(), 2)

#
#
# NOW LET'S TRANSPOSE THE OUTPUT -- necessary for a more typical social scientific presentation of the data. Note how only 15 variables are shown. These are our numerical variables. The categorical variables `content`, `from_user_screen_name`, and `Company` are not shown.

# In[62]:

np.round(df.describe(), 2).T

#ALTERNATIVE WAY OF WRITING ABOVE
#np.round(df.describe(), 2).transpose()

#
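# Here is a minimal sketch of the same two behaviors on a toy dataframe with made-up values: `describe` skips the text column by default, and `.T` flips the table so that each variable becomes a row.

```python
import pandas as pd

# Toy dataframe (hypothetical values): one text column, two numeric columns
toy = pd.DataFrame({'content': ['tweet one', 'tweet two', 'tweet three'],
                    'retweet_count': [0, 3, 1],
                    'RTs_binary': [0, 1, 1]})

stats = toy.describe()   # statistics as rows, variables as columns
stats_t = stats.T        # variables as rows, statistics as columns

print(stats.columns.tolist())     # the text column `content` is left out
print(stats_t.columns.tolist())   # count, mean, std, min, percentiles, max
```

# The transposed layout, with one row per variable, is the one most journals expect.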
#
# We won't typically want the percentile columns in a social scientific publication. Note that passing 'percentiles=None' to the describe command does not omit them: `None` is the default and gives the 25th, 50th, and 75th percentiles. In version 0.16 as well as earlier versions of PANDAS we can instead select only those columns we want, then output to CSV.

# In[41]:

np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']]

#
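# A quick check on a toy dataframe (made-up values) of how `percentiles=None` behaves, alongside the column-selection approach. The name `wanted` is just a hypothetical helper for the list of statistics to keep.

```python
import pandas as pd

# Toy dataframe (hypothetical values, not the CSR tweet data)
toy = pd.DataFrame({'retweet_count': [0, 3, 1, 8],
                    'RTs_binary': [0, 1, 1, 1]})

# percentiles=None is the default, so the percentile rows are still there
default_rows = toy.describe().index.tolist()
none_rows = toy.describe(percentiles=None).index.tolist()
print(default_rows == none_rows)

# Selecting columns after transposing keeps only the statistics we want
wanted = ['count', 'mean', 'std', 'min', 'max']
table = toy.describe().T[wanted]
print(table.columns.tolist())
```

# Selecting by name has the side benefit of fixing the column order in the final table.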
# ### Save the Output of the Table as a CSV File

# Once you get more comfortable with Python and PANDAS you can combine your commands. For instance, we can simultaneously run our summary statistics and output the results to a CSV file.

# In[43]:

#WITH FULL PRECISION (DEFAULT)
df.describe().transpose().to_csv('summary stats.csv', sep=',')

#
# For a typical social scientific publication, we would not need the percentile columns. We can instead select only those columns we want, then output to CSV.

# In[51]:

df.describe().transpose()[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')

#
# The problem with the above output is that more than 2 decimal places are showing. If you want only two, then run the following version.

# In[52]:

#WITH TWO DECIMAL PLACES
np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats.csv', sep=',')

#
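# An alternative to rounding first, sketched here on a toy dataframe with made-up values, is the `float_format` argument of `to_csv`, which formats the numbers as they are written. Calling `to_csv` without a file name returns the CSV text, which makes it easy to inspect.

```python
import pandas as pd

# Toy dataframe (hypothetical values, not the CSR tweet data)
toy = pd.DataFrame({'retweet_count': [0, 3, 1, 8],
                    'RTs_binary': [0, 1, 1, 1]})

table = toy.describe().T[['count', 'mean', 'std', 'min', 'max']]

# With no file path, to_csv returns the CSV text instead of writing a file
csv_text = table.to_csv(float_format='%.2f')
print(csv_text)
```

# One trade-off of this approach: every float gets two decimals, so the counts are written as "4.00" rather than "4".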
#
# Now you have a CSV file containing the columns you'll need for a typical Summary Statistics or Descriptive Statistics table for a submission to a social science journal. You likely won't want all of the variables in the final table, so I would probably open up the CSV file in Excel, delete unwanted variables, then copy and paste into Word. At that point you just need some formatting for aesthetics. If you do want to select which specific variables to include, you can specify the columns like this.

# In[54]:

cols = ['retweet_count','RTs_binary']
np.round(df[cols].describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_csv('summary stats (partial).csv', sep=',')


# ### Outputting to LaTeX

# In some disciplines (e.g., Political Science, Engineering, Computer Science, Accounting, Finance, Economics) it is common to use LaTeX rather than Word. PANDAS has excellent LaTeX capabilities. For instance, the first of the following three commands shows how to output to a `.tex` file rather than CSV, while the second shows what the LaTeX code looks like. The third imports an image of what the table looks like once it's rendered in TeXShop.

# In[64]:

np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_latex('summary stats.tex')


# In[67]:

print np.round(df.describe(), 2).T[['count','mean', 'std', 'min', 'max']].to_latex()


# In[70]:

Image(width=600, filename='Descriptive Statistics Table (LaTeX).png')

#
# In this tutorial we have covered how to generate a summary statistics table in preparation for further analyses and for submitting your work to scholarly outlets. In the following tutorials I'll introduce you to how to analyze audience reaction to the companies' tweets as well as how to test your hypotheses using logistic regression.
#
# For more Notebooks as well as additional Python and Big Data tutorials, please visit http://social-metrics.org or follow me on Twitter @gregorysaxton