#!/usr/bin/env python
# coding: utf-8

# # Loading Stats SA ASCII-formatted time series data
#
# ## Introduction
#
# Statistics South Africa (Stats SA) is the government statistician here in South Africa. They publish stats about the economy, standard of living, the performance of local government, and lots of other stuff. If you're interested in finding out the state of South Africa in cold hard numbers, head over to [http://www.statssa.gov.za/](http://www.statssa.gov.za/).
#
# Most of their data is, unfortunately, locked up in reports that can only be downloaded as PDFs, which makes it hard to do one's own analysis. Sometimes they do make time series data available, though. When time series data is available for download, it's provided in two formats: Excel workbooks and ASCII files. Working with Excel is easy, but I feel like a challenge and I'd rather avoid Excel. That's why this notebook focuses on writing code to parse the ASCII files made available by Stats SA. I'll be putting this code to use in subsequent notebooks to make some interesting analyses of Stats SA data.
#
# This notebook and related data can be downloaded from my repository for Stats SA-related blog posts: [StatsSA-blog](https://bitbucket.org/williamjshipman/statssa-blog)

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')


# In[2]:


import pandas as pd
from os.path import expanduser
from datetime import date
import seaborn as sns


# ## The first stage - ASCII to a data structure
#
# My first step is to transform the text file into a list of time series structures. Each time series consists of a sequence of meta-data lines followed by the actual values, one per line.
#
# Each meta-data line starts with a tag of the form Hxx (e.g. H01, H13), followed by a colon and some text. The actual data follows the meta-data, with each value in the time series printed on a new line.
# The end of the time series is indicated either by the end of the text file or by the appearance of a new meta-data line.

# In[3]:


all_time_series = []
with open(expanduser('./data/CPI/ASCII Consumer Price Index - Jan 2008 to Nov 2015.txt'), 'r') as inputfile:
    for inputline in inputfile:
        if inputline.startswith('H'):
            # This is a meta-data field of the form "Hxx: some text".
            tag, sep, value = inputline.partition(': ')
            if tag == 'H01':
                # H01 is the first tag of every series, so it starts a new entry.
                all_time_series.append({'data': [], 'tags': {}})
            all_time_series[-1]['tags'][tag] = value.strip()
        else:
            # Anything else is a data value belonging to the current series.
            all_time_series[-1]['data'].append(float(inputline))


# In[4]:


len(all_time_series)


# In[5]:


all_time_series[0]['tags']


# The tag H01 gives the publication number, while H02 gives the title associated with that publication. In this case, I'm looking at the Consumer Price Index in South Africa. The data doesn't just give the overall CPI, though. It also gives the CPI for different baskets of goods in each province. The basket of goods is indicated by the H04 tag, e.g. 'All items', 'Clothing' or 'Tobacco' are some of the baskets for which CPI is calculated. The H13 tag tells us which province the time series applies to, or whether it applies to the whole of South Africa.

# ## Making Pandas Series objects
#
# Now that each time series has been extracted, the next step is to construct one Pandas Series object per time series. The Stats SA data specifies the start date of each time series (tag H24) and its update frequency (tag H25), e.g. monthly. These need to be turned into a `DatetimeIndex`, built with `pd.date_range`, that can be used as the index of the Series.

# In[6]:


for tsdata in all_time_series:
    if tsdata['tags']['H25'].strip() == 'monthly':
        # H24 holds three space-separated fields; the last two are the start year and month.
        _, start_year, start_month = tsdata['tags']['H24'].split(' ')
        # 'MS' labels each value with the first day of its month. The original
        # 'M' (month-end) alias is deprecated in recent pandas releases.
        index = pd.date_range(start=date(int(start_year), int(start_month), 1),
                              freq='MS', periods=len(tsdata['data']))
        tsdata['series'] = pd.Series(
            tsdata['data'], index=index,
            name='{code:s} {quant:s} ({province:s}) - {detail:s}'.format(
                code=tsdata['tags']['H01'], quant=tsdata['tags']['H02'],
                detail=tsdata['tags']['H04'], province=tsdata['tags']['H13']))


# ## Plotting the data
#
# Plotting the data shows that some of the indices have actually suffered deflation since 2008, but I'll look at that in another post. You'll also see that all the series converge at December 2012. That is because I'm plotting Consumer Price Index time series. These indices are all relative to December 2012, which was assigned the value 100. This means that if the index for one of the time series were at 300 in January 2008, that basket of goods would have cost three times its December 2012 price.

# In[7]:


# Newer seaborn releases no longer expose `sns.plt`, so use matplotlib directly.
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
for tsdata in all_time_series:
    tsdata['series'].plot()
plt.show()


# ## Putting it into a DataFrame
#
# All of these time series have the same date range and frequency, i.e. monthly. Therefore, they can be combined into a single DataFrame. Here I'm going to make each time series into one column of the DataFrame. The column index will be hierarchical, using the province and detail, i.e. the H13 and H04 tags.

# In[8]:


df_data = {(tsdata['tags']['H13'], tsdata['tags']['H04']): tsdata['series']
           for tsdata in all_time_series}
df = pd.DataFrame(data=df_data)
df


# ## What's next
#
# Now that I can load the CPI data, I intend to do some more analysis in a future blog post. I'm also going to turn the code here into a decent function (or two) that will give me all the series and the DataFrame. I still need to handle frequencies other than monthly when loading the time series, e.g. quarterly and annually.

# In[ ]:
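

# One possible shape for the function mentioned under "What's next", as a
# minimal sketch rather than a finished implementation. Assumptions that do
# NOT come from the CPI file: the H25 strings 'quarterly' and 'annually', the
# idea that the third H24 field is a month number for every frequency, the
# names parse_statssa_ascii / to_series / FREQ_OFFSETS, and the invented
# SAMPLE text. Only 'monthly' and the three-field H24 layout are taken from
# the cells above.
import io
from datetime import date

import pandas as pd

# ASSUMPTION: hypothetical H25 values mapped to the step between observations.
FREQ_OFFSETS = {
    'monthly': pd.DateOffset(months=1),
    'quarterly': pd.DateOffset(months=3),
    'annually': pd.DateOffset(years=1),
}


def parse_statssa_ascii(lines):
    """Split an iterable of text lines into a list of {'tags', 'data'} dicts."""
    all_series = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith('H'):
            tag, _, value = line.partition(':')
            if tag == 'H01':
                # H01 opens a new time series.
                all_series.append({'tags': {}, 'data': []})
            all_series[-1]['tags'][tag] = value.strip()
        else:
            all_series[-1]['data'].append(float(line))
    return all_series


def to_series(ts):
    """Build a pandas Series for one parsed time series."""
    _, year, month = ts['tags']['H24'].split(' ')
    # Raises KeyError for a frequency that is not in FREQ_OFFSETS.
    offset = FREQ_OFFSETS[ts['tags']['H25'].strip().lower()]
    index = pd.date_range(start=date(int(year), int(month), 1),
                          periods=len(ts['data']), freq=offset)
    name = '{province:s} - {detail:s}'.format(
        province=ts['tags']['H13'], detail=ts['tags']['H04'])
    return pd.Series(ts['data'], index=index, name=name)


# Invented sample text, only to exercise the two functions.
SAMPLE = """\
H01: P0000
H04: All items
H13: South Africa
H24: X 2008 01
H25: monthly
100.0
101.5
103.0
H01: P0000
H04: All items
H13: Gauteng
H24: X 2008 01
H25: quarterly
50.0
51.0
"""

example_series = [to_series(ts) for ts in parse_statssa_ascii(io.StringIO(SAMPLE))]

# With the real file, the same two calls would take the open file object in
# place of io.StringIO(SAMPLE). Using a DateOffset instead of an alias string
# keeps the first index entry on the stated start date whatever the start
# month is.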