This notebook is an example of how to use the Python package pyjstat, a library for handling the JSON-stat format with pandas DataFrame structures.
To keep it as simple, fast and visually rich as possible, I'll use Vincent, a package that brings JavaScript visualization capabilities to Python.
Let's start by getting some example data. At the moment, the JSON-stat format seems to be used mostly by Statistics Norway, so we'll take some datasets from their catalog and make some plots.
For this example (a choropleth map), I'll use the same dataset as Xavier Badosa did in his example. I have to say that I first chose it by chance; then I realized that it makes it convenient to compare both ways of representing the data.
Let's grab the Norway JSON-stat bundle no. 1108 using urllib2 and transform it into a pandas dataframe using pyjstat:
from pyjstat import pyjstat
from collections import OrderedDict
import urllib2
import json
dataset_url_1 = 'http://data.ssb.no/api/v0/dataset/1108.json?lang=en'
# Important: JSON data must be retrieved in order;
# this can be accomplished using the object_pairs_hook parameter
# of the json.load method.
population_json_data = json.load(urllib2.urlopen(dataset_url_1),
                                 object_pairs_hook=OrderedDict)
population_results = pyjstat.from_json_stat(population_json_data, naming="id")
# Get the first result, since we're using only one input dataset
population_dataset = population_results[0]
# Filter the variable to plot
population_data = population_dataset[population_dataset['ContentsCode'] ==
                                     'Folketallet11']
population_data.head()
| | Region | ContentsCode | Tid | value |
|---|---|---|---|---|
| 10 | 101 | Folketallet11 | 2014K1 | 30190 |
| 21 | 104 | Folketallet11 | 2014K1 | 31351 |
| 32 | 105 | Folketallet11 | 2014K1 | 53994 |
| 43 | 106 | Folketallet11 | 2014K1 | 77695 |
| 54 | 111 | Folketallet11 | 2014K1 | 4384 |
5 rows × 4 columns
Note that I've called the from_json_stat method with the naming="id" parameter, although it could also have been called with "label". Also, note that the JSON data must keep its original order when it is passed to pyjstat, which is why it is loaded with object_pairs_hook=OrderedDict.
Now we're ready to plot the value column against the region id on a map of Norway's municipalities. For this example I've used this GeoJSON map after converting it to TopoJSON.
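To see what the ordering hook does in isolation, here is a minimal sketch that uses only the standard library and a made-up JSON string. The keys are illustrative and not taken from the Statistics Norway response, and nothing here depends on pyjstat:

```python
import json
from collections import OrderedDict

# Made-up JSON document; the keys are deliberately not in alphabetical order.
raw = '{"Region": 1, "ContentsCode": 2, "Tid": 3}'

# object_pairs_hook receives the key/value pairs in document order,
# so an OrderedDict keeps the keys exactly as they appear in the text.
ordered = json.loads(raw, object_pairs_hook=OrderedDict)
ordered_keys = list(ordered.keys())
print(ordered_keys)
```

On Python 3.7 and later a plain dict also keeps insertion order, so the hook matters mostly on older interpreters such as the Python 2 used in this notebook.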
import vincent
from vincent.values import ValueRef
from IPython.display import display
vincent.core.initialize_notebook()
# Get the map data
norway_topo = r'http://d1jbhcb2qvxl2c.cloudfront.net/norway_id.topo.json'
geo_data = [{'name': 'norway',
             'url': norway_topo,
             'feature': 'norway'}]
# Create the chart and bind the data to the map
vis_map = vincent.Map(data=population_data, geo_data=geo_data, scale=1000,
                      projection='mercator', center=[17.81, 63.4],
                      data_bind='value', data_key='Region',
                      map_key={'norway': 'properties.KOMM'})
vis_map.marks[0].properties.enter.stroke_opacity = ValueRef(value=0.5)
# Get the categories for labeling the chart
categories = pyjstat.get_dim_label(population_json_data['dataset'],
                                   'ContentsCode')
# Plot the chart
vis_map.legend(title=str(categories.loc['Folketallet11', 'label']))
display(vis_map)
Since I am anything but a GIS expert, I've used the same center and scale parameter values as Badosa. The results look much alike, so I guess it's not that bad ;-)
I'm not going to rack my brains over this one and will use the default Vincent parameters for a bar chart:
# Filter dataset by region (Oslo, for example)
oslo = population_dataset.loc[population_dataset['Region'] == 301,
                              ['ContentsCode', 'value']]
# Reindex the dataset for a better x-axis legend
oslo = oslo.set_index('ContentsCode')
# Remove non-comparable variables
bar = vincent.Bar(oslo.iloc[1:10])
# Plot the chart
bar.axis_titles(x='Variable', y='Population')
display(bar)
Now I'll use a Consumer Price Index dataset to show an example of time series plotting:
import datetime as dt
dataset_url_2 = 'http://data.ssb.no/api/v0/dataset/1086.json?lang=en'
# Important: JSON data must be retrieved in order;
# this can be accomplished using the object_pairs_hook parameter
# of the json.load method.
cpi_json_data = json.load(urllib2.urlopen(dataset_url_2),
                          object_pairs_hook=OrderedDict)
cpi_results = pyjstat.from_json_stat(cpi_json_data)
# Get the first result, since we're using only one input dataset
cpi_dataset = cpi_results[0]
# Filter the variable to plot
cpi_ts = cpi_dataset[cpi_dataset['contents'] ==
                     '12-month rate (per cent)']
# Split the time column into year and month and generate a datetime column
year, month = zip(*(s.split("M") for s in list(cpi_ts['time'])))
dates = [dt.datetime(int(y), int(m), 1) for y, m in zip(year, month)]
# Keep only the wanted columns and add the datetime column
cpi_data = cpi_ts.loc[:, ['consumption group', 'value']]
cpi_data['dates'] = dates
# Pivot the dataframe in order to get one column per category
cpi_data = cpi_data.pivot(index='dates', columns='consumption group',
                          values='value')
# plot a line chart
line = vincent.Line(cpi_data)
line.scales[0].type = 'time'
line.axis_titles(x='Time', y='Value')
line.legend(title='Consumer Price Index - 12-month rate (per cent)')
display(line)
So, after some data parsing, we can see a line plot of Norway's CPI All-item index over roughly the last 30 years.
Finally, I'll use yet another dataset to obtain a multi-line chart: specifically, the Index of production (2005=100) by industry/main industrial grouping.
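The parsing step above splits each time label on "M" into a year and a month. As an alternative sketch, the same conversion can be done with datetime.strptime from the standard library. The helper name parse_month_label and the sample labels are my own, made up for illustration, and assume labels of the form year, "M", two-digit month:

```python
import datetime as dt


def parse_month_label(label):
    """Turn a label such as '2014M01' into the first day of that month."""
    # '%Y' reads the year, the literal 'M' must match, '%m' reads the month;
    # the day is not part of the label, so strptime defaults it to 1.
    return dt.datetime.strptime(label, '%YM%m')


# Hypothetical sample labels, not read from the dataset.
sample_labels = ['2013M12', '2014M01']
parsed = [parse_month_label(s) for s in sample_labels]
print(parsed)
```

A side effect of this approach is that a malformed label raises a ValueError instead of silently producing a wrong date.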
import numpy as np
dataset_url_3 = 'http://data.ssb.no/api/v0/dataset/29843.json?lang=en'
# Important: JSON data must be retrieved in order;
# this can be accomplished using the object_pairs_hook parameter
# of the json.load method.
iop_json_data = json.load(urllib2.urlopen(dataset_url_3),
                          object_pairs_hook=OrderedDict)
iop_results = pyjstat.from_json_stat(iop_json_data)
# Get the first result, since we're using only one input dataset
iop_dataset = iop_results[0]
# Filter the variable to plot
iop_ts = iop_dataset[iop_dataset['contents'] ==
                     'Seasonally adjusted']
# Split the time column into year and month and generate a datetime column
year, month = zip(*(s.split("M") for s in list(iop_ts['time'])))
dates = [dt.datetime(int(y), int(m), 1) for y, m in zip(year, month)]
# Keep only the wanted columns and add the datetime column
iop_data = iop_ts.loc[:, ['industry/main industrial grouping', 'value']]
iop_data['dates'] = dates
# Remove null-valued rows
iop_data = iop_data[np.isfinite(iop_data['value'])]
# Pivot the dataframe in order to get one column per category
iop_data = iop_data.pivot(index='dates',
                          columns='industry/main industrial grouping',
                          values='value')
# plot a multi-line chart
multi = vincent.Line(iop_data)
multi.scales[0].type = 'time'
multi.axis_titles(x='Time', y='Value')
multi.legend(title='Index of Production (2005=100) Seasonally adjusted')
display(multi)
Perhaps there are too many categories in the chart to see what's going on, but since I was trying to show a multi-line example here, I guess it should do.
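If the chart feels too crowded, one option is to keep only a few columns of the pivoted DataFrame before plotting it. The sketch below uses made-up data; the column names, group names and values are hypothetical and only mimic the long-to-wide pivot used above:

```python
import pandas as pd

# Made-up long-format data: one row per (date, group) pair.
long_df = pd.DataFrame({
    'dates': pd.to_datetime(['2014-01-01', '2014-01-01',
                             '2014-02-01', '2014-02-01']),
    'group': ['Mining', 'Energy', 'Mining', 'Energy'],
    'value': [101.5, 98.2, 102.1, None],
})

# Drop rows without a value, then pivot to one column per group.
clean = long_df.dropna(subset=['value'])
wide = clean.pivot(index='dates', columns='group', values='value')

# Keep only the groups of interest to reduce clutter in the chart.
selected = wide[['Mining']]
print(selected)
```

A reduced frame like selected could then be passed to vincent.Line in place of the full pivoted frame.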
With these little examples, I've accomplished two different aims: learning how to use a little of Vincent Vega and showing some examples of pyjstat usage. I hope they are helpful to somebody. Please feel free to send any comments, corrections, suggestions, improvements, etc. I'll be glad to receive your feedback!
By predicador37