The scripts in this notebook are utilities for generating a hierarchical JSON tree structure from a pandas dataframe.
Consider the following dataframe:
import pandas as pd
wards_csv_fn = 'wards_data.csv'
wards_df = pd.read_csv(wards_csv_fn)
wards_df.head()
 | WD17CD | WD17NM | LAD17CD | LAD17NM | GOR10CD | GOR10NM | CTRY17CD | CTRY17NM |
---|---|---|---|---|---|---|---|---|
0 | E05001678 | Newington | E06000010 | Kingston upon Hull, City of | E12000003 | Yorkshire and The Humber | E92000001 | England |
1 | E05001779 | Mickleover | E06000015 | Derby | E12000004 | East Midlands | E92000001 | England |
2 | E05001628 | Higher Croft | E06000008 | Blackburn with Darwen | E12000002 | North West | E92000001 | England |
3 | E05008942 | Burn Valley | E06000001 | Hartlepool | E12000001 | North East | E92000001 | England |
4 | E05001679 | Newland | E06000010 | Kingston upon Hull, City of | E12000003 | Yorkshire and The Humber | E92000001 | England |
Regions are contained within countries, local authorities within regions, wards within local authorities.
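Because treelib requires node identifiers to be unique within a tree, it can be worth confirming that the containment really is strict, with each child code sitting under exactly one parent code, before building anything. The following is a minimal sketch of such a check; the helper name check_strict_hierarchy and the miniature sample_df are illustrative assumptions rather than part of the original notebook:

```python
import pandas as pd

def check_strict_hierarchy(df, child_col, parent_col):
    """Return the child codes that appear under more than one parent code."""
    # Count the distinct parent codes seen for each child code
    parents_per_child = df.groupby(child_col)[parent_col].nunique()
    return parents_per_child[parents_per_child > 1].index.tolist()

# Hypothetical miniature of the wards table, using the same column names
sample_df = pd.DataFrame({
    "WD17CD": ["E05001678", "E05001679", "E05001779"],
    "LAD17CD": ["E06000010", "E06000010", "E06000015"],
    "GOR10CD": ["E12000003", "E12000003", "E12000004"],
})

ward_problems = check_strict_hierarchy(sample_df, "WD17CD", "LAD17CD")
la_problems = check_strict_hierarchy(sample_df, "LAD17CD", "GOR10CD")
print(ward_problems, la_problems)
```

An empty list at each level suggests the columns describe a strict hierarchy; any codes returned would need resolving before they are used as node identifiers.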
The Python treelib package provides support for creating simple tree structures. The following explicit code fragment generates a treelib.Tree() object from the dataframe, using the label and ID columns for each level of the tree:
#%pip install treelib
from treelib import Tree

country_tree = Tree()
# Create a root node
country_tree.create_node("Country", "countries")

# Group by country
for country, regions in wards_df.head(5).groupby(["CTRY17NM", "CTRY17CD"]):
    # Generate a node for each country
    country_tree.create_node(country[0], country[1], parent="countries")
    # Group by region
    for region, las in regions.groupby(["GOR10NM", "GOR10CD"]):
        # Generate a node for each region
        country_tree.create_node(region[0], region[1], parent=country[1])
        # Group by local authority
        for la, wards in las.groupby(['LAD17NM', 'LAD17CD']):
            # Create a node for each local authority
            country_tree.create_node(la[0], la[1], parent=region[1])
            for ward, _ in wards.groupby(['WD17NM', 'WD17CD']):
                # Create a leaf node for each ward
                country_tree.create_node(ward[0], ward[1], parent=la[1])

# Output the hierarchical data
country_tree.show()
Country
└── England
    ├── East Midlands
    │   └── Derby
    │       └── Mickleover
    ├── North East
    │   └── Hartlepool
    │       └── Burn Valley
    ├── North West
    │   └── Blackburn with Darwen
    │       └── Higher Croft
    └── Yorkshire and The Humber
        └── Kingston upon Hull, City of
            ├── Newington
            └── Newland
Whilst the code works, it is a little messy, with one hand-written loop per level. More generally, we can create a recursive function that works down through the levels and builds the tree for us:
%%writefile table2tree.py
from treelib import Tree

def create_tree(df, items, parent, root=None, tree=None, i=0):
    """Create a tree from a dataframe."""
    # On the first call, create the tree and its root node
    if tree is None:
        tree = Tree()
        root = root if root else parent
        tree.create_node(root, parent)

    i = i + 1

    # Group on the [label, ID] column pair for the current level
    for parental, group_df in df.groupby(items[i-1]):
        tree.create_node(parental[0], parental[1], parent=parent)
        # Recurse while there are deeper levels left to process
        if i <= len(items)-1:
            create_tree(group_df, items, parental[1], tree=tree, i=i)

    return tree
Overwriting table2tree.py
# Run the file as if we had run the code cell
%run table2tree.py
We can now specify a list of column pairs (label and ID) for each level of the tree and generate the tree from that:
# The items specify the label and the ID columns for each node in the tree
items = [["CTRY17NM", "CTRY17CD"],
         ["GOR10NM", "GOR10CD"],
         ['LAD17NM', 'LAD17CD'],
         ['WD17NM', 'WD17CD']]

tree = create_tree(wards_df.head(10), items, 'countries', 'Country')
tree.show()
Country
└── England
    ├── East Midlands
    │   └── Derby
    │       ├── Mickleover
    │       ├── Normanton
    │       └── Oakwood
    ├── North East
    │   └── Hartlepool
    │       └── Burn Valley
    ├── North West
    │   └── Blackburn with Darwen
    │       ├── Higher Croft
    │       ├── Little Harwood
    │       └── Livesey with Pleasington
    └── Yorkshire and The Humber
        └── Kingston upon Hull, City of
            ├── Newington
            ├── Newland
            └── Orchard Park and Greenwood
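We can also export the tree as JSON. The to_json() method returns a JSON string, which we parse back into a Python dictionary here so that we can inspect it: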
import json
tree_json = json.loads(tree.to_json())
tree_json
{'Country': {'children': [{'England': {'children': [{'East Midlands': {'children': [{'Derby': {'children': ['Mickleover', 'Normanton', 'Oakwood']}}]}}, {'North East': {'children': [{'Hartlepool': {'children': ['Burn Valley']}}]}}, {'North West': {'children': [{'Blackburn with Darwen': {'children': ['Higher Croft', 'Little Harwood', 'Livesey with Pleasington']}}]}}, {'Yorkshire and The Humber': {'children': [{'Kingston upon Hull, City of': {'children': ['Newington', 'Newland', 'Orchard Park and Greenwood']}}]}}]}}]}}
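The parsed structure above only exists in memory. As a minimal sketch of actually writing it to a file with the standard library, something like the following could be used; the helper name save_tree_json, the file name tree.json and the hand-copied tree_fragment are illustrative assumptions, not part of the original notebook:

```python
import json
import tempfile
from pathlib import Path

def save_tree_json(tree_dict, path):
    """Write a tree dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tree_dict, f, indent=2, ensure_ascii=False)

# Hand-copied fragment standing in for tree_json
tree_fragment = {"Country": {"children": [
    {"England": {"children": [
        {"North East": {"children": [
            {"Hartlepool": {"children": ["Burn Valley"]}}
        ]}}
    ]}}
]}}

# Write to a temporary directory, then read the file back to check the round trip
out_path = Path(tempfile.mkdtemp()) / "tree.json"
save_tree_json(tree_fragment, out_path)
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

In the notebook itself, tree_json could be passed in place of tree_fragment, with a path of your choosing.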
The format of the JSON has interstitial children elements that may be convenient in some cases, but that may be surplus to requirements in others.
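To see what the interstitial children elements mean in practice, here is a small sketch of looking up the wards of one local authority in this format. The fragment is hand-copied from the output above, and get_child is a hypothetical helper rather than anything provided by treelib:

```python
# A hand-copied fragment of the JSON structure shown above
fragment = {"Country": {"children": [
    {"England": {"children": [
        {"East Midlands": {"children": [
            {"Derby": {"children": ["Mickleover", "Normanton", "Oakwood"]}}
        ]}}
    ]}}
]}}

def get_child(node, parent_label, child_label):
    """Find the child dict keyed by child_label under node[parent_label]."""
    # Each level is a list of single-key dicts, so we have to scan the list
    for child in node[parent_label]["children"]:
        if isinstance(child, dict) and child_label in child:
            return child
    raise KeyError(child_label)

england = get_child(fragment, "Country", "England")
east_midlands = get_child(england, "England", "East Midlands")
derby = get_child(east_midlands, "East Midlands", "Derby")
derby_wards = derby["Derby"]["children"]
print(derby_wards)  # ['Mickleover', 'Normanton', 'Oakwood']
```

Every step down the hierarchy needs both a children lookup and a scan of a list, which is the overhead that a flatter structure avoids.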
Naively, and explicitly, we could start to remove these elements from the tree using something like the following code snippet, which strips them from the top two levels only:
tmp_pruned_tree = {'Country': {}}
# The children of the root are the countries...
for country in tree_json['Country']['children']:
    for country_key in country.keys():
        tmp_pruned_tree['Country'][country_key] = {}
        # ...and the children of each country are its regions
        for region in country[country_key]['children']:
            for region_key in region.keys():
                tmp_pruned_tree['Country'][country_key][region_key] = region[region_key]['children']

tmp_pruned_tree
{'Country': {'England': {'East Midlands': [{'Derby': {'children': ['Mickleover', 'Normanton', 'Oakwood']}}], 'North East': [{'Hartlepool': {'children': ['Burn Valley']}}], 'North West': [{'Blackburn with Darwen': {'children': ['Higher Croft', 'Little Harwood', 'Livesey with Pleasington']}}], 'Yorkshire and The Humber': [{'Kingston upon Hull, City of': {'children': ['Newington', 'Newland', 'Orchard Park and Greenwood']}}]}}}
Once again, we can take inspiration from the literal code to come up with a recursive function that will prune the children elements for us from a tree of any depth:
%%writefile -a table2tree.py
import json

def prune_tree(tree, pruned=None, path=None):
    """Prune 'children' nodes from tree."""
    # Create a new pruned tree if we haven't yet started...
    pruned = {} if pruned is None else pruned

    # Convert the tree to a dict if it isn't already in dict form
    if isinstance(tree, Tree):
        tree = json.loads(tree.to_json())

    # Get the first (root) node
    path = path if path else next(iter(tree))

    # This will be our pruned tree dictionary
    pruned[path] = {}
    # Now start to check the subtrees...
    for subtree in tree[path]['children']:
        # If we find another subtree...
        if isinstance(subtree, dict):
            # Descend into it...
            for subtree_key in subtree.keys():
                # Create a new key node for this subtree
                pruned[path][subtree_key] = {}
                # And carry on pruning down into the tree
                prune_tree(subtree, pruned[path], subtree_key)
        else:
            # We've reached the leaves, which we add as a list
            pruned[path] = tree[path]['children']

    return pruned
Appending to table2tree.py
# Run the file as if we had run the code cell
%run table2tree.py
pruned_tree = prune_tree(tree_json)
pruned_tree
{'Country': {'England': {'East Midlands': {'Derby': ['Mickleover', 'Normanton', 'Oakwood']}, 'North East': {'Hartlepool': ['Burn Valley']}, 'North West': {'Blackburn with Darwen': ['Higher Croft', 'Little Harwood', 'Livesey with Pleasington']}, 'Yorkshire and The Humber': {'Kingston upon Hull, City of': ['Newington', 'Newland', 'Orchard Park and Greenwood']}}}}
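As a point of comparison, the same pruning can be sketched as a function that returns a new dictionary at each level instead of filling in an accumulator passed down through the recursion. This is an alternative sketch under stated assumptions, not the notebook's own method: the name prune_children is hypothetical, the input is assumed to be the parsed dictionary rather than a Tree object, and the handling of a node whose children mix leaves and subtrees (each leaf is given an empty list) is a design choice made purely for illustration:

```python
def prune_children(node):
    """Return a copy of a treelib-style JSON dict without 'children' wrappers."""
    # A bare string is a leaf, so there is nothing to prune
    if isinstance(node, str):
        return node
    pruned = {}
    for label, body in node.items():
        children = body.get("children", [])
        if all(isinstance(child, str) for child in children):
            # Only leaves below this node: keep them as a plain list
            pruned[label] = list(children)
        else:
            # Merge the pruned subtrees into a single dict under this label
            merged = {}
            for child in children:
                if isinstance(child, str):
                    merged[child] = []  # assumption: a leaf alongside subtrees
                else:
                    merged.update(prune_children(child))
            pruned[label] = merged
    return pruned

# Hand-copied fragment of the unpruned JSON structure
sample = {"Country": {"children": [
    {"England": {"children": [
        {"East Midlands": {"children": [
            {"Derby": {"children": ["Mickleover", "Normanton"]}}
        ]}},
        {"North East": {"children": [
            {"Hartlepool": {"children": ["Burn Valley"]}}
        ]}}
    ]}}
]}}

pruned_sample = prune_children(sample)
derby_wards = pruned_sample["Country"]["England"]["East Midlands"]["Derby"]
```

With the children elements gone, a lookup becomes a plain chain of dictionary keys, as in the last line of the sketch.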