Parsing Hail Data Files

The following code parses the hail data from the Alberta Hail Project Meteorological and Barge-Humphries Radar Archive and writes it to a .CSV file.

The data consists of hail measurements and observations submitted by Alberta farmers between 1957 and 1985.

The data exists as a directory of .DAT files, which were manually converted to .TXT files by changing the extension. Regular expressions (also known as regex) then split each line into its individual fields, producing a list of tuples (one tuple per record) that can be exported to a .CSV file.

This code imports the modules and defines the custom functions that are needed to parse the hail data. The comments and docstrings describe what each piece of code does.

In [1]:

import re
import csv
from os import listdir
from os.path import splitext
from os.path import basename

# This function works on the contents of the files

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    # The with statement closes the file even if reading fails
    with open(filename) as infile:
        return infile.read()

# These functions list the text files in a directory and remove the path and file extension from a filename


def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles


def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name


def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name


def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name
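
As a quick check of what the path helpers above produce, here is a minimal sketch that collapses them into one function. The name `stem` is hypothetical and not part of the original notebook; it only illustrates the same `basename` and `splitext` calls applied in sequence.

```python
from os.path import basename, splitext


def stem(filepath):
    """Return the file name without its directory or extension."""
    # basename drops the directory part, splitext splits off '.txt'
    return splitext(basename(filepath))[0]


print(stem('Hail/76HAIL.txt'))  # 76HAIL
```

The order of the two calls does not matter here: stripping the extension first and the directory second, as get_filename does, gives the same result.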

Below is a list of the files in the Hail directory that need to be parsed. This list is helpful because we will need to manually change the filename in the next section of code in order to parse each file separately.

The function listdir returns the names of the files contained in the specified directory, which are printed below.

In [2]:

print(listdir('Hail/'))

['57HAIL.txt', '58HAIL.txt', '59HAIL.txt', '60HAIL.txt', '61HAIL.txt', '62HAIL.txt', '63HAIL.txt', '64HAIL.txt', '65HAIL.txt', '66HAIL.txt', '67HAIL.txt', '68HAIL.txt', '69HAIL.txt', '74HAIL.txt', '75HAIL.txt', '76HAIL.txt', '77HAIL.txt', '78HAIL.txt', '79HAIL.txt', '80HAIL.txt', '81HAIL.txt', '82HAIL.txt', '83HAIL.txt', '84HAIL.txt', '85HAIL.txt']

This is where the file to be parsed is chosen: its path is assigned to the file variable, and the contents are read in the next step. To read a different file from the directory, the filename needs to be changed (based on the above list). It's important to note that the original data is never altered; the code only works on the copy that is read into memory.

In [3]:

file = 'Hail/76HAIL.txt'

This code calls on a custom function get_filename to strip the path and file extension from the filename, storing it in the variable name. This variable will be used later to name the .CSV file so it matches the name of the .TXT file from which the data came.

Then, the file named in the variable file is read using the custom function read_file, and its contents are passed to a regex that matches the fixed-width layout of each line: one leading whitespace character followed by 31 groups of digits.

In [4]:

name = get_filename(file)
text = read_file(file)

data = re.findall(r'^(\s)(\d{1})(\d{2})(\d{1})(\d{2})(\d{4})(\d{2})(\d{2})(\d{2})(\d{1})(\d{4})(\d{3})(\d{4})(\d{3})(\d{1})(\d{3})(\d{1})(\d{1})(\d{1})(\d{4})(\d{1})(\d{4})(\d{1})(\d{2})(\d{1})(\d{1})(\d{1})(\d{1})(\d{2})(\d{1})(\d{1})(\d{1})', text, re.MULTILINE)
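
Because the pattern is just a fixed list of field widths, it can also be generated rather than typed out. The sketch below is one possible way to do that, not the notebook's own method; the names `WIDTHS` and `PATTERN` are hypothetical, and the widths are copied from the regex above. The two sample lines come from 76HAIL.txt.

```python
import re

# Digit widths of the 31 numeric fields, copied from the regex above.
WIDTHS = [1, 2, 1, 2, 4, 2, 2, 2, 1, 4, 3, 4, 3, 1, 3, 1,
          1, 1, 4, 1, 4, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1]

# One leading whitespace group, then one capture group per field.
PATTERN = re.compile(
    r'^(\s)' + ''.join(r'(\d{%d})' % w for w in WIDTHS),
    re.MULTILINE,
)

sample = (
    ' 17652510002938244101001510001800254229999920000133990999219\n'
    ' 17652500013532055111001511140111127299999920000199410900419\n'
)

records = PATTERN.findall(sample)
print(len(records))    # number of lines that matched
print(records[0][:5])  # leading space, origin, year, month, day
```

Keeping the widths in a list makes it easier to compare them against the codebook and to reuse the same pattern when parsing every file.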

Here's a look at the original data format. Each string of numbers is a separate entry made up of groups of digits, each corresponding to a specific data point: dates, times, geographical information, hail size and duration, rain size and duration, and damage to crops.

In [5]:

print(file)
print(text[:426])

Hail/76HAIL.txt
 17652510002938244101001510001800254229999920000133990999219
 17652500013532055111001511140111127299999920000199410900419
 17652700100345234999901099991200076229999910025099412900999
 17652910002049064164501216570109999229999920000099411900919
 17652900101038154120000511550259999329999920000125411900999
 17653101002046045163700916250559999299999910152099410900019
 17653100013237075172003017100459999329999930020140431999399

Here is the data formatted by the regex. After the leading space, the first group (a single digit) indicates the origin of the report (via mail-in card, etc.), the second is the two-digit year, and the third and fourth are the month and day. The data was parsed according to the information provided by the accompanying codebook.

In [6]:

print(data[:1])

[(' ', '1', '76', '5', '25', '1000', '29', '38', '24', '4', '1010', '015', '1000', '180', '0', '254', '2', '2', '9', '9999', '2', '0000', '1', '33', '9', '9', '0', '9', '99', '2', '1', '9')]
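
To illustrate how the leading groups can be interpreted, here is a minimal sketch that turns the year, month and day of the first record into a date. It assumes every two-digit year belongs to the 1900s, which fits the 1957 to 1985 range of the archive; the variable names are hypothetical.

```python
from datetime import date

# Leading fields of the first parsed record, as printed above.
record = (' ', '1', '76', '5', '25')

_, origin, yy, month, day = record

# Assumption: all two-digit years are 19xx (the archive spans 1957-1985).
report_date = date(1900 + int(yy), int(month), int(day))
print(report_date.isoformat())  # 1976-05-25
```

A record whose month or day is not a valid calendar value would raise a ValueError here, so real use would need to handle that case.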

Now the data can be converted to .CSV. This code uses the name variable that was created earlier to name the file and writes each record (one tuple of fields) to its own row in the file.

In [7]:

# newline='' (Python 3) stops the csv module from adding blank rows on Windows
with open(name + '.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerows(data)
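
One thing worth confirming is that the fields survive the trip to CSV unchanged, including leading zeros such as '015'. The sketch below checks this with an in-memory buffer standing in for the file; the two rows are shortened, illustrative records rather than full lines from the data.

```python
import csv
import io

# Shortened stand-ins for parsed records; note the leading zeros.
rows = [(' ', '1', '76', '5', '25', '015'),
        (' ', '1', '76', '5', '27', '010')]

buffer = io.StringIO()  # stands in for the .csv file on disk
csv.writer(buffer).writerows(rows)

# Read the text back and compare it with what was written.
buffer.seek(0)
read_back = [tuple(row) for row in csv.reader(buffer)]
print(read_back == rows)
```

The csv module stores every field as text, so the values come back exactly as written. A spreadsheet program that opens the file may still reinterpret them as numbers and drop the leading zeros.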

This code reads all of the files and writes every record to one master .CSV file. The only extra step required is in the fourth chunk of code (In [11]), where the master list, which holds one list of records per file, needs to be flattened so that each record appears in its own row in the file.

In [8]:

filenames = []
for path in list_textfiles('Hail/'):
    filenames.append(get_filename(path))

In [9]:

docs = []
for filename in list_textfiles('Hail/'):
    docs.append(read_file(filename))

In [10]:

data_2 = []
for doc in docs:
    data_2.append(re.findall(r'^(\s)(\d{1})(\d{2})(\d{1})(\d{2})(\d{4})(\d{2})(\d{2})(\d{2})(\d{1})(\d{4})(\d{3})(\d{4})(\d{3})(\d{1})(\d{3})(\d{1})(\d{1})(\d{1})(\d{4})(\d{1})(\d{4})(\d{1})(\d{2})(\d{1})(\d{1})(\d{1})(\d{1})(\d{2})(\d{1})(\d{1})(\d{1})', doc, re.MULTILINE))

In [11]:

alldata = [line for sublist in data_2 for line in sublist]
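
The flattening step can also be written with itertools.chain.from_iterable from the standard library, which does the same thing as the nested list comprehension. The sketch below uses a small, made-up `per_file` list in place of data_2 to show the effect.

```python
from itertools import chain

# Two small per-file result lists, standing in for data_2.
per_file = [[('a', '1'), ('a', '2')], [('b', '1')]]

# Join the per-file lists end to end into one list of records.
flat = list(chain.from_iterable(per_file))
print(flat)  # [('a', '1'), ('a', '2'), ('b', '1')]
```

Either form keeps the records in file order and leaves each tuple intact, which is what csv.writer needs to write one record per row.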

In [12]:

# newline='' (Python 3) stops the csv module from adding blank rows on Windows
with open('allHail.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerows(alldata)