Introduction:
This program uses Python's regular expression or regex capabilities to extract information/data using Python's re
module. To regex some meaniningful expressions, a data file containing meterological information obtained from Australia's Bureau of Meteorology (BOM) is used. A set of alpha-numerical and numerical datas are extracted using Python's regex. Although note that regex is not the most efficient way to carry out this type of data analysis and interpretation. Here it is just used to show the versatality of regex and to learn using it for Python.
%matplotlib inline
import sys
import re
import os
import matplotlib.pyplot as plt
First, we do the importing business. Here, we import some basic modules in Python such as sys and os. The two new probably are re
, which is the regular expression or regex module and, matplotlib
is the Python's plotting library.
headerPattern = re.compile(r'(?P<pCode>[A-Za-z]+[ ]?[a-z]+),(?P<sNumber>(\w+[ ]?)+),(?P<ymd>(\w+,?){,3}),(?P<maxT>(\w+[ ]?\w+)),')
The first regex operation that we carry out is the compilation of the header pattern in the data.txt
file. Compiling the pattern convertes to bytecode, which is then subsequently used to search. At this stage, its ideal to look at the data.txt
if not already and to find the header information. As we see, the header contains Product code
, Bureau of Meteorology station number
, Year
, month
, day
, etc and we try to extract all of these via the headerPattern
as shown above. Note that we use raw string r
notation in Python and we also group the regex using ( )
. Grouping regex is a usefull feature when extracting multiple information and they can be assigned with a choice of variable. For example in the above, the Produce code
is assigned to pCode
and subsequently when we search for Product code
we can grab all the associated strings via pCode
. This feature will be become clear as we go further.
productCode = re.compile(r'(?P<pC>\w+)')
stationNumber = re.compile(r'\w+,(?P<sN>\d+)')
ymd = re.compile(r'\w+,\d+,(?P<year>\d+),(?P<month>\d+),(?P<day>\d+)')
maxTemp = re.compile(r'\w+,(\d+,){,4}(?P<mT>\d+.\d+)')
The above are again the regex compiling for the actual data present in the data.txt
that we are intereted to extract. Note the grouping and how the variables are assigned for each of the regex.
monthValue = []
dayValue = []
maxTempValue = []
The above are the empty lists and will be used to update the relevant information that is obtained when the regex are searched in the data file.
with open('data.txt', 'r') as data:
for line in data:
if not 'IDCJAC0010' in line:
headerSearch = headerPattern.search(line.strip())
print headerSearch.group('pCode')
print headerSearch.group('sNumber')
print headerSearch.group('ymd')
print headerSearch.group('maxT')
if not 'Product' in line:
pCSearch = productCode.search(line.strip())
sNSearch = stationNumber.search(line.strip())
ymdSearch = ymd.search(line.strip())
maxTempSearch = maxTemp.search(line.strip())
month = ymdSearch.group('month')
day = ymdSearch.group('day')
mTemp = maxTempSearch.group('mT')
monthValue.append(month)
dayValue.append(day)
maxTempValue.append(mTemp)
Product code Bureau of Meteorology station number Year,Month,Day Maximum temperature
In the above piece of code, we open the data.txt
for reading and then firstly search the header pattern. Here, note that how we use the variables that we have assigned such as pCode
, sNumber
, etc to store the search results. In a similar fashion, we search the relevant data. After all the regexs are found, the corresponding data are appended to the empty lists that we had before.
dataValues = zip(monthValue, dayValue, maxTempValue)
print len(dataValues)
365
Here, we basically zip
the three important lists making dataValues
as a tuple.
janValues = dataValues[0:31]
janDay = [x[1] for x in janValues]
janTemp = [x[2] for x in janValues]
This is where it gets a bit interesting! In the previous step, the length of the dataValues
is printed as 365
which indicates a year worth of data. Now since we are interested only in the January, which has 31
days, we create another variable called janValues
and slice the tuple from 0
to 31
. Note that janValues
is also a tuple with the 31
data sets. Next we use list comprehension to disect the tuple. janValues
has three lists for every set of data representing month, day and maximum temperature, in which the index[0] would correspond to the first list which is month, index[1] for day and index[2] for temperature.
febValues = dataValues[len(janValues)+1:len(janValues)+28]
febDay = [x[1] for x in febValues]
febTemp = [x[2] for x in febValues]
Similar to the month of January, we slice the dataValues
into 28
days and assign the day and temperature variables.
if janValues:
plt.plot(janDay, janTemp)
plt.xlabel('Days')
plt.ylabel('Temperature [C]')
plt.xlim(1,len(janValues))
plt.title('Maximum temperature for the month of January, Station ID: IDCJAC0010')
plt.show()
else:
print "Evaluate January values"
if febValues:
plt.plot(febDay,febTemp)
plt.xlabel('Days')
plt.ylabel('Temperature [C]')
plt.xlim(1, len(febValues))
plt.title('Maximum temperature for the month of Februrary, Station ID: IDCJAC0010')
plt.show()
else:
print "Evaluate Februrary values"