#!/usr/bin/env python # coding: utf-8 # # Introduction to programming for Geoscientists through Python # ### [Gerard Gorman](http://www.imperial.ac.uk/people/g.gorman), [Nicolas Barral](http://www.imperial.ac.uk/people/n.barral) # # Lecture 6: Files, strings, and dictionaries # Learning objectives: You will learn how to: # # * Read data in from a file # * Parse strings to extract specific data of interest. # * Use dictionaries to index data using any type of key. # ## Reading data from a plain text file # We can read text from a [text file](http://en.wikipedia.org/wiki/Text_file) into strings in a program. This is a common (and simple) way for a program to get input data. The basic recipe is: # Open text file infile = open("myfile.dat", "r") # Read next line: line = infile.readline() # Read the lines in a loop one by one: for line in infile: # Load all lines into a list of strings: lines = infile.readlines() for line in lines: # Let's look at the file [data1.txt](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/data1.txt) (all of the data files in this lecture are stored in the sub-folder *data/* of this notebook library). The files has a column of numbers: 21.8 18.1 19 23 26 17.8 # The goal is to read this file and calculate the mean: # In[1]: # Open data file infile = open("data/data1.txt", "r") # Initialise values mean = 0 n=0 # Loop to perform sum for number in infile: number = float(number) mean = mean + number n += 1 # It is good practice to close a file when you are finished. infile.close() # Calculate the mean. mean = mean/n print(mean) # Let's make this example more interesting. There is a **lot** of data out there for you to discover all kinds of interesting facts - you just need to be interested in learning a little analysis. For this case I have downloaded tidal gauge data for the port of Avonmouth from the [BODC](http://www.bodc.ac.uk/). If you look at the header of file [data/2012AVO.txt](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/2012AVO.txt) you will see the [metadata](http://en.wikipedia.org/wiki/Metadata): Port: P060 Site: Avonmouth Latitude: 51.51089 Longitude: -2.71497 Start Date: 01JAN2012-00.00.00 End Date: 30APR2012-23.45.00 Contributor: National Oceanography Centre, Liverpool Datum information: The data refer to Admiralty Chart Datum (ACD) Parameter code: ASLVTD02 = Surface elevation (unspecified datum) of the water body by fixed in-situ pressure sensor # Let's read the column ASLVTD02 (the surface elevation) and plot it: # In[2]: from pylab import * tide_file = open("data/2012AVO.txt", "r") # We know from inspecting the file that the first 11 lines are just # header information so lets just skip those lines. for i in range(11): line = tide_file.readline() # Initialise an empty list to store the elevation elevation = [] days = [] # Now we start reading the interesting data n=0 while True: # This will keep looping until we break out. # Here we use a try/except block to try to read the data as normal # and to break out if unsuccessful - ie when we reach the end of the file. try: # Read the next line line = tide_file.readline() # Split this line into words. words = line.split() # If we do not have 5 words then it must be blank lines at the end of the file. if len(words)!=5: break except: # If we failed to read a line then we must have got to the end. break n+=1 # Count number of data points try: # The elevation data is on the 4th column. However, the BODC # appends a "M" when a value is improbable and an "N" when # data is missing (maybe a ship dumped into it during rough weather!) # Therefore, we put this conversion from a string into a float in a # try/except block. level = float(words[3]) elevation.append(level) # There is a measurement every quarter hour. days.append(n*0.25/24) except: continue # For plotting lets convert the list to a NumPy array. elevation = array(elevation) days = array(days) plot(days, elevation) xlabel("Days") ylabel("Elevation (meters)") show() # Quiz time: # # * What tidal constituents can you identify by looking at this plot? # * Is this primarily a diurnal or semi-diurnal tidal region? (hint - change the x-axis range on the plot above). # # You will notice in the above example that we used the *split()* string member function. This is a very useful function for grabbing individual words on a line. When called without any arguments it assumes that the [delimiter](http://en.wikipedia.org/wiki/Delimiter) is a blank space. However, you can use this to split a string with any delimiter, *e.g.*, *line.split(';')*, *line.split(':')*. # ## Exercise 6.1: Read a two-column data file # The file [data/xy.dat](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/xy.dat) contains two columns of numbers, corresponding to *x* and *y* coordinates on a curve. The start of the file looks like this: # # -1.0000 -0.0000
# -0.9933 -0.0087
# -0.9867 -0.0179
# -0.9800 -0.0274
# -0.9733 -0.0374
# # Make a program that reads the first column into a list *x* and the second column into a list *y*. Then convert the lists to arrays, and plot the curve. Print out the maximum and minimum y coordinates. (Hint: Read the file line by line, split each line into words, convert to float, and append to *x* and *y*.)
# In[ ]: # ## Exercise 6.2: Read a data file # The files [data/density_water.dat](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/density_water.dat) and [data/density_air.dat](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/density_air.dat) contain data about the density of water and air (respectively) for different temperatures. The data files have some comment lines starting with # and some lines are blank. The rest of the lines contain density data: the temperature in the first column and the corresponding density in the second column. The goal of this exercise is to read the data in such a file, discard commented or blank lines, and plot the density versus the temperature as distinct (small) circles for each data point. Apply the program to both files. # In[ ]: # ## Exercise 6.3: Read acceleration data and find velocities # A file [data/acc.dat](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/acc.dat) contains measurements $a_0, a_1, \ldots, a_{n-1}$ of the acceleration of an object moving along a straight line. The measurement $a_k$ is taken at time point $t_k = k\Delta t$, where $\Delta t$ is the time spacing between the measurements. The purpose of the exercise is to load the acceleration data into a program and compute the velocity $v(t)$ of the object at some time $t$. # # In general, the acceleration $a(t)$ is related to the velocity $v(t)$ through $v^\prime(t) = a(t)$. This means that # # $$ # v(t) = v(0) + \int_0^t{a(\tau)d\tau} # $$ # # If $a(t)$ is only known at some discrete, equally spaced points in time, $a_0, \ldots, a_{n-1}$ (which is the case in this exercise), we must compute the integral above numerically, for example by the Trapezoidal rule: # # $$ # v(t_k) \approx v(0) + \Delta t \left(\frac{1}{2}a_0 + \frac{1}{2}a_k + \sum_{i=1}^{k-1}a_i \right), \ \ 1 \leq k \leq n-1. # $$ # # We assume $v(0) = 0$ so that also $v_0 = 0$. # Read the values $a_0, \ldots, a_{n-1}$ from file into an array, plot the acceleration versus time, and use the Trapezoidal rule to compute one $v(t_k)$ value. Experiment with different values of $\Delta t$ and $k$. # In[ ]: # ## Python dictionaries # Suppose we need to store the temperatures in Oslo, London and Paris. The Python list solution might look like: # In[3]: temps = [13, 15.4, 17.5] # temps[0]: Oslo # temps[1]: London # temps[2]: Paris # In this case we need to remember the mapping between the index and the city name. It would be easier to specify name of city to get the temperature. Containers such as lists and arrays use a continuous series of integers to index elements. However, for many applications such an integer index is not useful. # # **Dictionaries** are containers where any Python object can be used # as an index. Let's rewrite the previous example using a Python dictionary: # In[4]: temps = {"Oslo": 13, "London": 15.4, "Paris": 17.5} print("The temperature in London is", temps["London"]) # Add a new element to a dictionary: # In[5]: temps["Madrid"] = 26.0 print(temps) # Loop (iterate) over a dictionary: # In[6]: for city in temps: print("The temperature in %s is %g" % (city, temps[city])) # The index in a dictionary is called the **key**. A dictionary is said to hold key–value pairs. So in general: for key in dictionary: value = dictionary[key] print(value) # Does the dictionary have a particular key (*i.e.* a particular data entry)? # In[7]: if "Berlin" in temps: print("We have Berlin and its temperature is ", temps["Berlin"]) else: print("I don't know Berlin' termperature.") # In[8]: print("Oslo" in temps) # i.e. standard boolean expression # The keys and values can be reached as lists: # In[9]: print("Keys = ", temps.keys()) print("Values = ", temps.values()) # Note that the sequence of keys is **arbitrary**! Never rely on it, if you need a specific order of the keys then you should explicitly sort: # In[10]: for key in sorted(temps): value = temps[key] print(key, value) # Remove Oslo key:value: # In[11]: del temps["Oslo"] # remove Oslo key w/value print(temps, len(temps)) # Similarly to what we saw for arrays, two variables can refer to the same dictionary: # In[12]: t1 = temps t1["Stockholm"] = 10.0 print(temps) # So we can see that while we modified *t1*, the *temps* dictionary was also changed. # Let's look at a simple example of reading the same data from a file and putting it into a dictionary. We will be reading the file *data/deg2.dat*. # In[13]: infile = open("data/deg2.dat", "r") # Start with empty dictionary temps = {} for line in infile: # If you examine the file you will see a ':' after the city name, # so let's use this as the delimiter for splitting the line. city, temp = line.split(":") temps[city] = float(temp) infile.close() print(temps) # ## Exercise 6.4: Make a dictionary from a table # The file [data/constants.txt](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/constants.txt) contains a table of the values and the dimensions of some fundamental constants from physics. We want to load this table into a dictionary *constants*, where the keys are the names of the constants. For example, *constants['gravitational constant']* holds the value of the gravitational constant (6.67259 $\times$ 10$^{-11}$) in Newton's law of gravitation. Make a function that that reads and interprets the text in the file, and thereafter returns the dictionary. # In[ ]: # ## Exercise 6.5: Explore syntax differences: lists vs. dictionaries # Consider this code: # In[14]: t1 = {} t1[0] = -5 t1[1] = 10.5 # Explain why the lines above work fine while the ones below do not: # In[15]: t2 = [] t2[0] = -5 t2[1] = 10.5 # What must be done in the last code snippet to make it work properly? # ## Exercise 6.6: Compute the area of a triangle # An arbitrary triangle can be described by the coordinates of its three vertices: $(x_1, y_1), (x_2, y_2), (x_3, y_3)$, numbered in a counterclockwise direction. The area of the triangle is given by the formula: # # $A = \frac{1}{2}|x_2y_3 - x_3y_2 - x_1y_3 + x_3y_1 + x_1y_2 - x_2y_1|.$ # # Write a function *area(vertices)* that returns the area of a triangle whose vertices are specified by the argument vertices, which is a nested list of the vertex coordinates. For example, vertices can be [[0,0], [1,0], [0,2]] if the three corners of the triangle have coordinates (0, 0), (1, 0), and (0, 2). # # Then, assume that the vertices of the triangle are stored in a dictionary and not a list. The keys in the dictionary correspond to the vertex number (1, 2, or 3) while the values are 2-tuples with the x and y coordinates of the vertex. For example, in a triangle with vertices (0, 0), (1, 0), and (0, 2) the vertices argument becomes: # In[ ]: # ## String manipulation # Text in Python is represented as **strings**. Programming with strings is therefore the key to interpret text in files and construct new text (*i.e.* **parsing**). First we show some common string operations and then we apply them to real examples. Our sample string used for illustration is: # In[16]: s = "Berlin: 18.4 C at 4 pm" # Strings behave much like lists/tuples - they are simply a sequence of characters: # In[17]: print("s[0] = ", s[0]) print("s[1] = ", s[1]) # Substrings are just slices of lists and arrays: # In[18]: # from index 8 to the end of the string print(s[8:]) # In[19]: # index 8, 9, 10 and 11 (not 12!) print(s[8:12]) # In[20]: # from index 8 to 8 from the end of the string print(s[8:-8]) # You can also find the start of a substring: # In[21]: # where does "Berlin" start? print(s.find("Berlin")) # In[22]: print(s.find("pm")) # In[23]: print (s.find("Oslo")) # In this last example, Oslo does not exist in the list so the return value is -1. # We can also check if a substring is contained in a string: # In[24]: print ("Berlin" in s) # In[25]: print ("Oslo" in s) # In[26]: if "C" in s: print("C found") else: print("C not found") # ### Search and replace # Strings also support substituting a substring by another string. In general this looks like *s.replace(s1, s2)*, which replaces string *s1* in *s* by string *s2*, *e.g.*: # In[27]: s = s.replace(" ", "_") print(s) # In[28]: s = s.replace("Berlin", "Bonn") print(s) # In[29]: # Replace the text before the first colon by 'London' s = s.replace(s[:s.find(":")], "London") print(s) # Notice that in all these examples we assign the new result back to *s*. One of the reasons we are doing this is strings are actually constant (*i.e* immutable) and therefore cannot be modified *inplace*. We **cannot** write for example: s[18] = '5' TypeError: "str" object does not support item assignment # We also encountered examples above where we used the split function to break up a line into separate substrings for a given separator (where a space is the default delimiter). Sometimes we want to split a string into lines - *i.e.* the delimiter is the [carriage return](http://en.wikipedia.org/wiki/Carriage_return). This can be surprisingly tricky because different computing platforms (*e.g.* Windows, Linux, Mac) use different characters to represent a carriage return. For example, Unix uses '\n'. Luckly Python provides a *cross platform* way of doing this so regardless of what platform created the data file, or what platform you are running Python on, it will do the *right thing*: # In[30]: t = "1st line\n2nd line\n3rd line" print ("""original t = """, t) # In[31]: # This works here but will give you problems if you are switching # files between Windows and either Mac or Linux. print (t.split("\n")) # In[32]: # Cross platform (ie better) solution print(t.splitlines()) # ### Stripping off leading/trailing whitespace # When processing text from a file and composing new strings, we frequently need to trim leading and trailing whitespaces: # In[33]: s = " text with leading and trailing spaces \n" print("-->%s<--"%s.strip()) # In[34]: # left strip print("-->%s<--"%s.lstrip()) # In[35]: # right strip print("-->%s<--"%s.rstrip()) # ### join() (the opposite of split()) # We can join a list of substrings to form a new string. Similarly to *split()* we put strings together with a delimiter inbetween: # In[36]: strings = ["Newton", "Secant", "Bisection"] print(", ".join(strings)) # You can prove to yourself that these are inverse operations: t = delimiter.join(stringlist) stringlist = t.split(delimiter) # As an example, let's split off the first two words on a line: # In[37]: line = "This is a line of words separated by space" words = line.split() print("words = ", words) line2 = " ".join(words[2:]) print("line2 = ", line2) # ## Exercise 6.7: Improve a program # The file [data/densities.dat](https://raw.githubusercontent.com/ggorman/Introduction-to-programming-for-geoscientists/master/notebook/data/densities.dat) contains a table of densities of various substances measured in g/cm$^3$. The following program reads the data in this file and produces a dictionary whose keys are the names of substances, and the values are the corresponding densities. # In[40]: def read_densities(filename): infile = open(filename, 'r') densities = {} for line in infile: words = line.split() density = float(words[-1]) if len(words[:-1]) == 2: substance = words[0] + ' ' + words[1] else: substance = words[0] densities[substance] = density infile.close() return densities densities = read_densities('data/densities.dat') # One problem we face when implementing the program above is that the name of the substance can contain one or two words, and maybe more words in a more comprehensive table. The purpose of this exercise is to use string operations to shorten the code and make it more general. Implement the following two methods in separate functions in the same program, and control that they give the same result. # # 1. Let *substance* consist of all the words but the last, using the join method in string objects to combine the words. # 2. Observe that all the densities start in the same column file and use substrings to divide line into two parts. (Hint: Remember to strip the first part such that, e.g., the density of ice is obtained as *densities['ice']* and not *densities['ice ']*.) # ##File writing # Writing a file in Python is simple. You just collect the text you want to write in one or more strings and, for each string, use a statement along the lines of outfile.write(string) # The write function does not add a newline character so you may have to do that explicitly: outfile.write(string + ’\n’) # That’s it! Compose the strings and write! Let's do an example. Write a nested list (table) to a file: # In[41]: # Let's define some table of data data = [[ 0.75, 0.29619813, -0.29619813, -0.75 ], [ 0.29619813, 0.11697778, -0.11697778, -0.29619813], [-0.29619813, -0.11697778, 0.11697778, 0.29619813], [-0.75, -0.29619813, 0.29619813, 0.75 ]] # Open the file for writing. Notice the "w" indicates we are writing! outfile = open("tmp_table.dat", "w") for row in data: for column in row: outfile.write("%14.8f" % column) outfile.write("\n") # ensure newline outfile.close() # And that's it - run the above cell and take a look at the file that was generated in your Azure library clone. # ## Exercise 6.8: Write function data to a file # We want to dump $x$ and $f(x)$ values to a file named function_data.dat, where the $x$ values appear in the first column and the $f(x)$ values appear in the second. Choose $n$ equally spaced $x$ values in the interval [-4, 4]. Here, the function $f(x)$ is given by: # # $f(x) = \frac{1}{\sqrt{2\pi}}\exp(-0.5x^2)$ # In[ ]: