by David Paredes (david.paredes@durham.ac.uk) and James Keaveney (james.keaveney@durham.ac.uk)
In Python, there are two ways of opening a data file: the built-in open function, or higher-level functions from modules such as numpy and csv. Usually we will deal with highly structured files, so the second method will be easier in most cases (for very large files this is no longer true).
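For completeness, here is what reading a file with the built-in open function looks like. This is a minimal sketch; the filename and values are made up here, so the file is created first to keep the snippet self-contained:

```python
# Create a small example file first (hypothetical data, just for illustration)
with open("tiny_example.txt", "w") as f:
    f.write("1 3.023\n2 5.1\n")

# Read it back line by line with the built-in open function
with open("tiny_example.txt") as f:
    lines = f.readlines()

# Each line comes back as a raw string; all parsing is up to us
first_x, first_y = lines[0].split()
print(float(first_x), float(first_y))
```

The manual split-and-convert step is exactly the bookkeeping that numpy.loadtxt and csv take care of for us.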
In this tutorial, we will cover two modules/functions, numpy.loadtxt and csv, that allow you to get data from comma-separated-value (csv) files, where the different data values reside in a file separated by commas, spaces or other delimiters (as the name suggests).
The function numpy.loadtxt, also present as scipy.loadtxt in the scipy package, is very simple to use. You just pass the name of the file as a string (that is, enclosed in single or double quotes) and the output is an array with the contents of the file. For example, let's load the information in the file simpleDataset.csv and store it in the variable myDataset.
import numpy #Remember to import the module
myDataset = numpy.loadtxt("./code/io/csv_example/simpleDataset.csv")
print myDataset
[[  1.       3.023 ]
 [  2.       5.1   ]
 [  6.      23.    ]
 [  6.6      2.    ]
 [  8.       9.23  ]
 [  8.1     10.0001]]
The text file contains two columns of numbers separated by one space. By default, the function takes as a delimiter any white space.
If, for example, the values were separated by commas (as in simpleDatasetComma.txt), we would need to specify the keyword argument delimiter, which is a string containing the delimiting character(s).
myDatasetComma = numpy.loadtxt("./code/io/csv_example/simpleDatasetComma.txt", delimiter=',')
print myDatasetComma
[[  1.       3.023 ]
 [  2.       5.1   ]
 [  6.      23.    ]
 [  6.6      2.    ]
 [  8.       9.23  ]
 [  8.1     10.0001]]
Some files contain headers that are not part of the data (see fileWithHeader.csv). This file starts with 3 lines that give information about the data but are not data themselves. The header can be skipped with the keyword skiprows:
complicatedDataset = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv",delimiter=',', skiprows=3)
print complicatedDataset
[[-0.0836    0.479172  0.00209844  0.202813 ]
 [-0.083598  0.479313  0.00194219  0.202813 ]
 [-0.083596  0.479313  0.00180156  0.20275  ]
 [-0.083594  0.479313  0.00191875  0.20277  ]
 [-0.083592  0.478969  0.00184531  0.20275  ]]
You can also select which columns to extract from the file with the keyword usecols:
justSomeCols = numpy.loadtxt("./code/io/csv_example/fileWithHeader.csv",
delimiter=',', skiprows=3, usecols=(1,3))
print justSomeCols
[[ 0.479172  0.202813]
 [ 0.479313  0.202813]
 [ 0.479313  0.20275 ]
 [ 0.479313  0.20277 ]
 [ 0.478969  0.20275 ]]
You can get more information about numpy.loadtxt in the SciPy documentation, or by using the help function, as described in the page Basics - Help and information.
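One more loadtxt option worth knowing about: unpack=True transposes the result, so each column comes back as its own array without any slicing. The file below is written on the spot (with made-up values) so the sketch is self-contained:

```python
import numpy

# Write a small comma-separated file to load (illustrative data only)
with open("unpack_example.csv", "w") as f:
    f.write("1,3.023\n2,5.1\n6,23\n")

# unpack=True returns one array per column instead of one 2D array
x, y = numpy.loadtxt("unpack_example.csv", delimiter=',', unpack=True)
print(x)
print(y)
```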
The module csv has a syntax very similar to that of loadtxt, but is much more powerful in the sense that it can handle all sorts of data types. The official documentation states:
"The lack of a standard (for csv files) means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources."
This module provides classes to read and write tabular data from/to different formats. The most basic example of reading with this module would be:
import csv
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
    importantReader = csv.reader(ultraImportantFile, delimiter=' ')
    for row in importantReader:
        print row
['1', '3.023']
['2', '5.1']
['6', '23']
['6.6', '2']
['8', '9.23']
['8.1', '10.0001']
As you can see (look at the quotation marks), the reader stores the data as a list of strings separated by the delimiter. This is the default behaviour, since many csv files contain heterogeneous data. It is possible to use the function float to convert the data to floats (for example), like so:
with open("./code/io/csv_example/simpleDataset.csv", 'rb') as ultraImportantFile:
    data = list(csv.reader(ultraImportantFile, delimiter=' '))
print float(data[0][1])
3.023
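The same conversion can be applied to every value at once with a nested list comprehension. This is a sketch using the same space-delimited layout as simpleDataset.csv; the file is created inline (with made-up values) so the snippet runs on its own, and it uses Python 3 text-mode file handling:

```python
import csv

# Recreate a small space-delimited file like simpleDataset.csv (illustrative values)
with open("floats_example.csv", "w") as f:
    f.write("1 3.023\n2 5.1\n")

with open("floats_example.csv") as f:
    rows = list(csv.reader(f, delimiter=' '))

# Convert every string in every row to a float in one go
data = [[float(value) for value in row] for row in rows]
print(data)
```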
It is also possible to give meaningful names to the columns. For example, if the two columns in the file represent "current" and "voltage", we can use the DictReader class:
currentData = []
voltData = []
cols = ['current', 'voltage']
with open("./code/io/csv_example/simpleDatasetComma.txt", 'rb') as csvfile:
    for row in csv.DictReader(csvfile, fieldnames=cols, delimiter=','):
        # Convert non-string data here, e.g.:
        thiscurrent = float(row['current'])
        thisvoltage = float(row['voltage'])
        currentData.append(thiscurrent)
        voltData.append(thisvoltage)
print currentData, voltData
[1.0, 2.0, 6.0, 6.6, 8.0, 8.1] [3.023, 5.1, 23.0, 2.0, 9.23, 10.0001]
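If the file itself starts with a header row naming the columns, DictReader can pick the field names up automatically instead of being handed a fieldnames list. A sketch with a file written on the spot (made-up values, Python 3 text-mode file handling):

```python
import csv

# A file whose first row names the columns (illustrative values)
with open("header_example.csv", "w") as f:
    f.write("current,voltage\n1,3.023\n2,5.1\n")

currents = []
with open("header_example.csv") as f:
    # With no fieldnames argument, the first row supplies the dictionary keys
    for row in csv.DictReader(f, delimiter=','):
        currents.append(float(row['current']))
print(currents)
```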
So far we've looked at reading files in, but occasionally it's useful to be able to save processed data. If the data needs to be human-readable, or read again outside of the code that wrote it, then outputting as a csv is useful. If not (for example, you run a calculation that takes a long time and want to save the result so that the next run of the program can look it up instead of re-calculating), then pickled (just binary) data is the easiest format to use. We will give examples of both below.
Let's generate some csv data to eventually export:
import numpy as np
#generate random data
x = np.arange(-10,10,0.01)
y = np.sin(3*x**2)*np.cos(x)**2
# print the first few lines
print x[0:10], y[0:10]
[-10. -9.99 -9.98 -9.97 -9.96 -9.95 -9.94 -9.93 -9.92 -9.91] [-0.70386913 -0.57965427 -0.24754818 0.17987328 0.55413639 0.74256453 0.67521336 0.3704655 -0.06997309 -0.49529355]
Now let's write this into a two-column csv file. We write row-by-row, so first we need to convert the data into a 2D array. We do this using the built-in zip function:
xy = zip(x,y)
# look at the first few lines of this - note the format is different to the previous block
print xy[0:10]
[(-10.0, -0.70386913217899505), (-9.9900000000000002, -0.57965426709639589), (-9.9800000000000004, -0.24754817534860593), (-9.9700000000000006, 0.17987328163207517), (-9.9600000000000009, 0.55413639253598546), (-9.9500000000000011, 0.74256452644996929), (-9.9400000000000013, 0.67521336030495893), (-9.9300000000000015, 0.37046549973305676), (-9.9200000000000017, -0.069973092764842343), (-9.9100000000000019, -0.49529354803092768)]
import csv
filename = './code/io/csv_example/csv_write_example.csv'
with open(filename, 'wb') as csvfile:
    csv_writer = csv.writer(csvfile, delimiter=',')
    # if header lines are required, they can be written here
    header_line = ('Time (ms)', 'Voltage (V)')
    csv_writer.writerow(header_line)
    # write main block of data
    for xy_line in xy:
        csv_writer.writerow(xy_line)
That's it. More columns can be added simply by zipping more things together, e.g. zip(x,y,z,...). The csv file we just generated is saved at the path given above, if you want to inspect it.
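For purely numeric tables, numpy offers a one-line alternative: numpy.savetxt writes a 2D array straight to a text file, with optional header text. A minimal sketch (the filename and data here are made up for illustration):

```python
import numpy as np

x = np.arange(0, 5, 1.0)
y = x**2

# column_stack builds the two-column array; the header is prefixed with '#' by default
np.savetxt("savetxt_example.csv", np.column_stack((x, y)),
           delimiter=',', header='x,x squared')

# loadtxt skips '#' comment lines automatically, so this round-trips cleanly
back = np.loadtxt("savetxt_example.csv", delimiter=',')
print(back.shape)
```

Because the header is written as a comment line, no skiprows argument is needed when reading the file back.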
Let's generate some data that takes a while to process. A large 2D array should do the job nicely for now. Let's plot it as well, while we're at it. And for comparison purposes, let's time how long it takes...
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import time
st = time.clock()
# make large arrays
x = np.arange(-100,100,0.1)
y = np.arange(-200,200,0.2)
X,Y = np.meshgrid(x,y)
Z = np.exp(-(X**2+Y**2)/80**2)*np.cos(np.sqrt(X**2+Y**2)/20)**2
print 'Elapsed time (ms):', (time.clock() - st)*1e3
plt.imshow(Z)
Elapsed time (ms): 297.011865079
Ok, 4 million points (2k x 2k) takes a little while to process. Let's save that as a csv, and then save it by pickling, and then try and read them back in:
#save csv
st = time.clock()
import csv
fn_csv = './code/io/csv_example/big_array.csv'
with open(fn_csv, 'wb') as csvfile:
    csv_writer = csv.writer(csvfile)
    for z_line in Z:
        csv_writer.writerow(z_line)
print 'How long did that take? (s)', time.clock() - st
# how big is this file..?
import os
print 'File size (MB):', os.path.getsize(fn_csv)/2**20
How long did that take? (s) 5.22988936046
File size (MB): 80
Quite a large file...
Let's try pickle instead...
#now pickle it instead
import cPickle as pickle
fn_pkl = './code/io/csv_example/big_array.pkl' # note you can have whatever extension you want here,
# or none at all, but I prefer a sensible extension
# pickle it
st = time.clock()
pickle.dump(Z,open(fn_pkl,'wb'))
print 'And this time... (s)', time.clock() - st
print 'File size (MB):', os.path.getsize(fn_pkl)/2**20
And this time... (s) 2.53754202893
File size (MB): 80
So the file sizes are the same, but the csv writer takes much longer. Now let's try reading them back in. Here's a fairly generic function for reading csv files into data arrays:
def read_file_data(filename):
    with open(filename, 'U') as f:
        DataIn = csv.reader(f, delimiter=',')
        DataOut = []
        # find number of columns
        DataLine = DataIn.next()
        NCols = len(DataLine)
        for i in range(NCols):
            DataOut.append([])
        print NCols, len(DataOut)
        for row in DataIn:
            for i in range(NCols):
                try:
                    DataOut[i].append(float(row[i]))
                except ValueError:  # if not numeric
                    DataOut[i].append(0)
    return DataOut
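For what it's worth, numpy.genfromtxt does much the same job as this hand-rolled reader: it reads delimited text and can substitute a fill value wherever a field is missing, rather than raising an error. A sketch, with the input file (and its deliberately missing value) created inline so it runs as-is:

```python
import numpy as np

# A file with a hole in it: the second value of row two is missing (made-up data)
with open("gaps_example.csv", "w") as f:
    f.write("1,3.0\n2,\n3,5.0\n")

# Empty fields count as missing; filling_values supplies their replacement
data = np.genfromtxt("gaps_example.csv", delimiter=',', filling_values=0)
print(data)
```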
#read in csv file
st = time.clock()
Z = read_file_data(fn_csv)
print 'Elapsed time (ms):', (time.clock() - st)*1e3
2000 2000
Elapsed time (ms): 3473.91947552
Now compare with reading in from the pickled file
st = time.clock()
Z = pickle.load(open(fn_pkl,'rb'))
print 'Elapsed time (ms):', (time.clock() - st)*1e3
Elapsed time (ms): 2868.40562932
This might not seem like much of a speed-up, but if the file sizes get larger then this becomes a sizeable increase in performance.
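For pure numpy arrays there is a third option worth knowing: numpy.save writes the array in numpy's own binary .npy format, which avoids text formatting entirely and loads back as an array directly. A minimal sketch (the filename and array here are invented for illustration):

```python
import numpy as np

Z = np.ones((100, 100))

# .npy is numpy's native binary format; no text conversion is involved
np.save("array_example.npy", Z)

# np.load returns the array directly, with shape and dtype preserved
Z_back = np.load("array_example.npy")
print(Z_back.shape)
```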
However, the pickle module is most useful when storing many data types, as there is no need to format the data before saving:
# pickle example for mixed data types
# numpy 1d array
x = np.arange(-100,100,0.01)
# numpy 2d array
y = np.ones((500,500))
# list of mixed type
z = [1,4,6,0,'abcde']
# string
a = 'this is a string'
# tuple
b = (42,'anything')
Now let's say we want to save all of this data. Instead of writing many files, it can all be bundled into one pickled file:
fn_pkl = './code/io/csv_example/multi_out.pkl'
#save
pickle.dump([x,y,z,a,b],open(fn_pkl,'wb'))
And then to read it back in:
x2,y2,z2,a2,b2 = pickle.load(open(fn_pkl,'rb'))
#check the data is the same as what we put in...
print 'Data arrays same?\n', z2==z, a2==a, b2==b
Data arrays same? True True True
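A small refinement on the list above: pickling a dictionary instead gives each object a name, so you don't have to remember the unpacking order. A sketch using the standard pickle module and Python 3 syntax (the filename and contents are made up here):

```python
import pickle
import numpy as np

# Bundle mixed objects under descriptive keys (illustrative data)
bundle = {'x': np.arange(3), 'label': 'calibration run', 'params': (42, 'anything')}

with open("bundle_example.pkl", 'wb') as f:
    pickle.dump(bundle, f)

with open("bundle_example.pkl", 'rb') as f:
    loaded = pickle.load(f)

# Items come back by name rather than by position
print(loaded['label'], loaded['params'])
```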