The Python standard library supports many data types. The ones you find most important will depend on the data you work with most often. Nevertheless, most introductory Python texts list the primary standard data types as numbers, strings, lists, tuples and dictionaries.
There are several standard numerical data types. The ones I encounter most often are floats and integers:
x_int = 20 #this is an integer
x_float = 20.0 #this is a floating point number
If you are using Python 2.x, the important thing to note is that arithmetic operations involving only integers will produce integers:
x = 5
y = 3
z = x/y
print z
1
If one of the numbers involved in the operation is a float then the result is a float:
z = 10/3.
print z
3.33333333333
In Python 3, / represents true (floating point) division. You can import this behavior into Python 2.x by doing:
from __future__ import division
print 5/2
2.5
Other numerical data types are complex numbers and long integers.
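If you want code that behaves the same way regardless of the Python version, one option is to be explicit about which kind of division you mean. The snippet below is only a sketch of that idea; it reuses the x and y values from above, and the other variable names are my own.

```python
x = 5
y = 3

floor_quotient = x // y         # floor division: integer result for integer operands
remainder = x % y               # remainder of the floor division
float_quotient = float(x) / y   # converting one operand forces a float result

print(floor_quotient)
print(remainder)
print(float_quotient)

# Complex numbers use the j suffix for the imaginary part.
c = 3 + 4j
print(c.real)
print(c.imag)
print(abs(c))   # magnitude of the complex number
```

The // operator and the explicit float() conversion give the same answers in Python 2.x and Python 3, so they are a safe choice when you do not know where the code will run.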
Strings may be defined using single, double or triple quotes. In most cases, you can use single and double quotes interchangeably:
string1 = 'datatypes'
string2 = "Don't stop believin'"
print string1
print string2
datatypes
Don't stop believin'
You can access parts of a string by using the slicing syntax:
print string1[0] #first character
print string1[0:2] #first two characters
print string1[2:len(string1)] #everything after second character
print string1[2:5]
print string1[-2:] #last two characters
print string1[0:len(string1)]
print string1[:]
d
da
tatypes
tat
es
datatypes
datatypes
Triple quotes are normally reserved for text spanning several lines and are convenient for writing docstrings:
def my_function():
    """This is a simple function to demonstrate
    the use of triple quotes. You would normally
    put a brief description of your function in this space."""
    pass #Tells the interpreter: nothing to see here, carry on.
The important thing to note with strings is that they are immutable. That is, you can't change any character of a string after it's created:
a = 'asasa'
#a[2] = 'd' #raises a TypeError
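Since you can't modify a string in place, the usual workaround is to build a new string from pieces of the old one. Here is a small sketch of that, using the same string a as above (the names b and c are just for illustration):

```python
a = 'asasa'

try:
    a[2] = 'd'   # item assignment is not supported for strings
except TypeError as err:
    print("TypeError: %s" % err)

# Build a new string from slices of the old one instead.
b = a[:2] + 'd' + a[3:]
print(b)

# String methods also return new strings and leave the original untouched.
c = a.replace('s', 'd')
print(c)
print(a)
```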
A Python list is exactly what you think it is: an ordered sequence of objects. A Python list is kind of like a MATLAB cell array. Here are some lists:
num_list = [2, 3, 210, 4]
str_list = ['a', 'sdff', 'asd']
rand_list = [[12, 10], 'a', 1, [10, 'x', 12]]
print "num_list: %s" %num_list
print "str_list: %s" %str_list
print "rand_list: %s" %rand_list
num_list: [2, 3, 210, 4]
str_list: ['a', 'sdff', 'asd']
rand_list: [[12, 10], 'a', 1, [10, 'x', 12]]
You can slice a list the same way you would a string. To access an element of a list nested within a list, use an extra set of brackets:
print rand_list[0][1]
10
Unlike strings, lists are mutable.
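To make the point about mutability concrete, here is a short sketch. The list name values and its contents are made up for this example. It also shows one consequence of mutability that tends to surprise people: assigning a list to a second name does not copy it.

```python
values = [2, 3, 210, 4]

values[2] = 21       # replace an element in place
values.append(5)     # add an element to the end
print(values)

alias = values       # a second name for the same list object
copied = values[:]   # a full slice makes a (shallow) copy

alias[0] = -1        # this change is visible through both names
print(values)
print(copied)        # the copy is unaffected
```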
List comprehension is a convenient and powerful shorthand for generating lists. Here are some examples:
#generate a list that contains the square of each number in num_list:
num_list_sq = [num**2 for num in num_list ]
print num_list_sq
#generate a list that contains the square of each number in num_list
#that is exactly divisible by 3; all other numbers are kept unchanged
num_list_sq2 = [num**2 if num%3==0 else num for num in num_list]
print num_list_sq2
[4, 9, 44100, 16]
[2, 9, 44100, 4]
The last list comprehension is basically shorthand for the code below:
num_list_sq3 = []
for num in num_list:
    if num%3==0:
        num_list_sq3.append(num**2)
    else:
        num_list_sq3.append(num)
print num_list_sq3
[2, 9, 44100, 4]
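Note that where you put the if matters. In the example above, the if/else sits before the for and chooses a value for every element, so the output has the same length as the input. If you put the if after the for instead, it acts as a filter and drops the elements that fail the test. A quick sketch of the difference (the output variable names are mine):

```python
num_list = [2, 3, 210, 4]

# if/else before the for: every element produces an output value
replaced = [num**2 if num % 3 == 0 else num for num in num_list]
print(replaced)

# if after the for: only elements that pass the test are kept
filtered = [num**2 for num in num_list if num % 3 == 0]
print(filtered)
```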
Tuples are like lists except that they are immutable. Here are some tuples:
yr_strs = ('year1', 'year2', 'year3')
yr_nums = 2009, 2010, 2011
tupl = ('a',) #a one-element tuple needs the trailing comma
tup_num = tuple(num_list)
print yr_nums
print tup_num
(2009, 2010, 2011)
(2, 3, 210, 4)
Tuples are useful for passing several values to a format string:
print "%s: %s" %(yr_strs[0], yr_nums[1])
year1: 2010
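Two other things you will probably do with tuples are unpacking them into separate variables and, because they are immutable, building new tuples rather than modifying existing ones. Here is a brief sketch using the yr_nums tuple from above; the other names are only for illustration.

```python
yr_nums = 2009, 2010, 2011

# Unpacking: one variable per element
first_yr, second_yr, third_yr = yr_nums
print(second_yr)

try:
    yr_nums[0] = 2008   # tuples do not support item assignment
except TypeError as err:
    print("TypeError: %s" % err)

# Concatenation creates a new tuple; the original is unchanged.
yr_more = yr_nums + (2012,)
print(yr_more)
print(yr_nums)
```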
Dictionaries are mutable containers for Python objects. Dictionaries consist of keys and their associated values. Here is a simple dictionary:
contacts = {'John': 6462100, 'Adam': 6461870}
print contacts
{'John': 6462100, 'Adam': 6461870}
Dictionaries are a good way to organize data:
mooring1 = {'Date': '2010-09-13', 'Lat': 10.1,'Lon': 78.5, 'salinity': [28.8, 31.3, 34.5, 35.1]}
print mooring1['Lon']
78.5
Alternatively, one may use the dict function:
dict_ex = dict(zip(['a','b', 'c', 'd', 'e'], range(5)))
print dict_ex
{'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}
print dict_ex.keys()
print dict_ex.values()
['a', 'c', 'b', 'e', 'd']
[0, 2, 1, 4, 3]
dict_ex['a']
0
Python dictionaries are similar to MATLAB structs. In fact, some tools that import MATLAB data into Python convert MATLAB structs to Python dictionaries.
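Because dictionaries are mutable, you can add, update and remove entries after the dictionary is created. The sketch below reuses the contacts dictionary from above; the extra names and phone numbers are invented for the example.

```python
contacts = {'John': 6462100, 'Adam': 6461870}

contacts['Maria'] = 6463344   # add a new key (made-up entry)
contacts['John'] = 6460000    # overwrite an existing value
del contacts['Adam']          # remove a key

# get() returns None (or a default you supply) instead of raising KeyError
missing = contacts.get('Zoe')
print(missing)

# Sorting the keys gives a predictable order when looping
for name in sorted(contacts):
    print("%s: %s" % (name, contacts[name]))
```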
The fundamental NumPy data type is the ndarray:
import numpy as np
x = np.array([12, 10, 3])
print x
[12 10 3]
Here are some ways to create numpy arrays:
a = np.arange(5,10,0.5) #note: endpoint is excluded
print a
[ 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5]
print np.linspace(0,20,6) #note: endpoint is included
[ 0. 4. 8. 12. 16. 20.]
b = np.zeros((3,3))
print b
[[ 0. 0. 0.]
 [ 0. 0. 0.]
 [ 0. 0. 0.]]
ones3d = np.ones((2,5,2))
print ones3d
[[[ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]]

 [[ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]
  [ 1. 1.]]]
rand_array = np.random.randint(3,10, (4,5))
print rand_array
print rand_array.shape
[[6 9 7 9 7]
 [6 4 4 3 4]
 [8 5 3 3 3]
 [4 9 3 9 4]]
(4, 5)
a = np.arange(10)
b = np.ones((10,1))
print a*b.T
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
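Note the difference between 1-D and 2-D "vectors": a has shape (10,), while b has shape (10, 1) and b.T has shape (1, 10). Broadcasting makes the product a*b.T a 2-D array of shape (1, 10), which is why the output has double brackets.
Another important NumPy data type is the masked array. This is useful when working with arrays that contain "bad" data. Here is how you would create a masked array: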
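The distinction between 1-D arrays and 2-D row or column "vectors" is easiest to see by looking at shapes. The following is a minimal sketch with a shorter array than the one above; the names row and col are my own.

```python
import numpy as np

a = np.arange(3)        # 1-D array, shape (3,)
row = a.reshape(1, 3)   # 2-D row vector, shape (1, 3)
col = a.reshape(3, 1)   # 2-D column vector, shape (3, 1)

print(a.shape)
print(a.T.shape)        # transposing a 1-D array changes nothing
print(row.T.shape)      # transposing a row vector gives a column vector

# Broadcasting decides the shape of the result
print((a * row).shape)  # (3,) with (1, 3) gives (1, 3)
print((a * col).shape)  # (3,) with (3, 1) gives (3, 3)
```

If you come from MATLAB, the main thing to remember is that a 1-D NumPy array is neither a row nor a column, so you need reshape (or np.newaxis) to get an explicit 2-D vector.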
x = np.arange(10)
print x
x_ma = np.ma.masked_where(x>5, x)
print x_ma
print x_ma.mask
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 -- -- -- --]
[False False False False False False True True True True]
Masked arrays are more convenient than using NaNs because standard NumPy operations such as np.mean automatically ignore masked values:
x_ma_mean = np.mean(x_ma)
print x_ma_mean
2.5
If you were using NaNs, you would need to use the nanmean function to ignore them:
y = np.array([1, 2, 3, np.nan, 5])
print np.mean(y)
print np.nanmean(y)
nan
2.75
Here is how you would convert an array that has NaNs into a masked array:
y_ma = np.ma.masked_where(np.isnan(y), y, copy=True)
y_ma2 = np.ma.masked_invalid(y)
print y_ma.mean()
y[2] = 10 #if copy is set to False, then this modifies y_ma as well
print y
print y_ma
2.75
[ 1. 2. 10. nan 5.]
[1.0 2.0 3.0 -- 5.0]
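Once you have a masked array, a few of its methods are handy for getting back to plain arrays or for checking how much valid data you have. Here is a short sketch using the same y as above; the fill value of -999 is an arbitrary choice for the example.

```python
import numpy as np

y = np.array([1, 2, 3, np.nan, 5])
y_ma = np.ma.masked_invalid(y)   # masks NaNs (and infs)

n_valid = y_ma.count()           # number of unmasked values
valid_only = y_ma.compressed()   # 1-D plain array of the unmasked values
filled = y_ma.filled(-999.)      # plain array with masked slots replaced
total = y_ma.sum()               # masked values are skipped

print(n_valid)
print(valid_only)
print(filled)
print(total)
```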
Structured arrays are essentially NumPy arrays that allow you to store data under field names rather than just numerical indices. They are convenient for storing a set of data arrays that have the same length or share a common axis (e.g. time, depth, etc.). Structured arrays are defined using a dtype object. There are several ways to do this. In the example below, I define the dtype object using the list method. The list consists of tuples, and each tuple specifies a field name, a data type and, optionally, a shape.
Simple example:
dtype_list1 = [('x','f4'), ('y','f4'), ('z','f4')]
coords = np.zeros((5,1), dtype=dtype_list1)
print coords
[[(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]]
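The optional third element of each tuple sets the shape of that field, so a single record can hold a small array. As a rough sketch (the field names station and temp and the values are made up for illustration):

```python
import numpy as np

# Each record holds one integer and a 3-element float array.
dtype_profile = [('station', 'i4'), ('temp', 'f4', (3,))]
profiles = np.zeros(2, dtype=dtype_profile)

profiles['station'] = [101, 102]
profiles['temp'][0] = [28.5, 27.0, 25.5]   # fill the sub-array of the first record

print(profiles.shape)           # shape of the structured array itself
print(profiles['temp'].shape)   # the field shape is appended to the array shape
print(profiles['temp'][0])
```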
Another example:
dtype_list2 = [('temp','f4'), ('sal','f4'), ('lon','f4'), ('lat','f4')]
gridded_data = np.zeros((4,5), dtype=dtype_list2)
print gridded_data
[[(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]]
With the above lines, I created a 4x5 structured array of zeros that has four fields: temp, sal, lon and lat. Each field is preallocated to store 32-bit floating point numbers, specified by the f4 type code. Each field is a scalar by default, but fields can also hold n-d arrays. Let's fill the structured array with some arbitrary values:
gridded_data['temp'] = np.random.randint(20,30,(4,5))
gridded_data['sal'] = np.random.randint(30,35,(4,5))
lon_1d = np.linspace(10,30,5) #5 longitudes, one per column
lat_1d = np.linspace(0,20,4) #4 latitudes, one per row
lon,lat = np.meshgrid(lon_1d,lat_1d) #both outputs have shape (4,5)
gridded_data['lon'] = np.round(lon)
gridded_data['lat'] = np.round(lat)
print gridded_data
[[(24.0, 30.0, 10.0, 0.0) (29.0, 34.0, 15.0, 0.0) (22.0, 33.0, 20.0, 0.0) (28.0, 34.0, 25.0, 0.0) (22.0, 33.0, 30.0, 0.0)]
 [(29.0, 31.0, 10.0, 7.0) (27.0, 30.0, 15.0, 7.0) (20.0, 30.0, 20.0, 7.0) (28.0, 31.0, 25.0, 7.0) (27.0, 31.0, 30.0, 7.0)]
 [(20.0, 33.0, 10.0, 13.0) (24.0, 32.0, 15.0, 13.0) (24.0, 33.0, 20.0, 13.0) (20.0, 34.0, 25.0, 13.0) (26.0, 33.0, 30.0, 13.0)]
 [(20.0, 33.0, 10.0, 20.0) (29.0, 30.0, 15.0, 20.0) (27.0, 31.0, 20.0, 20.0) (24.0, 31.0, 25.0, 20.0) (23.0, 30.0, 30.0, 20.0)]]
You can access a structured array either by using field names, indices or both:
print gridded_data['sal']
[[ 30. 34. 33. 34. 33.]
 [ 31. 30. 30. 31. 31.]
 [ 33. 32. 33. 34. 33.]
 [ 33. 30. 31. 31. 30.]]
print gridded_data['lon']
[[ 10. 15. 20. 25. 30.]
 [ 10. 15. 20. 25. 30.]
 [ 10. 15. 20. 25. 30.]
 [ 10. 15. 20. 25. 30.]]
print gridded_data[['temp', 'lat']]
[[(24.0, 0.0) (29.0, 0.0) (22.0, 0.0) (28.0, 0.0) (22.0, 0.0)]
 [(29.0, 7.0) (27.0, 7.0) (20.0, 7.0) (28.0, 7.0) (27.0, 7.0)]
 [(20.0, 13.0) (24.0, 13.0) (24.0, 13.0) (20.0, 13.0) (26.0, 13.0)]
 [(20.0, 20.0) (29.0, 20.0) (27.0, 20.0) (24.0, 20.0) (23.0, 20.0)]]
Also note the shape of each field:
gridded_data.shape
(4, 5)
np.mean(gridded_data['temp'], axis=0)
array([ 23.25, 27.25, 23.25, 25. , 24.5 ], dtype=float32)
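The examples above only index by field name. To show what combining field names with numerical indices looks like, here is a small sketch on a fresh array with the same dtype. The array name grid and the values I put into it are invented so that the results are predictable.

```python
import numpy as np

dtype_list2 = [('temp','f4'), ('sal','f4'), ('lon','f4'), ('lat','f4')]
grid = np.zeros((4, 5), dtype=dtype_list2)
grid['temp'] = np.arange(20).reshape(4, 5)   # temp values 0..19, row by row

record = grid[1, 2]              # index first: one record with all four fields
temp_a = record['temp']          # then pick a field from that record
temp_b = grid['temp'][1, 2]      # field first, then index: same value
row_temps = grid[1]['temp']      # all temp values in row 1

print(temp_a)
print(temp_b)
print(row_temps)

# A boolean mask built from one field can be used to set another field.
grid['sal'][grid['temp'] > 15] = 35.0
print(grid['sal'])
```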
The datetime module provides functions and classes for working with dates and times. There is the standard, built-in Python version, and NumPy provides its own date and time type called datetime64.
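As a quick taste of both, here is a minimal sketch. The dates are arbitrary (I borrowed the mooring date from the dictionary example), and the variable names are my own.

```python
import datetime
import numpy as np

# Standard library: datetime objects and timedelta arithmetic
start = datetime.datetime(2010, 9, 13, 12, 0)
later = start + datetime.timedelta(days=1, hours=6)
elapsed = later - start

print(later.strftime('%Y-%m-%d %H:%M'))
print(elapsed.total_seconds())

# NumPy: datetime64 scalars and arrays with a fixed time unit (days here)
day = np.datetime64('2010-09-13')
two_days_on = day + np.timedelta64(2, 'D')
days = np.arange('2010-09-13', '2010-09-16', dtype='datetime64[D]')

print(two_days_on)
print(days)
```

The NumPy version is the one to reach for when you need a whole array of time stamps, since datetime64 values can be stored in an ndarray and used in vectorized arithmetic.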