Standard data types

Python supports many built-in data types. The ones you find most important will depend on the data you work with most often. Nevertheless, most introductory Python texts list the primary standard data types as numbers, strings, lists, tuples and dictionaries.

Numbers

There are several standard numerical data types. The ones I encounter most often are floats and integers:

In [1]:

x_int = 20 #this is an integer
x_float = 20.0 #this is a floating point number

If you are using Python 2.x, the important thing to note is that arithmetic operations involving only integers will produce integers:

In [2]:

x = 5
y = 3
z = x/y
print z

1

If one of the numbers involved in the operation is a float then the result is a float:

In [3]:

z = 10/3.
print z

3.33333333333

In Python 3, / always performs floating point (true) division, even when both operands are integers. You can import this behavior into Python 2.x by doing:

In [4]:

from __future__ import division

print 5/2

2.5
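If you still want integer division once / means true division, the // operator is the usual tool. The lines below are a small sketch written in Python 3 syntax (note the print function); the variable names are mine and only for illustration:

```python
# Python 3 semantics: / is true division, // is floor division.
true_quotient = 5 / 2        # float result
floor_quotient = 5 // 2      # integer result, rounded down
negative_floor = -5 // 2     # floors toward negative infinity, not toward zero
quotient, remainder = divmod(5, 2)  # both parts of the division at once

print(true_quotient, floor_quotient, negative_floor, quotient, remainder)
```

The negative case is worth remembering: floor division rounds down, so -5 // 2 is -3 rather than -2.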

Other numerical data types are complex numbers and long integers.
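Here is a brief sketch of both, written in Python 3 syntax. Note that Python 2 has a separate long type for large integers, whereas in Python 3 the plain int type grows as needed. The variable names are illustrative only:

```python
z = 3 + 4j                   # complex literals use the j suffix
magnitude = abs(z)           # modulus of the complex number
conjugate = z.conjugate()    # complex conjugate

big = 2 ** 100               # far beyond 64 bits; Python switches to arbitrary precision

print(z.real, z.imag, magnitude, conjugate)
print(big)
```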

Strings

Strings may be defined using single, double or triple quotes. In most cases, you can use single and double quotes interchangeably:

In [5]:

string1 = 'datatypes'
string2 = "Don't stop believin'"
print string1
print string2

datatypes
Don't stop believin'

You can access parts of a string by using the slicing syntax:

In [6]:

print string1[0] #first character
print string1[0:2] #first two characters
print string1[2:len(string1)] #everything after second character
print string1[2:5]
print string1[-2:] #last two characters
print string1[0:len(string1)]
print string1[:]

d
da
tatypes
tat
es
datatypes
datatypes

Triple quotes are normally reserved for text spanning several lines and are convenient for writing docstrings:

In [7]:

def my_function():
    """ This is a simple function to demonstrate
    the use of triple quotes. You would normally
    put a brief description of your function in this space."""
    pass #Tells interpreter: nothing to see here, carry on.

The important thing to note with strings is that they are immutable. That is, you can't change any character of a string after it's created:

In [8]:

a = 'asasa'
#a[2] = 'd' #throws error
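To make the point concrete, here is a small sketch (Python 3 syntax) that catches the error and then builds a new string from slices instead. The names b and error_message are mine:

```python
a = 'asasa'

try:
    a[2] = 'd'               # item assignment is not supported on strings
except TypeError as err:
    error_message = str(err)

# Build a new string from slices instead of modifying the old one in place.
b = a[:2] + 'd' + a[3:]

print(error_message)
print(a, b)
```

The original string a is untouched; b is a separate object.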

Lists

A Python list is exactly what you think it is: a list, or sequence, of objects. A Python list is somewhat like a MATLAB cell array. Here are some lists:

In [9]:

num_list = [2, 3, 210, 4]
str_list = ['a', 'sdff', 'asd']
rand_list = [[12, 10], 'a', 1, [10, 'x', 12]]

print "num_list: %s" %num_list
print "str_list: %s" %str_list
print "rand_list: %s" %rand_list

num_list: [2, 3, 210, 4]
str_list: ['a', 'sdff', 'asd']
rand_list: [[12, 10], 'a', 1, [10, 'x', 12]]

You can slice into a list the same way you would a string. To access a list within a list, you would use an extra set of brackets:

In [10]:

print rand_list[0][1]

10

Unlike strings, lists are mutable.
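A quick sketch of what mutability means in practice (Python 3 syntax; the names alias and copied are mine). Because a list can change in place, two names that point at the same list see the same changes, while a sliced copy does not:

```python
num_list = [2, 3, 210, 4]
alias = num_list             # a second name for the same list object
copied = num_list[:]         # a full slice makes a shallow copy

num_list[0] = 99             # replace an element in place
num_list.append(5)           # grow the list in place

print(num_list)
print(alias)
print(copied)
```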

List comprehensions

List comprehensions are a convenient and powerful shorthand for generating lists. Here are some examples:

In [11]:

#generate a list that contains the square of each number in num_list:
num_list_sq = [num**2 for num in num_list ]
print num_list_sq

#generate a list that contains the square of each number in num_list
#if the number is exactly divisible by 3, and the number itself otherwise
num_list_sq2 = [num**2 if num%3==0 else num for num in num_list ]
print num_list_sq2

[4, 9, 44100, 16]
[2, 9, 44100, 4]

The last list comprehension is basically shorthand for the code below:

In [12]:

num_list_sq3 = []
for num in num_list:
    if num%3==0:
        num_list_sq3.append(num**2)
    else: 
        num_list_sq3.append(num)

print num_list_sq3

[2, 9, 44100, 4]
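It is easy to confuse this form with a filtering comprehension, so here is a side-by-side sketch (Python 3 syntax; mapped and filtered are names I made up). Putting the condition before "for" keeps every element, while putting "if" after the loop drops the elements that fail the test:

```python
num_list = [2, 3, 210, 4]

# Conditional expression before "for": every element produces an output.
mapped = [num**2 if num % 3 == 0 else num for num in num_list]

# "if" clause after "for": elements that fail the test are dropped.
filtered = [num**2 for num in num_list if num % 3 == 0]

print(mapped)
print(filtered)
```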

Tuples

Tuples are like lists except that they are immutable. Here are a few ways to create tuples:

In [13]:

yr_strs = ('year1', 'year2', 'year3')
yr_nums = 2009, 2010, 2011
tupl = ('a',)

tup_num = tuple(num_list)
print yr_nums
print tup_num

(2009, 2010, 2011)
(2, 3, 210, 4)

Tuples are useful for passing several values to a format string:

In [14]:

print "%s: %s" %(yr_strs[0], yr_nums[1])

year1: 2010
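Two other things tuples are handy for are unpacking and, because they are immutable, serving as dictionary keys. The sketch below uses Python 3 syntax; the station_depths dictionary and its depth value are hypothetical and exist only to illustrate the idea:

```python
yr_nums = 2009, 2010, 2011

first, second, third = yr_nums       # unpack a tuple into separate names

try:
    yr_nums[0] = 1999                # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)

# Immutable tuples are hashable, so they can be used as dictionary keys.
# Hypothetical example: (lat, lon) -> water depth in metres.
station_depths = {(10.1, 78.5): 120.0}

print(first, second, third)
print(error_message)
print(station_depths[(10.1, 78.5)])
```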

Dictionaries

Dictionaries are mutable containers for Python objects. Dictionaries consist of keys and their associated values. Here is a simple dictionary:

In [15]:

contacts = {'John': 6462100, 'Adam': 6461870}

In [16]:

print contacts

{'John': 6462100, 'Adam': 6461870}

Dictionaries are a good way to organize data:

In [17]:

mooring1 = {'Date': '2010-09-13', 'Lat': 10.1,'Lon': 78.5, 'salinity': [28.8, 31.3, 34.5, 35.1]}
print mooring1['Lon']

78.5

Instead of the curly-brace syntax, you can also build a dictionary with the dict function:

In [18]:

dict_ex = dict(zip(['a','b', 'c', 'd', 'e'], range(5)))
print dict_ex

{'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3}

In [19]:

print dict_ex.keys()
print dict_ex.values()

['a', 'c', 'b', 'e', 'd']
[0, 2, 1, 4, 3]

In [20]:

dict_ex['a']

Out[20]:

0

Python dictionaries are similar to MATLAB structs. In fact, when you load a MAT-file into Python with scipy.io.loadmat, its contents are returned as a Python dictionary keyed by variable name.
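Since dictionaries are mutable, you can add, look up and remove entries after creation. Below is a minimal sketch in Python 3 syntax; the 'Maria' entry and its number are made up for the example:

```python
contacts = {'John': 6462100, 'Adam': 6461870}

contacts['Maria'] = 6463344          # add a new key (hypothetical entry)
contacts['John'] = 6462101           # overwrite an existing value

# get() returns a default instead of raising KeyError for a missing key.
missing = contacts.get('Zoe', None)

del contacts['Adam']                 # remove an entry

# Iterate over keys in sorted order for reproducible output.
for name in sorted(contacts):
    print(name, contacts[name])
```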

NumPy datatypes

N-dimensional arrays

The fundamental NumPy data type is the ndarray:

In [21]:

import numpy as np

x = np.array([12, 10, 3])
print x

[12 10  3]

Here are some ways to create numpy arrays:

In [22]:

a = np.arange(5,10,0.5) #note: endpoint is excluded
print a

[ 5.   5.5  6.   6.5  7.   7.5  8.   8.5  9.   9.5]

In [23]:

print np.linspace(0,20,6) #note: endpoint is included

[  0.   4.   8.  12.  16.  20.]

In [24]:

b = np.zeros((3,3))
print b

[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]

In [25]:

ones3d = np.ones((2,5,2))
print ones3d

[[[ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]]

 [[ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]
  [ 1.  1.]]]

In [26]:

rand_array = np.random.randint(3,10, (4,5))
print rand_array
print rand_array.shape

[[6 9 7 9 7]
 [6 4 4 3 4]
 [8 5 3 3 3]
 [4 9 3 9 4]]
(4, 5)

In [27]:

a = np.arange(10)
b = np.ones((10,1))
print a*b.T

[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]]

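Note that a is a 1-D array with shape (10,), while b.T is a 2-D array with shape (1, 10). Multiplying them broadcasts a across the single row of b.T, so the result is a 2-D array with shape (1, 10) rather than a 1-D vector.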
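The difference between 1-D and 2-D "vectors" is easiest to see by printing shapes. This is a small sketch in Python 3 syntax that reuses a and b from the cell above; the other names are mine:

```python
import numpy as np

a = np.arange(10)                # 1-D array, shape (10,)
b = np.ones((10, 1))             # 2-D column "vector", shape (10, 1)

row_product = a * b.T            # (10,) with (1, 10) broadcasts to (1, 10)
outer_product = a * b            # (10,) with (10, 1) broadcasts to (10, 10)
a_column = a[:, np.newaxis]      # turn the 1-D array into a (10, 1) column

print(a.shape, a.T.shape)        # transposing a 1-D array changes nothing
print(row_product.shape, outer_product.shape, a_column.shape)
```

The (10, 10) result is the one that tends to surprise people: a 1-D array multiplied by a column silently produces a full matrix.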

Masked arrays

Another important NumPy data type is the masked array. This is useful when working with arrays that contain "bad" data. Here is how you would create a masked array:

In [28]:

x = np.arange(10)
print x
x_ma = np.ma.masked_where(x>5, x)
print x_ma
print x_ma.mask

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 -- -- -- --]
[False False False False False False  True  True  True  True]

Masked arrays are more convenient than using NaNs because standard NumPy operations such as np.mean automatically ignore masked values:

In [29]:

x_ma_mean = np.mean(x_ma)
print x_ma_mean

2.5

If you were using NaNs, you would need to use the nanmean function to ignore them:

In [30]:

y = np.array([1, 2, 3, np.nan, 5])
print np.mean(y)
print np.nanmean(y)

nan
2.75

Here is how you would convert an array that contains NaNs into a masked array:

In [31]:

y_ma = np.ma.masked_where(np.isnan(y), y, copy=True)
y_ma2 = np.ma.masked_invalid(y)
print y_ma.mean()
y[2] = 10 #if copy is set to False, then this modifies y_ma as well
print y
print y_ma

2.75
[  1.   2.  10.  nan   5.]
[1.0 2.0 3.0 -- 5.0]
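Once you have a masked array, a few methods make it easy to get back to ordinary arrays. The sketch below (Python 3 syntax) rebuilds the same y; the fill value of -999.0 is an arbitrary placeholder I picked for the example:

```python
import numpy as np

y = np.array([1, 2, 3, np.nan, 5])
y_ma = np.ma.masked_invalid(y)       # mask NaNs (and infs)

valid_count = y_ma.count()           # number of unmasked values
valid_values = y_ma.compressed()     # 1-D ndarray holding only unmasked values
filled = y_ma.filled(-999.0)         # plain ndarray with masked slots replaced

print(valid_count)
print(valid_values)
print(filled)
```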

Structured arrays

Structured arrays are essentially NumPy arrays that let you store data under field names rather than numerical indices. They are convenient for storing a set of data arrays that have the same length or share a common axis (e.g. time, depth, etc.). Structured arrays are defined using a dtype object. There are several ways to do this. In the example below, I define the dtype object using the list method. The list consists of tuples, and each tuple specifies a field name, a data type and, optionally, a shape.

Simple example:

In [32]:

dtype_list1 = [('x','f4'), ('y','f4'), ('z','f4')]
coords = np.zeros((5,1), dtype=dtype_list1)
print coords

[[(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0)]]

Another example:

In [33]:

dtype_list2 = [('temp','f4'), ('sal','f4'), ('lon','f4'), ('lat','f4')]
gridded_data = np.zeros((4,5), dtype=dtype_list2)
print gridded_data

[[(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)
  (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)
  (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)
  (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]
 [(0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)
  (0.0, 0.0, 0.0, 0.0) (0.0, 0.0, 0.0, 0.0)]]

With the above lines, I created a 4x5 structured array of zeros that has four fields: temp, sal, lon and lat. Each field is preallocated to store 32-bit floating point numbers, as specified by the f4 type code. Each field holds a scalar by default, but a field can also hold an n-d array. Let's fill the structured array with some arbitrary values:
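As a sketch of a non-scalar field, the dtype below adds the optional shape to one of the tuples. The field names (time, temp) and the values are hypothetical, and the code uses Python 3 syntax:

```python
import numpy as np

# Hypothetical layout: one time stamp per cast and three temperatures per cast.
dtype_profile = [('time', 'f8'), ('temp', 'f4', (3,))]
casts = np.zeros(2, dtype=dtype_profile)

casts['temp'][0] = [20.5, 18.0, 12.5]    # fill the sub-array of the first record

print(casts['time'].shape)               # scalar field: same shape as the array
print(casts['temp'].shape)               # sub-array field: array shape + field shape
print(casts)
```

The field's shape is appended to the array's shape, so a (3,) field in a length-2 array comes back as a (2, 3) array.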

In [34]:

gridded_data['temp'] = np.random.randint(20,30,(4,5))
gridded_data['sal'] = np.random.randint(30,35,(4,5))
lon = np.linspace(10,30,5) #longitude varies along the columns
lat = np.linspace(0,20,4) #latitude varies along the rows
lon,lat = np.meshgrid(lon,lat) #both outputs have shape (4,5)
gridded_data['lon'] = np.round(lon) 
gridded_data['lat'] = np.round(lat)
print gridded_data

[[(24.0, 30.0, 10.0, 0.0) (29.0, 34.0, 15.0, 0.0) (22.0, 33.0, 20.0, 0.0)
  (28.0, 34.0, 25.0, 0.0) (22.0, 33.0, 30.0, 0.0)]
 [(29.0, 31.0, 10.0, 7.0) (27.0, 30.0, 15.0, 7.0) (20.0, 30.0, 20.0, 7.0)
  (28.0, 31.0, 25.0, 7.0) (27.0, 31.0, 30.0, 7.0)]
 [(20.0, 33.0, 10.0, 13.0) (24.0, 32.0, 15.0, 13.0)
  (24.0, 33.0, 20.0, 13.0) (20.0, 34.0, 25.0, 13.0)
  (26.0, 33.0, 30.0, 13.0)]
 [(20.0, 33.0, 10.0, 20.0) (29.0, 30.0, 15.0, 20.0)
  (27.0, 31.0, 20.0, 20.0) (24.0, 31.0, 25.0, 20.0)
  (23.0, 30.0, 30.0, 20.0)]]

You can access a structured array either by using field names, indices or both:

In [35]:

print gridded_data['sal'] 

[[ 30.  34.  33.  34.  33.]
 [ 31.  30.  30.  31.  31.]
 [ 33.  32.  33.  34.  33.]
 [ 33.  30.  31.  31.  30.]]

In [36]:

print gridded_data['lon']

[[ 10.  15.  20.  25.  30.]
 [ 10.  15.  20.  25.  30.]
 [ 10.  15.  20.  25.  30.]
 [ 10.  15.  20.  25.  30.]]

In [37]:

print gridded_data[['temp', 'lat']]

[[(24.0, 0.0) (29.0, 0.0) (22.0, 0.0) (28.0, 0.0) (22.0, 0.0)]
 [(29.0, 7.0) (27.0, 7.0) (20.0, 7.0) (28.0, 7.0) (27.0, 7.0)]
 [(20.0, 13.0) (24.0, 13.0) (24.0, 13.0) (20.0, 13.0) (26.0, 13.0)]
 [(20.0, 20.0) (29.0, 20.0) (27.0, 20.0) (24.0, 20.0) (23.0, 20.0)]]
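To show the "both" case, here is a sketch on a small made-up 2x3 grid (Python 3 syntax; the array and variable names are mine). The field name and the indices can be applied in either order, and a field can also be used to build a boolean mask for another field:

```python
import numpy as np

dtype_list = [('temp', 'f4'), ('sal', 'f4')]
grid = np.zeros((2, 3), dtype=dtype_list)
grid['temp'] = [[24, 29, 22], [29, 27, 20]]
grid['sal'] = [[30, 34, 33], [31, 30, 30]]

one_value = grid['temp'][0, 1]       # field first, then indices
same_value = grid[0, 1]['temp']      # indices first, then field
first_row = grid[0]                  # a 1-D structured array with both fields

# Salinity at the grid points where temperature exceeds 25.
warm_sal = grid['sal'][grid['temp'] > 25]

print(one_value, same_value)
print(first_row)
print(warm_sal)
```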

Also note that the array, and therefore each field, has the shape of the grid:

In [38]:

gridded_data.shape

Out[38]:

(4, 5)

In [39]:

np.mean(gridded_data['temp'], axis=0)

Out[39]:

array([ 23.25,  27.25,  23.25,  25.  ,  24.5 ], dtype=float32)

Miscellany

Datetime objects

The datetime module provides functions and classes for working with dates and times. There is the standard, built-in Python version and a NumPy version called datetime64.
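A minimal sketch of both flavors, in Python 3 syntax. The date is borrowed from the mooring example earlier; the time of day, the offsets and the variable names are arbitrary choices of mine:

```python
import datetime
import numpy as np

# Standard library: datetime objects and timedelta arithmetic.
start = datetime.datetime(2010, 9, 13, 6, 30)
later = start + datetime.timedelta(days=1, hours=12)
elapsed_hours = (later - start).total_seconds() / 3600.0

# NumPy: datetime64 scalars and arrays with a fixed time unit (days here).
np_start = np.datetime64('2010-09-13')
np_next = np_start + np.timedelta64(1, 'D')
np_days = np.arange('2010-09-13', '2010-09-16', dtype='datetime64[D]')

print(later, elapsed_hours)
print(np_next)
print(np_days)
```

The NumPy version is the more convenient one when you need a whole array of time stamps, since np.arange can generate evenly spaced dates directly.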
