Nikolay Koldunov
koldunovn@gmail.com
import numpy as np
%matplotlib inline
np.set_printoptions(precision=3, suppress=True)  # this is just to make the output look better
We are going to work with data from the GHCN-Daily (Global Historical Climatology Network Daily) data set.
A convenient way to select data from it is to use the KNMI Climatological Service.
Load the data into a variable (Delhi daily air temperatures):
ls
05_numpy.ipynb anatomyarray.png DelhiTmax.txt temp_only_values.csv
temp = np.loadtxt('DelhiTmax.txt')
We load the data into a special variable called a NumPy array. This is a homogeneous multidimensional array: a table of elements (usually numbers), all of the same type. NumPy arrays are the basic building blocks of almost all Python-based scientific software.
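As a minimal sketch of what `np.loadtxt` returns, the snippet below parses a few whitespace-separated rows from an in-memory string instead of the real file. The three sample rows only mimic the layout of `DelhiTmax.txt` (year, month, day, temperature); the name `sample_text` is made up for this illustration.

```python
import io
import numpy as np

# Hypothetical sample mimicking the file layout: year, month, day, temperature
sample_text = """1944 1 1 22.2
1944 1 3 23.9
1944 1 4 22.2"""

# np.loadtxt accepts a file name or any file-like object
sample = np.loadtxt(io.StringIO(sample_text))

print(type(sample))   # <class 'numpy.ndarray'>
print(sample.shape)   # (3, 4)
print(sample.dtype)   # float64: the integer columns are stored as floats too
```

Because the array is homogeneous, the year, month and day columns end up as floats alongside the temperatures.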
type(temp)
numpy.ndarray
temp
array([[ 1944. , 1. , 1. , 22.2], [ 1944. , 1. , 3. , 23.9], [ 1944. , 1. , 4. , 22.2], ..., [ 2015. , 2. , 26. , 27.6], [ 2015. , 2. , 27. , 27.6], [ 2015. , 2. , 28. , 29.2]])
The shape of the array can be viewed as the size of the table that contains the data:
temp.shape
(15436, 4)
However, these tables can have three or more dimensions.
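For instance, a three-dimensional array could hold one small table per year. A short sketch with made-up dimensions (2 years, 3 days, 4 variables), not taken from the Delhi data:

```python
import numpy as np

# Hypothetical cube: 2 "years" x 3 "days" x 4 "variables", filled with zeros
cube = np.zeros((2, 3, 4))

print(cube.ndim)      # 3
print(cube.shape)     # (2, 3, 4)
print(cube[0].shape)  # (3, 4): fixing the first index gives a 2-D table
```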
NumPy stores arrays in row-major order by default. Matlab and Fortran use column-major order for arrays.
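The difference between the two orders can be seen by flattening a small array both ways. This is only an illustrative sketch on a made-up 2 x 3 table:

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Row-major (C) order: read along each row first
print(table.ravel(order='C'))   # [1 2 3 4 5 6]

# Column-major (Fortran) order: read down each column first
print(table.ravel(order='F'))   # [1 4 2 5 3 6]

# A freshly created array is laid out in C order in memory
print(table.flags['C_CONTIGUOUS'])  # True
```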
NumPy arrays are statically typed, which allows faster operations:
temp.dtype
dtype('float64')
You can't assign a value to an element of a NumPy array if it cannot be converted to the array's type:
temp[0,0] = 'Year'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-140f5f32ef74> in <module>()
----> 1 temp[0,0] = 'Year'

ValueError: could not convert string to float: Year
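Values that can be converted are cast to the array's type silently, which may be surprising. The sketch below uses small made-up arrays (`values` and `counts` are illustrative names) to show both the silent cast and the error:

```python
import numpy as np

values = np.array([22.2, 23.9, 22.2])   # float64 array
values[0] = 25                           # the integer is stored as 25.0
print(values[0])                         # 25.0

counts = np.array([1, 2, 3])             # integer array
counts[0] = 2.7                          # the fractional part is dropped
print(counts[0])                         # 2

try:
    values[1] = 'Year'                   # cannot be converted to a float
except ValueError as err:
    print('ValueError:', err)
```

The failed assignment leaves the array unchanged, so `values[1]` is still 23.9 afterwards.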
Slicing works similarly to Matlab:
temp[0:5,:]
array([[ 1944. , 1. , 1. , 22.2], [ 1944. , 1. , 3. , 23.9], [ 1944. , 1. , 4. , 22.2], [ 1944. , 1. , 6. , 22.8], [ 1944. , 1. , 7. , 22.2]])
temp[-5:-1,:]
array([[ 2015. , 2. , 24. , 28.6], [ 2015. , 2. , 25. , 28.6], [ 2015. , 2. , 26. , 27.6], [ 2015. , 2. , 27. , 27.6]])
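Note that the slice `-5:-1` above returns four rows, not five, because the stop index is exclusive. A small sketch on a made-up one-dimensional array shows the difference:

```python
import numpy as np

numbers = np.arange(10)   # 0, 1, ..., 9

print(numbers[-5:-1])     # [5 6 7 8]: the stop index -1 excludes the last element
print(numbers[-5:])       # [5 6 7 8 9]: omit the stop index to reach the end
print(numbers[::2])       # [0 2 4 6 8]: every second element
```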
We can look at the data using the matplotlib module:
import matplotlib.pyplot as plt
plt.plot(temp[:,3])
[<matplotlib.lines.Line2D at 0x7ff67513a590>]
In general, indexing is similar to Matlab.
First 12 elements of the second column (months). Remember that indexing starts at 0:
temp[0:12,1]
array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
First 10 rows:
temp[:10,:]
array([[ 1944. , 1. , 1. , 22.2], [ 1944. , 1. , 3. , 23.9], [ 1944. , 1. , 4. , 22.2], [ 1944. , 1. , 6. , 22.8], [ 1944. , 1. , 7. , 22.2], [ 1944. , 1. , 8. , 16.7], [ 1944. , 1. , 9. , 20.6], [ 1944. , 1. , 11. , 21.1], [ 1944. , 1. , 15. , 20. ], [ 1944. , 1. , 19. , 24.4]])
We can create a mask that selects all rows where the value in the third column (days) equals 10:
mask = (temp[:,2]==10)
Here we apply this mask and show only the first 20 rows of the result:
temp[mask][:20,:]
array([[ 1944. , 3. , 10. , 31.7], [ 1944. , 4. , 10. , 35.6], [ 1944. , 6. , 10. , 42.2], [ 1944. , 7. , 10. , 30.6], [ 1944. , 10. , 10. , 30.6], [ 1944. , 11. , 10. , 31.7], [ 1944. , 12. , 10. , 26.1], [ 1945. , 2. , 10. , 22.8], [ 1945. , 3. , 10. , 29.4], [ 1945. , 4. , 10. , 32.2], [ 1945. , 6. , 10. , 45. ], [ 1945. , 7. , 10. , 38.3], [ 1945. , 8. , 10. , 40. ], [ 1945. , 10. , 10. , 33.9], [ 1945. , 11. , 10. , 30. ], [ 1946. , 1. , 10. , 21.7], [ 1946. , 2. , 10. , 27.8], [ 1957. , 1. , 10. , 15. ], [ 1957. , 8. , 10. , 36.1], [ 1957. , 9. , 10. , 33.9]])
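A mask is itself a NumPy array of booleans, one per row. The sketch below uses a three-row sample in the same layout (the name `sample` is hypothetical) to show what the mask looks like and how it can be used to count matches:

```python
import numpy as np

# Hypothetical rows: year, month, day, temperature
sample = np.array([[1944., 3., 10., 31.7],
                   [1944., 3., 11., 32.2],
                   [1944., 4., 10., 35.6]])

mask = sample[:, 2] == 10

print(mask)                # [ True False  True]
print(mask.dtype)          # bool
print(mask.sum())          # 2: True counts as 1, so this is the number of matches
print(sample[mask].shape)  # (2, 4): only the matching rows are kept
```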
You don't have to create a separate variable for the mask; you can apply it directly. Here, instead of the first rows, I show the last five rows:
temp[temp[:,2]==10][-5:,:]
array([[ 2014. , 10. , 10. , 34.6], [ 2014. , 11. , 10. , 29.5], [ 2014. , 12. , 10. , 28.9], [ 2015. , 1. , 10. , 18.3], [ 2015. , 2. , 10. , 24.3]])
You can combine conditions. In this case we select days from 10 to 12 (only the first 10 rows are shown):
temp[(temp[:,2]>=10)&(temp[:,2]<=12)][0:10,:]
array([[ 1944. , 1. , 11. , 21.1], [ 1944. , 2. , 12. , 23.9], [ 1944. , 3. , 10. , 31.7], [ 1944. , 3. , 11. , 32.2], [ 1944. , 3. , 12. , 30.6], [ 1944. , 4. , 10. , 35.6], [ 1944. , 4. , 12. , 36.1], [ 1944. , 5. , 11. , 41.7], [ 1944. , 5. , 12. , 40.6], [ 1944. , 6. , 10. , 42.2]])
Select only summer months
Select only first half of the year
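One possible way to approach the two exercises above, sketched on a small made-up sample rather than the full data set. Treating June to August as summer is an assumption (the meteorological definition for the Northern Hemisphere); adjust the months if you define the season differently.

```python
import numpy as np

# Hypothetical rows: year, month, day, temperature
sample = np.array([[1944., 1., 11., 21.1],
                   [1944., 4., 10., 35.6],
                   [1944., 6., 10., 42.2],
                   [1944., 7., 10., 30.6],
                   [1944., 10., 10., 30.6]])

month = sample[:, 1]

# Summer, assumed here to be June-August.
# Each condition needs its own parentheses when combined with &.
summer = sample[(month >= 6) & (month <= 8)]

# First half of the year: January-June
first_half = sample[month <= 6]

print(summer[:, 1])      # [6. 7.]
print(first_half[:, 1])  # [1. 4. 6.]
```

On the real data the same expressions should work with `temp` in place of `sample`.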
Create an example array from the first 12 values of the third column (days) and perform some basic operations:
days = temp[0:12,2]
days
array([ 1., 3., 4., 6., 7., 8., 9., 11., 15., 19., 20., 21.])
days+10
array([ 11., 13., 14., 16., 17., 18., 19., 21., 25., 29., 30., 31.])
days*20
array([ 20., 60., 80., 120., 140., 160., 180., 220., 300., 380., 400., 420.])
days*days
array([ 1., 9., 16., 36., 49., 64., 81., 121., 225., 361., 400., 441.])
np.sin(days)
array([ 0.841, 0.141, -0.757, -0.279, 0.657, 0.989, 0.412, -1. , 0.65 , 0.15 , 0.913, 0.837])
Create a new array that will contain only temperatures
Convert all temperatures to degrees Fahrenheit
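A possible sketch of the conversion, using a short made-up array of Celsius values (`temp_c` is a hypothetical name). On the real data, the temperature column `temp[:, 3]` would take its place.

```python
import numpy as np

# Hypothetical sample of temperatures in degrees Celsius
temp_c = np.array([22.2, 23.9, 0.0, 100.0])

# F = C * 9/5 + 32, applied to every element at once
temp_f = temp_c * 9 / 5 + 32

print(temp_f)
```

The formula is written once and NumPy applies it element by element, so no explicit loop is needed.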
Create temp_values, which will contain only the data values:
temp_values = temp[:,3]
temp_values
array([ 22.2, 23.9, 22.2, ..., 27.6, 27.6, 29.2])
Simple statistics:
temp_values.min()
9.8000000000000007
temp_values.max()
47.899999999999999
temp_values.mean()
31.402131381186834
temp_values.std()
6.7398926653583944
temp_values.sum()
484723.29999999999
You can also use the np.sum function:
np.sum(temp_values)
484723.29999999999
One can also perform operations on subsets:
Calculate the mean of the first 1000 temperature values
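A minimal sketch of a statistic on a subset, using a made-up stand-in for `temp_values` so that the result is easy to check by hand:

```python
import numpy as np

# Hypothetical stand-in for temp_values: 0.0, 1.0, ..., 1999.0
values = np.arange(2000, dtype=float)

# Slice first, then call the method on the result
first_1000 = values[:1000]

print(first_1000.size)    # 1000
print(first_1000.mean())  # 499.5
print(values.mean())      # 999.5
```

With the real data the equivalent expression would be `temp_values[:1000].mean()`.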
You can save your data as a text file:
np.savetxt('temp_only_values.csv', temp[:, 3], fmt='%.4f')
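To check that nothing is lost, the file can be read back with `np.loadtxt`. The sketch below does a round trip on a short made-up array and writes to a temporary directory, so it does not touch `temp_only_values.csv`:

```python
import os
import tempfile
import numpy as np

values = np.array([22.2, 23.9, 22.2])

# Write to a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'values.csv')
    np.savetxt(path, values, fmt='%.4f')

    with open(path) as f:
        print(f.read())          # one value per line, four decimal places

    restored = np.loadtxt(path)

print(np.allclose(values, restored))  # True
```

Keep in mind that `fmt='%.4f'` rounds to four decimal places, so values with more precision would not be restored exactly.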