# special IPython command to prepare the notebook for matplotlib
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')  # replaces pd.options.display.mpl_style = 'default', which newer pandas versions no longer provide
Previously discussed:
NumPy and SciPy are modules in Python for scientific computing. NumPy lets you do fast, vectorized operations on arrays. Why use this module? It contains the array object ndarray, which allows you to perform vectorized operations on whole arrays, and it supports mathematical functions such as linear algebra and random number generation.
# Import NumPy
import numpy as np
To create a fast, multidimensional ndarray object, call the np.array() function on a Python list or tuple, or read the data in from a file.
x = np.array([1,2,3,4])
y = np.array([[1,2], [3,4]])
x
array([1, 2, 3, 4])
y
array([[1, 2], [3, 4]])
type(x)
numpy.ndarray
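As a minimal sketch of what "vectorized" means here, the snippet below doubles every element once with a plain Python loop and once with a single array expression. The names values, doubled_loop, doubled_vec and from_tuple are illustrative and do not come from the original notebook.

```python
import numpy as np

values = [1, 2, 3, 4]

# Plain Python: visit the list element by element
doubled_loop = [2 * v for v in values]

# NumPy: one expression applied to the whole array at once
arr = np.array(values)       # built from a list
doubled_vec = 2 * arr

# np.array() also accepts a tuple
from_tuple = np.array((1.5, 2.5, 3.5))

print(doubled_loop)          # [2, 4, 6, 8]
print(doubled_vec.tolist())  # [2, 4, 6, 8]
print(from_tuple.dtype)      # float64
```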
The ndarray object has a set of properties, such as its dimensions and its size.
Property | Description |
---|---|
y.shape (or np.shape(y)) | Shape or dimension of the array |
y.size (or np.size(y)) | Number of elements in the array |
y.ndim | Number of dimensions |
x.shape
(4,)
y.shape
(2, 2)
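A short sketch of the three properties from the table, applied to the same 2 x 2 array y used above. The function forms np.shape() and np.size() are shown next to the attribute forms to illustrate that they agree.

```python
import numpy as np

y = np.array([[1, 2], [3, 4]])

print(y.shape, np.shape(y))  # (2, 2) (2, 2)
print(y.size, np.size(y))    # 4 4
print(y.ndim)                # 2
```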
Function | Description |
---|---|
np.arange(start,stop,step) | Create evenly spaced values from start up to, but not including, stop, in steps of step |
np.linspace(start,stop,num) | Create a range between start and stop (both ends included) of length num |
np.logspace(start,stop,num,base) | Create num values evenly spaced on a log scale, from base**start to base**stop |
np.eye(n) | Generate an n x n identity matrix |
np.arange(0, 21, 2)
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
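The four constructors from the table can be compared side by side. This is only a sketch with arbitrarily chosen arguments; the variable names a, b, c and d are illustrative.

```python
import numpy as np

a = np.arange(0, 1, 0.25)          # 0, 0.25, 0.5, 0.75 (the stop value 1 is excluded)
b = np.linspace(0, 1, 5)           # 0, 0.25, 0.5, 0.75, 1 (both endpoints included)
c = np.logspace(0, 3, 4, base=10)  # 10**0, 10**1, 10**2, 10**3
d = np.eye(3)                      # 3 x 3 identity matrix

print(a)
print(b)
print(c)
print(d)
```

Note that np.arange() stops before the stop value, while np.linspace() includes it, so the two give arrays of different length for the same start and stop.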
# Try it: Create a numpy array from 0 to 20 in steps of size 2
# Try it: Create a numpy array from -10 to 10 in steps of 0.5 (INCLUDING the number 10)
# Try it: Create a numpy array from 100 to 1000 of length 10
In addition, the numpy.random module can be used to create arrays of randomly generated numbers.
from numpy import random
Function | Description |
---|---|
np.random.randint(a, b, N) | Generate N random integers from a (inclusive) up to b (exclusive) |
np.random.rand(n, m) | Generate uniform random numbers in [0,1) of dim n x m |
np.random.randn(n, m) | Generate standard normal random numbers of dim n x m |
np.random.randint(1, 100, 50)
array([87, 29, 97, 35, 66, 84, 50, 93, 1, 10, 83, 56, 20, 49, 41, 58, 43, 60, 46, 98, 47, 91, 73, 73, 90, 26, 18, 3, 62, 65, 27, 58, 19, 49, 13, 5, 14, 16, 48, 38, 90, 19, 85, 61, 36, 38, 64, 6, 9, 97])
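Random draws change from run to run, so a reproducible example needs a fixed seed. The sketch below assumes the legacy np.random.seed() interface, which matches the np.random functions used in this section; the variable names are illustrative.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same numbers

ints = np.random.randint(1, 100, 50)  # 50 integers with 1 <= value < 100
unif = np.random.rand(2, 3)           # 2 x 3 uniform samples in [0, 1)
norm_samples = np.random.randn(2, 3)  # 2 x 3 standard normal samples

print(ints.min() >= 1, ints.max() < 100)  # True True
print(unif.shape, norm_samples.shape)     # (2, 3) (2, 3)
```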
# Try it: Create a numpy array filled with random samples
# from a normal distribution of size 4 x 4
To reshape an array, use reshape():
z = np.random.rand(4,4)
z
array([[ 0.34961451, 0.75618943, 0.85774252, 0.29423465], [ 0.72196235, 0.02541357, 0.7708488 , 0.07240782], [ 0.54376752, 0.41193452, 0.40132359, 0.63399867], [ 0.12622657, 0.34662246, 0.27813886, 0.95162428]])
z.shape
(4, 4)
z.reshape((8,2)) # dim is now 8 x 2
array([[ 0.34961451, 0.75618943], [ 0.85774252, 0.29423465], [ 0.72196235, 0.02541357], [ 0.7708488 , 0.07240782], [ 0.54376752, 0.41193452], [ 0.40132359, 0.63399867], [ 0.12622657, 0.34662246], [ 0.27813886, 0.95162428]])
To flatten an array (convert a higher dimensional array into a vector), use flatten()
z.flatten()
array([ 0.34961451, 0.75618943, 0.85774252, 0.29423465, 0.72196235, 0.02541357, 0.7708488 , 0.07240782, 0.54376752, 0.41193452, 0.40132359, 0.63399867, 0.12622657, 0.34662246, 0.27813886, 0.95162428])
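A small sketch of reshape() and flatten() on an integer array, where the layout is easier to follow than with random numbers. It also illustrates two details not spelled out above: passing -1 lets NumPy infer one dimension, and flatten() returns a copy, so editing the result leaves the original array unchanged.

```python
import numpy as np

z = np.arange(16).reshape((4, 4))

r = z.reshape((8, 2))   # the same 16 values laid out as 8 x 2
w = z.reshape((2, -1))  # -1 asks NumPy to infer the missing dimension (8 here)
f = z.flatten()         # 1-D copy of the data

f[0] = 99               # changing the copy does not touch z

print(r.shape, w.shape, f.shape)  # (8, 2) (2, 8) (16,)
print(z[0, 0])                    # 0
```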
y = np.array([[1,2], [3,4]])
y.shape
(2, 2)
y[0,0] = 10
y
array([[10, 2], [ 3, 4]])
To extract elements of NumPy arrays, use the bracket operator and the slice (i.e. colon) operator. To slice specific elements in the array, use dat[lower:upper:step]. To extract the diagonal (and subdiagonal) elements, use diag().
# random samples from a uniform distribution between 0 and 1
dat = np.random.rand(4,4)
dat
array([[ 0.60679169, 0.36100824, 0.18275644, 0.56561955], [ 0.36584042, 0.12087577, 0.14576369, 0.21879333], [ 0.27301492, 0.64171746, 0.62002836, 0.83744579], [ 0.30159074, 0.71813527, 0.94443425, 0.19098029]])
dat[0, :] # row 1
array([ 0.60679169, 0.36100824, 0.18275644, 0.56561955])
dat[:, 0] # column 1
array([ 0.60679169, 0.36584042, 0.27301492, 0.30159074])
dat[0:3:2, 0] # first and third elements in column 1
array([ 0.60679169, 0.27301492])
np.diag(dat) # diagonal
array([ 0.60679169, 0.12087577, 0.62002836, 0.19098029])
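The same indexing operations are easier to verify on an array with known contents. The sketch below uses a 4 x 4 array of the integers 0 to 15 (an illustrative choice) and adds the k argument of np.diag(), which selects a sub- or superdiagonal.

```python
import numpy as np

dat = np.arange(16).reshape((4, 4))  # rows: 0-3, 4-7, 8-11, 12-15

row0 = dat[0, :]            # first row: 0, 1, 2, 3
col0 = dat[:, 0]            # first column: 0, 4, 8, 12
every_other = dat[0:3:2, 0] # rows 0 and 2 of the first column: 0, 8
main_diag = np.diag(dat)    # main diagonal: 0, 5, 10, 15
sub_diag = np.diag(dat, k=-1)  # first subdiagonal: 4, 9, 14

print(row0, col0, every_other, main_diag, sub_diag)
```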
np.arange(32).reshape((8, 4)) # returns an 8 x 4 array
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23], [24, 25, 26, 27], [28, 29, 30, 31]])
x[0] # returns the first element of the 1-D array x
1
There are many vectorized wrappers around functions that take one or more scalar values and produce one or more scalar values (e.g. np.exp(), np.sqrt()). These element-wise array functions are also known as NumPy ufuncs.
Function | Description |
---|---|
np.abs(x) | absolute value of each element |
np.sqrt(x) | square root of each element |
np.square(x) | square of each element |
np.exp(x) | exponential of each element |
np.maximum(x, y) | element-wise maximum from two arrays x and y |
np.minimum(x, y) | element-wise minimum from two arrays x and y |
np.sign(x) | compute the sign of each element: 1 (pos), 0 (zero), -1 (neg) |
np.subtract(x, y) | subtract elements in y from elements in x |
np.power(x, y) | raise elements in first array x to powers in second array y |
np.where(cond, x, y) | element-wise if-else: take the element from x where cond is true, otherwise from y |
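A few of the functions in the table, applied to two short arrays so the element-wise behaviour can be checked by hand. The input values are arbitrary and chosen only for illustration.

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
y = np.array([1.0, 1.0, 1.0])

abs_x = np.abs(x)                  # 2, 0, 3
sign_x = np.sign(x)                # -1, 0, 1
larger = np.maximum(x, y)          # 1, 1, 3
diff = np.subtract(x, y)           # -3, -1, 2
picked = np.where(x >= 0, x, y)    # x where x >= 0, otherwise y: 1, 0, 3

print(abs_x, sign_x, larger, diff, picked)
```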
It is important to state again that you should avoid looping through elements in vectors if at all possible. One way to get around that when writing functions is to use what are called vectorized functions. Say you wrote a function f which accepts some input x and checks if x is bigger or smaller than 0.
def f(x):
    if x >= 0:
        return True
    else:
        return False
print(f(3))
True
If we give the function an array instead of just one value (e.g. 3), then Python will raise a ValueError, because the if statement needs a single True or False and the truth value of an array with more than one element is ambiguous. The way to get around this is to vectorize the function.
f_vec = np.vectorize(f)
z = np.arange(-5, 6)
z
array([-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5])
f_vec(z)
array([False, False, False, False, False, True, True, True, True, True, True], dtype=bool)
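The failure and the fix can be reproduced in one self-contained sketch. It redefines the scalar version of f, catches the ValueError raised when f receives an array, and then applies np.vectorize(). Keep in mind that the NumPy documentation describes np.vectorize() as a convenience whose implementation is essentially a for loop, so it makes the call work but does not by itself make it fast.

```python
import numpy as np

def f(x):
    # scalar-only version: `if` needs a single True/False value
    if x >= 0:
        return True
    else:
        return False

z = np.arange(-5, 6)

try:
    f(z)                      # an array with 11 elements has no single truth value
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

f_vec = np.vectorize(f)       # wraps f so it is called once per element
result = f_vec(z)
print(result.sum())           # 6 of the 11 values are >= 0
```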
Instead of vectorizing the function, you can also make the function itself aware that it will be accepting vectors from the beginning.
def f(x):
    return (x >= 0)
print(f(3))
True
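A brief sketch of why the comparison-based version works on arrays: x >= 0 is evaluated element-wise, so the function returns a boolean array of the same shape as its input. The last line uses that array as a mask to select elements, which goes slightly beyond the text above and is included only as an illustration.

```python
import numpy as np

def f(x):
    # the comparison is applied element-wise when x is an array
    return x >= 0

z = np.arange(-5, 6)
mask = f(z)

print(f(3))      # True
print(mask)      # boolean array, one entry per element of z
print(z[mask])   # the non-negative entries, 0 through 5
```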
Now that you know a little bit about NumPy, let's turn to SciPy. SciPy is a collection of mathematical and scientific modules built on top of NumPy. For example, SciPy can handle multidimensional arrays, integration, linear algebra, statistics and optimization.
# Import SciPy
import scipy
Older SciPy releases re-exported most of NumPy at the top level, but that practice has been deprecated, so import NumPy explicitly when you need NumPy functions. The main SciPy module is made up of many submodules containing specialized topics.
Favorite SciPy submodules | What does it contain? |
---|---|
scipy.stats | statistics: random variables, probability density functions, cumulative distribution functions, survival functions |
scipy.integrate | integration: single, double, triple integration, trapezoidal rule, Simpson's rule, differential equation solvers |
scipy.signal | signal processing: tools such as wavelets, spectral densities, filters, B-splines |
scipy.optimize | optimization: find roots, curve fitting, least squares, etc. |
scipy.special | special functions: very specialized functions in mathematical physics, e.g. Bessel, gamma |
scipy.linalg | linear algebra: inverse of a matrix, determinant, Kronecker product, eigenvalue decomposition, SVD, functions for matrices (beyond those in numpy.linalg) |
If you want to import a SciPy submodule (e.g. the statistics submodule scipy.stats), use
from scipy import stats
Let's dive a bit deeper into scipy.stats. The real utility of this submodule is to access probability distribution functions (pdfs) and standard statistical tests (e.g. the $t$-test).
There is a large collection of continuous and discrete pdfs in the scipy.stats submodule. The syntax to simulate random variables from a specific pdf is the name of the distribution followed by .rvs. To generate $n$=1000 $N(0,1)$ random variables,
from scipy.stats import norm
x = norm.rvs(loc = 0, scale = 1, size = 1000)
plt.hist(x)
plt.title('Histogram of 1000 normal random variables')
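Besides .rvs, each distribution object exposes methods such as .pdf and .cdf, and scipy.stats provides test functions such as ttest_1samp. The sketch below is illustrative only: the seed, the sample size and the null hypothesis of a zero mean are assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import norm, ttest_1samp

density_at_zero = norm.pdf(0)   # density of N(0, 1) at 0, about 0.3989
prob_below = norm.cdf(1.96)     # P(X <= 1.96), about 0.975

np.random.seed(0)               # fixed seed so the sample is reproducible
sample = norm.rvs(loc=0, scale=1, size=1000)

# one-sample t-test of the null hypothesis that the population mean is 0
t_stat, p_value = ttest_1samp(sample, 0)

print(density_at_zero, prob_below)
print(t_stat, p_value)
```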
The mtcars data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). This dataset is available on GitHub in the 2014_data repository and is called mtcars.csv.
This is a .csv file, so we will use the function read_csv(), which reads a CSV file into a pandas DataFrame.
url = 'https://raw.githubusercontent.com/cs109/2014_data/master/mtcars.csv'
mtcars = pd.read_csv(url, sep = ',', index_col=0)
mtcars.head()
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
# DataFrame with 32 observations on 11 variables
mtcars.shape
(32, 11)
# return the column names
mtcars.columns
Index([u'mpg', u'cyl', u'disp', u'hp', u'drat', u'wt', u'qsec', u'vs', u'am', u'gear', u'carb'], dtype='object')
Here is a table containing a description of all the column names.
Column name | Description |
---|---|
mpg | Miles/(US) gallon |
cyl | Number of cylinders |
disp | Displacement (cu.in.) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (lb/1000) |
qsec | 1/4 mile time (seconds) |
vs | Engine shape (0 = V-shaped, 1 = straight) |
am | Transmission (0 = automatic, 1 = manual) |
gear | Number of forward gears |
carb | Number of carburetors |
# return the actual data inside the pandas data frame
mtcars.values
array([[ 21. , 6. , 160. , 110. , 3.9 , 2.62 , 16.46 , 0. , 1. , 4. , 4. ], [ 21. , 6. , 160. , 110. , 3.9 , 2.875, 17.02 , 0. , 1. , 4. , 4. ], [ 22.8 , 4. , 108. , 93. , 3.85 , 2.32 , 18.61 , 1. , 1. , 4. , 1. ], [ 21.4 , 6. , 258. , 110. , 3.08 , 3.215, 19.44 , 1. , 0. , 3. , 1. ], [ 18.7 , 8. , 360. , 175. , 3.15 , 3.44 , 17.02 , 0. , 0. , 3. , 2. ], [ 18.1 , 6. , 225. , 105. , 2.76 , 3.46 , 20.22 , 1. , 0. , 3. , 1. ], [ 14.3 , 8. , 360. , 245. , 3.21 , 3.57 , 15.84 , 0. , 0. , 3. , 4. ], [ 24.4 , 4. , 146.7 , 62. , 3.69 , 3.19 , 20. , 1. , 0. , 4. , 2. ], [ 22.8 , 4. , 140.8 , 95. , 3.92 , 3.15 , 22.9 , 1. , 0. , 4. , 2. ], [ 19.2 , 6. , 167.6 , 123. , 3.92 , 3.44 , 18.3 , 1. , 0. , 4. , 4. ], [ 17.8 , 6. , 167.6 , 123. , 3.92 , 3.44 , 18.9 , 1. , 0. , 4. , 4. ], [ 16.4 , 8. , 275.8 , 180. , 3.07 , 4.07 , 17.4 , 0. , 0. , 3. , 3. ], [ 17.3 , 8. , 275.8 , 180. , 3.07 , 3.73 , 17.6 , 0. , 0. , 3. , 3. ], [ 15.2 , 8. , 275.8 , 180. , 3.07 , 3.78 , 18. , 0. , 0. , 3. , 3. ], [ 10.4 , 8. , 472. , 205. , 2.93 , 5.25 , 17.98 , 0. , 0. , 3. , 4. ], [ 10.4 , 8. , 460. , 215. , 3. , 5.424, 17.82 , 0. , 0. , 3. , 4. ], [ 14.7 , 8. , 440. , 230. , 3.23 , 5.345, 17.42 , 0. , 0. , 3. , 4. ], [ 32.4 , 4. , 78.7 , 66. , 4.08 , 2.2 , 19.47 , 1. , 1. , 4. , 1. ], [ 30.4 , 4. , 75.7 , 52. , 4.93 , 1.615, 18.52 , 1. , 1. , 4. , 2. ], [ 33.9 , 4. , 71.1 , 65. , 4.22 , 1.835, 19.9 , 1. , 1. , 4. , 1. ], [ 21.5 , 4. , 120.1 , 97. , 3.7 , 2.465, 20.01 , 1. , 0. , 3. , 1. ], [ 15.5 , 8. , 318. , 150. , 2.76 , 3.52 , 16.87 , 0. , 0. , 3. , 2. ], [ 15.2 , 8. , 304. , 150. , 3.15 , 3.435, 17.3 , 0. , 0. , 3. , 2. ], [ 13.3 , 8. , 350. , 245. , 3.73 , 3.84 , 15.41 , 0. , 0. , 3. , 4. ], [ 19.2 , 8. , 400. , 175. , 3.08 , 3.845, 17.05 , 0. , 0. , 3. , 2. ], [ 27.3 , 4. , 79. , 66. , 4.08 , 1.935, 18.9 , 1. , 1. , 4. , 1. ], [ 26. , 4. , 120.3 , 91. , 4.43 , 2.14 , 16.7 , 0. , 1. , 5. , 2. ], [ 30.4 , 4. , 95.1 , 113. , 3.77 , 1.513, 16.9 , 1. , 1. , 5. , 2. ], [ 15.8 , 8. , 351. , 264. 
, 4.22 , 3.17 , 14.5 , 0. , 1. , 5. , 4. ], [ 19.7 , 6. , 145. , 175. , 3.62 , 2.77 , 15.5 , 0. , 1. , 5. , 6. ], [ 15. , 8. , 301. , 335. , 3.54 , 3.57 , 14.6 , 0. , 1. , 5. , 8. ], [ 21.4 , 4. , 121. , 109. , 4.11 , 2.78 , 18.6 , 1. , 1. , 4. , 2. ]])
mtcars[25:] # rows from position 25 (zero-based) to the end of the data frame
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.9 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.5 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.5 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.6 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.6 | 1 | 1 | 4 | 2 |
# return index
mtcars.index
Index([u'Mazda RX4', u'Mazda RX4 Wag', u'Datsun 710', u'Hornet 4 Drive', u'Hornet Sportabout', u'Valiant', u'Duster 360', u'Merc 240D', u'Merc 230', u'Merc 280', u'Merc 280C', u'Merc 450SE', u'Merc 450SL', u'Merc 450SLC', u'Cadillac Fleetwood', u'Lincoln Continental', u'Chrysler Imperial', u'Fiat 128', u'Honda Civic', u'Toyota Corolla', u'Toyota Corona', u'Dodge Challenger', u'AMC Javelin', u'Camaro Z28', u'Pontiac Firebird', u'Fiat X1-9', u'Porsche 914-2', u'Lotus Europa', u'Ford Pantera L', u'Ferrari Dino', u'Maserati Bora', u'Volvo 142E'], dtype='object')
mtcars.loc['Maserati Bora'] # access a row by its index label (.ix has been removed from pandas)
mpg      15.00
cyl       8.00
disp    301.00
hp      335.00
drat      3.54
wt        3.57
qsec     14.60
vs        0.00
am        1.00
gear      5.00
carb      8.00
Name: Maserati Bora, dtype: float64
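A small sketch of label-based versus position-based access with .loc and .iloc. To stay self-contained it does not download the CSV; instead it builds a three-row frame, here called cars (an illustrative name), from values shown in the mtcars tables above.

```python
import pandas as pd

# three rows copied from the mtcars tables, restricted to a few columns
cars = pd.DataFrame(
    {"mpg": [21.0, 22.8, 15.0], "cyl": [6, 4, 8], "hp": [110, 93, 335]},
    index=["Mazda RX4", "Datsun 710", "Maserati Bora"],
)

by_label = cars.loc["Maserati Bora"]       # row selected by index label
by_position = cars.iloc[2]                 # the same row selected by integer position
one_cell = cars.loc["Datsun 710", "mpg"]   # a single cell: 22.8
tail = cars[1:]                            # positional slice: rows 1 to the end

print(by_label)
print(one_cell)
print(tail)
```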
# What other methods are available when working with pandas DataFrames?
# type 'mtcars.' and then click <TAB>
# mtcars.<TAB>
# try it here
Even though they may look like continuous variables, cyl, vs, am, gear and carb are integer or categorical variables. First, let's look at some summary statistics of the mtcars data set.
mtcars.describe()
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.0000 |
mean | 20.090625 | 6.187500 | 230.721875 | 146.687500 | 3.596563 | 3.217250 | 17.848750 | 0.437500 | 0.406250 | 3.687500 | 2.8125 |
std | 6.026948 | 1.785922 | 123.938694 | 68.562868 | 0.534679 | 0.978457 | 1.786943 | 0.504016 | 0.498991 | 0.737804 | 1.6152 |
min | 10.400000 | 4.000000 | 71.100000 | 52.000000 | 2.760000 | 1.513000 | 14.500000 | 0.000000 | 0.000000 | 3.000000 | 1.0000 |
25% | 15.425000 | 4.000000 | 120.825000 | 96.500000 | 3.080000 | 2.581250 | 16.892500 | 0.000000 | 0.000000 | 3.000000 | 2.0000 |
50% | 19.200000 | 6.000000 | 196.300000 | 123.000000 | 3.695000 | 3.325000 | 17.710000 | 0.000000 | 0.000000 | 4.000000 | 2.0000 |
75% | 22.800000 | 8.000000 | 326.000000 | 180.000000 | 3.920000 | 3.610000 | 18.900000 | 1.000000 | 1.000000 | 4.000000 | 4.0000 |
max | 33.900000 | 8.000000 | 472.000000 | 335.000000 | 4.930000 | 5.424000 | 22.900000 | 1.000000 | 1.000000 | 5.000000 | 8.0000 |
To check if any or all elements in an array meet a certain criterion, use any() and all().
(mtcars.mpg >= 20).any()
True
(mtcars > 0).all()
mpg      True
cyl      True
disp     True
hp       True
drat     True
wt       True
qsec     True
vs      False
am      False
gear     True
carb     True
dtype: bool
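The difference between any() and all() can be checked by hand on a few values. The sketch below uses the mpg and am values of three cars from the table above (Mazda RX4, Datsun 710 and Hornet Sportabout) as plain NumPy arrays; the variable names are illustrative.

```python
import numpy as np

mpg = np.array([21.0, 22.8, 18.7])
am = np.array([1, 1, 0])

some_efficient = (mpg >= 20).any()  # True: at least one value is >= 20
all_efficient = (mpg >= 20).all()   # False: 18.7 is below 20
all_manual = (am > 0).all()         # False: one car has am == 0

print(some_efficient, all_efficient, all_manual)
```

This mirrors the output above: a column such as am fails the all() check because it contains zeros.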
Let's look at the distribution of mpg using a histogram.
mtcars['mpg'].hist()
plt.title('Distribution of MPG')
plt.xlabel('Miles Per Gallon')
# Relationship between cyl and mpg
plt.plot(mtcars.cyl, mtcars.mpg, 'o')
plt.xlim(3, 9)
plt.xlabel('Cylinders')
plt.ylabel('MPG')
plt.title('Relationship between cylinders and MPG')
# Relationship between horsepower and mpg
plt.plot(mtcars.hp, mtcars.mpg, 'o')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Relationship between horsepower and MPG')
# scatter_matrix now lives in pandas.plotting (pandas.tools.plotting was removed)
from pandas.plotting import scatter_matrix
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']],
               figsize = (10, 6), alpha = 1, diagonal='kde')
Now we will discuss working on the command line. For this section and the next section on git and GitHub we will use slides from the Data Science Specialization course on Coursera. These slides are available from
Next we introduce git and GitHub. For this section we will also use slides from Data Science Specialization course on Coursera. These slides are available from
Other useful resources for learning Git and GitHub:
Use git clone to clone the CS109 2014 course repository on GitHub
Use git clone to clone the CS109 2014 data repository on GitHub