In [1]:

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import pandas as pd
import matplotlib.pyplot as plt

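# pd.options.display.mpl_style was removed from later pandas releases;
# matplotlib's own style sheets give a similar look
plt.style.use('ggplot')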

Recall from lab last week 09/12/2014¶

Previously discussed:

Reading in a CSV file into a pandas DataFrame
Using histograms, scatterplots and boxplots as exploratory data analysis
Summary statistics
Functions to access a pandas DataFrame
Defining your own functions and using loops

Today, we will discuss the following:¶

Brief introduction to Numpy, Scipy
- Vectorizing functions
More pandas and matplotlib
Working in the command line
Overview of git and Github

Download this notebook from Github

Numpy¶

NumPy and SciPy are modules in Python for scientific computing. NumPy lets you do fast, vectorized operations on arrays. Why use this module?

It gives you the performance of low-level code (e.g. C or Fortran) with the convenience of writing in an interpreted scripting language (all while staying in native Python code).
It gives you a fast, memory-efficient multidimensional array called ndarray, which supports vectorized operations as well as mathematical functions such as linear algebra and random number generation.
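
As a rough illustration of what "vectorized" means, the sketch below computes a sum of squares twice: once with an explicit Python loop and once with a single array expression. The variable names are made up for this example, and the array is kept small so it runs instantly.

```python
import numpy as np

values = np.arange(1, 1001)   # integers 1, 2, ..., 1000

# Pure-Python loop: one interpreted operation per element
loop_total = 0
for v in values:
    loop_total += int(v) ** 2

# Vectorized: the multiplication and the sum both run inside NumPy's compiled code
vec_total = int(np.sum(values * values))

print(loop_total == vec_total)
```

Both approaches give the same number; on large arrays the vectorized form is typically much faster because the per-element work is not done by the Python interpreter.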

In [2]:

# Import NumPy
import numpy as np

To create a fast, multidimensional ndarray object, call the np.array() function on a Python list or tuple, or read the data in from a file.

In [3]:

x = np.array([1,2,3,4])
y = np.array([[1,2], [3,4]])
x

Out[3]:

array([1, 2, 3, 4])

In [4]:

y

Out[4]:

array([[1, 2],
       [3, 4]])

In [5]:

type(x)

Out[5]:

numpy.ndarray

Properties of NumPy arrays¶

An ndarray object has a set of properties that describe it, such as its dimensions and its size.

Property	Description
`y.shape` (or `np.shape(y)`)	Shape of the array: its length along each dimension
`y.size` (or `np.size(y)`)	Number of elements in the array
`y.ndim`	Number of dimensions

In [6]:

x.shape

Out[6]:

(4,)

In [7]:

y.shape

Out[7]:

(2, 2)
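
The remaining properties from the table can be checked the same way. This is a small sketch that rebuilds the two arrays from above so it can be run on its own.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([[1, 2], [3, 4]])

x_ndim = x.ndim            # a vector has one dimension
y_ndim = y.ndim            # a matrix has two
y_size = y.size            # total number of elements (rows times columns)
y_shape = np.shape(y)      # function form, same result as y.shape

print(x_ndim)
print(y_ndim)
print(y_size)
print(y_shape)
```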

Other ways to generate NumPy arrays¶

Function	Description
`np.arange(start,stop,step)`	Create evenly spaced values from start up to (but not including) stop, in increments of step
`np.linspace(start,stop,num)`	Create num evenly spaced values between start and stop (both ends included)
`np.logspace(start, stop,num,base)`	Create num values evenly spaced on a log scale, from base**start to base**stop
`np.eye(n)`	Generate an n x n identity matrix
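
A short sketch of the four functions in the table, with arguments chosen only for illustration. Note how np.arange() leaves out the stop value while np.linspace() includes it.

```python
import numpy as np

a = np.arange(0, 1, 0.25)           # stop is excluded: 0, 0.25, 0.5, 0.75
b = np.linspace(0, 1, 5)            # both ends included: 0, 0.25, 0.5, 0.75, 1
c = np.logspace(0, 3, 4, base=10)   # 10**0, 10**1, 10**2, 10**3
d = np.eye(3)                       # 3 x 3 identity matrix

print(a)
print(b)
print(c)
print(d)
```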

In [8]:

np.arange(0, 21, 2)

Out[8]:

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

In [9]:

# Try it: Create a numpy array from 0 to 20 in steps of size 2

In [10]:

# Try it: Create a numpy array from -10 to 10 in steps of 0.5 (INCLUDING the number 10)

In [11]:

# Try it: Create a numpy array from 100 to 1000 of length 10

In addition, the numpy.random module can be used to create arrays of randomly generated numbers.

In [12]:

from numpy import random

Function	Description
`np.random.randint(a, b, N)`	Generate N random integers from a (included) up to b (excluded)
`np.random.rand(n, m)`	Generate uniform random numbers in [0, 1) of dim n x m
`np.random.randn(n, m)`	Generate standard normal random numbers of dim n x m
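
The sketch below exercises each function in the table. The call to np.random.seed() is an addition for this example: it fixes the generator's starting state so that rerunning the cell gives the same draws.

```python
import numpy as np

np.random.seed(0)   # fix the seed so the draws are reproducible

ints = np.random.randint(1, 7, 10)    # 10 integers from 1 to 6 (7 is excluded)
unif = np.random.rand(2, 3)           # 2 x 3 array of uniform draws in [0, 1)
norm_draws = np.random.randn(1000)    # 1000 standard normal draws

print(ints.min() >= 1 and ints.max() <= 6)
print(unif.shape)
print(abs(norm_draws.mean()) < 0.2)   # sample mean should sit close to 0
```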

In [13]:

np.random.randint(1, 100, 50)

Out[13]:

array([87, 29, 97, 35, 66, 84, 50, 93,  1, 10, 83, 56, 20, 49, 41, 58, 43,
       60, 46, 98, 47, 91, 73, 73, 90, 26, 18,  3, 62, 65, 27, 58, 19, 49,
       13,  5, 14, 16, 48, 38, 90, 19, 85, 61, 36, 38, 64,  6,  9, 97])

In [14]:

# Try it: Create a numpy array filled with random samples 
# from a normal distribution of size 4 x 4

Reshaping, resizing and stacking NumPy arrays¶

To reshape an array, use reshape():

In [15]:

z = np.random.rand(4,4)
z 

Out[15]:

array([[ 0.34961451,  0.75618943,  0.85774252,  0.29423465],
       [ 0.72196235,  0.02541357,  0.7708488 ,  0.07240782],
       [ 0.54376752,  0.41193452,  0.40132359,  0.63399867],
       [ 0.12622657,  0.34662246,  0.27813886,  0.95162428]])

In [16]:

z.shape

Out[16]:

(4, 4)

In [17]:

z.reshape((8,2)) # dim is now 8 x 2

Out[17]:

array([[ 0.34961451,  0.75618943],
       [ 0.85774252,  0.29423465],
       [ 0.72196235,  0.02541357],
       [ 0.7708488 ,  0.07240782],
       [ 0.54376752,  0.41193452],
       [ 0.40132359,  0.63399867],
       [ 0.12622657,  0.34662246],
       [ 0.27813886,  0.95162428]])

To flatten an array (convert a higher dimensional array into a vector), use flatten()

In [18]:

z.flatten()

Out[18]:

array([ 0.34961451,  0.75618943,  0.85774252,  0.29423465,  0.72196235,
        0.02541357,  0.7708488 ,  0.07240782,  0.54376752,  0.41193452,
        0.40132359,  0.63399867,  0.12622657,  0.34662246,  0.27813886,
        0.95162428])
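
Stacking and resizing are not shown in the cells above, so here is a minimal sketch using np.vstack(), np.hstack() and np.resize() on two small arrays invented for the example. It also shows that reshape() accepts -1 for one dimension and infers it from the number of elements.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.vstack((a, b))        # stack vertically   -> shape (4, 2)
cols = np.hstack((a, b))        # stack horizontally -> shape (2, 4)
col_vec = a.reshape((-1, 1))    # -1 lets NumPy infer the row count -> shape (4, 1)
bigger = np.resize(a, (3, 2))   # np.resize repeats the data to fill the new shape

print(rows.shape)
print(cols.shape)
print(col_vec.shape)
print(bigger)
```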

Operating on NumPy arrays¶

Assigning values¶

To assign a value to a specific element in an ndarray, use the assignment operator.

In [19]:

y = np.array([[1,2], [3,4]])
y.shape

Out[19]:

(2, 2)

In [20]:

y[0,0] = 10
y 

Out[20]:

array([[10,  2],
       [ 3,  4]])

Indexing and slicing arrays¶

To extract elements of a NumPy array, use the bracket operator and the slice (i.e. colon) operator. To slice specific elements in the array, use dat[lower:upper:step]. To extract the diagonal (or, with the k argument, a sub- or superdiagonal), use np.diag().

In [21]:

# random samples from a uniform distribution between 0 and 1
dat = np.random.rand(4,4)
dat

Out[21]:

array([[ 0.60679169,  0.36100824,  0.18275644,  0.56561955],
       [ 0.36584042,  0.12087577,  0.14576369,  0.21879333],
       [ 0.27301492,  0.64171746,  0.62002836,  0.83744579],
       [ 0.30159074,  0.71813527,  0.94443425,  0.19098029]])

In [22]:

dat[0, :] # row 1

Out[22]:

array([ 0.60679169,  0.36100824,  0.18275644,  0.56561955])

In [23]:

dat[:, 0] # column 1

Out[23]:

array([ 0.60679169,  0.36584042,  0.27301492,  0.30159074])

In [24]:

dat[0:3:2, 0] # first and third elements in column 1

Out[24]:

array([ 0.60679169,  0.27301492])

In [25]:

np.diag(dat) # diagonal

Out[25]:

array([ 0.60679169,  0.12087577,  0.62002836,  0.19098029])
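
The off-diagonal case and two-dimensional slices are easier to follow on an array with predictable entries. The sketch below uses a 4 x 4 array of the integers 0 to 15 (chosen for this example instead of random numbers) and the k argument of np.diag().

```python
import numpy as np

m = np.arange(16).reshape((4, 4))

main = np.diag(m)          # main diagonal
sub = np.diag(m, k=-1)     # first subdiagonal (just below the main diagonal)
sup = np.diag(m, k=1)      # first superdiagonal (just above it)
block = m[1:3, 1:3]        # rows 1-2 and columns 1-2 (the upper bound is excluded)

print(main)
print(sub)
print(sup)
print(block)
```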

In [26]:

x = np.arange(32).reshape((8, 4)) # returns an 8 x 4 array
x

Out[26]:

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [27]:

x[0] # returns the first row

Out[27]:

array([0, 1, 2, 3])

Element-wise transformations on arrays¶

There are many vectorized wrappers around functions that take one or more scalars and produce one or more scalars (e.g. np.exp(), np.sqrt()). These element-wise array functions are also known as NumPy ufuncs (universal functions).

Function	Description
`np.abs(x)`	absolute value of each element
`np.sqrt(x)`	square root of each element
`np.square(x)`	square of each element
`np.exp(x)`	exponential of each element
`np.maximum(x, y)`	element-wise maximum from two arrays x and y
`np.minimum(x,y)`	element-wise minimum
`np.sign(x)`	compute the sign of each element: 1 (pos), 0 (zero), -1 (neg)
`np.subtract(x, y)`	subtract elements in y from elements in x
`np.power(x, y)`	raise elements in first array x to powers in second array y
`np.where(cond, x, y)`	vectorized if-else: take the element from x where cond is True, otherwise from y
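
A few of the functions in the table, applied to two small arrays made up for this example. Each call works on whole arrays at once, with no explicit loop.

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
y = np.array([1.0, 1.0, 1.0])

abs_x = np.abs(x)                   # absolute value of each element
signs = np.sign(x)                  # -1, 0 or 1 for each element
larger = np.maximum(x, y)           # element-wise maximum of x and y
clipped = np.where(x >= 0, x, 0.0)  # keep non-negative entries, replace the rest with 0

print(abs_x)
print(signs)
print(larger)
print(clipped)
```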

Vectorizing functions¶

It is important to state again that you should avoid looping through elements in vectors if at all possible. One way to get around that when writing functions is to use what are called vectorized functions. Say you wrote a function f which accepts some input x and checks whether x is greater than or equal to 0.

In [28]:

def f(x):
    if x >=0:
        return True
    else:
        return False

print(f(3))

True

If we give the function an array instead of just one value (e.g. 3), then Python will give an error because there is more than one element in x. The way to get around this is to vectorize the function.
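
To see the failure concretely, the sketch below calls the scalar version of f on an array and catches the exception. The try/except wrapper is only there so the cell finishes cleanly; the error comes from the if statement, which needs a single True or False but receives a whole array of them.

```python
import numpy as np

def f(x):
    if x >= 0:
        return True
    else:
        return False

try:
    f(np.array([-1, 0, 1]))
    raised = False
except ValueError as err:
    # NumPy refuses to collapse an array of booleans into one truth value
    raised = True
    print(err)
```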

In [29]:

f_vec = np.vectorize(f)
z = np.arange(-5, 6)
z 

Out[29]:

array([-5, -4, -3, -2, -1,  0,  1,  2,  3,  4,  5])

In [30]:

f_vec(z)

Out[30]:

array([False, False, False, False, False,  True,  True,  True,  True,
        True,  True], dtype=bool)

Instead of vectorizing the function, you can also make the function itself aware that it will be accepting vectors from the beginning.

In [31]:

def f(x):
    return (x >= 0)

print(f(3))

True
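
The array-aware version can be applied to a whole array directly, as this sketch shows. For comparison it also builds an np.vectorize() wrapper around a scalar function (the lambda is invented for this example); the NumPy documentation describes np.vectorize() as a convenience that is essentially a loop, so the comparison-based version is usually the faster of the two.

```python
import numpy as np

def f(x):
    return (x >= 0)

z = np.arange(-5, 6)
result = f(z)   # the comparison is applied element-wise

# Same answer from a scalar function wrapped with np.vectorize
f_loop = np.vectorize(lambda v: True if v >= 0 else False)
same = np.array_equal(result, f_loop(z))

print(result)
print(same)
```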

Scipy¶

Now that you know a little bit about NumPy, let's turn to SciPy, a collection of mathematical and scientific modules built on top of NumPy. For example, SciPy provides routines for integration, linear algebra, statistics and optimization that operate on NumPy's multidimensional arrays.

In [32]:

# Import SciPy
import scipy

SciPy is built on top of NumPy, and older releases re-exported many NumPy functions at the top level, but it is safer to import NumPy directly for array functions. The main SciPy module is made up of many submodules, each covering a specialized topic.

Favorite SciPy submodules	What does it contain?
`scipy.stats`	statistics: random variables, probability density functions, cumulative distribution functions, survival functions
`scipy.integrate`	integration: single, double, triple integration, trapezoidal rule, Simpson's rule, differential equation solvers
`scipy.signal`	signal processing: wavelets, spectral densities, filters, B-splines
`scipy.optimize`	optimization: find roots, curve fitting, least squares, etc
`scipy.special`	special functions: very specialized functions in mathematical physics e.g. Bessel, gamma
`scipy.linalg`	linear algebra: inverse of a matrix, determinant, Kronecker product, eigenvalue decomposition, SVD, functions for matrices (beyond those in `numpy.linalg`)

If you want to import a SciPy submodule (e.g. the statistics submodule scipy.stats), use

In [33]:

from scipy import stats
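
Other submodules are imported the same way. As a small sketch, the cell below uses scipy.integrate.quad() to integrate a function numerically and scipy.optimize.brentq() to find a root; the two functions being integrated and solved are chosen only because their exact answers (1/3 and the square root of 2) are easy to check.

```python
from scipy import integrate, optimize

# Integrate x**2 from 0 to 1; the exact answer is 1/3
area, abs_err = integrate.quad(lambda x: x ** 2, 0, 1)

# Find the root of x**2 - 2 between 0 and 2; the exact answer is sqrt(2)
root = optimize.brentq(lambda x: x ** 2 - 2, 0, 2)

print(area)
print(root)
```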

scipy.stats¶

Let's dive a bit deeper into scipy.stats. The real utility of this submodule is access to probability distributions (with their density and distribution functions) and standard statistical tests (e.g. the $t$-test).

Probability distribution functions¶

There is a large collection of continuous and discrete probability distributions in the scipy.stats submodule. To simulate random variables from a specific distribution, call the .rvs method on the distribution's name. To generate $n$=1000 $N(0,1)$ random variables and plot their histogram,

In [34]:

from scipy.stats import norm
x = norm.rvs(loc = 0, scale = 1, size = 1000)
plt.hist(x)
plt.title('Histogram of 1000 normal random variables')

Out[34]:

<matplotlib.text.Text at 0x1089af590>
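
Besides .rvs, each distribution object exposes its density, cumulative distribution function and quantile function, and the submodule also provides the statistical tests mentioned earlier. The sketch below evaluates a few of these for the standard normal and runs a one-sample $t$-test; the sample (200 draws with true mean 0.5) and the seed are assumptions made up for this example.

```python
import numpy as np
from scipy import stats

density_at_zero = stats.norm.pdf(0)    # height of the N(0,1) density at 0
prob_below = stats.norm.cdf(1.96)      # P(Z <= 1.96), about 0.975
cutoff = stats.norm.ppf(0.975)         # inverse of the CDF, about 1.96

np.random.seed(1)                      # reproducible draws
sample = stats.norm.rvs(loc=0.5, scale=1, size=200)
t_stat, p_value = stats.ttest_1samp(sample, 0)   # test H0: the mean is 0

print(density_at_zero)
print(prob_below)
print(cutoff)
print(p_value < 0.05)
```

Because the sample was drawn with a mean well away from 0, the test should reject the null hypothesis at the usual 5% level.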

More Pandas and Matplotlib¶

Motor Trend Car Road Tests Data¶

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). This dataset is available on Github in the 2014_data repository and is called mtcars.csv.

Reading in the mtcars data (CSV file) from the web¶

This is a .csv file, so we will use the function read_csv() that will read in a CSV file into a pandas DataFrame.

In [35]:

url = 'https://raw.githubusercontent.com/cs109/2014_data/master/mtcars.csv'
mtcars = pd.read_csv(url, sep = ',', index_col=0)
mtcars.head()

Out[35]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2

In [36]:

# DataFrame with 32 observations on 11 variables
mtcars.shape 

Out[36]:

(32, 11)

In [37]:

# return the column names
mtcars.columns

Out[37]:

Index([u'mpg', u'cyl', u'disp', u'hp', u'drat', u'wt', u'qsec', u'vs', u'am', u'gear', u'carb'], dtype='object')

Here is a table containing a description of all the column names.

Column name	Description
mpg	Miles/(US) gallon
cyl	Number of cylinders
disp	Displacement (cu.in.)
hp	Gross horsepower
drat	Rear axle ratio
wt	Weight (1000 lbs)
qsec	1/4 mile time (seconds)
vs	Engine shape (0 = V-shaped, 1 = straight)
am	Transmission (0 = automatic, 1 = manual)
gear	Number of forward gears
carb	Number of carburetors

In [38]:

# return the actual data inside the pandas DataFrame
mtcars.values

Out[38]:

array([[  21.   ,    6.   ,  160.   ,  110.   ,    3.9  ,    2.62 ,
          16.46 ,    0.   ,    1.   ,    4.   ,    4.   ],
       [  21.   ,    6.   ,  160.   ,  110.   ,    3.9  ,    2.875,
          17.02 ,    0.   ,    1.   ,    4.   ,    4.   ],
       [  22.8  ,    4.   ,  108.   ,   93.   ,    3.85 ,    2.32 ,
          18.61 ,    1.   ,    1.   ,    4.   ,    1.   ],
       [  21.4  ,    6.   ,  258.   ,  110.   ,    3.08 ,    3.215,
          19.44 ,    1.   ,    0.   ,    3.   ,    1.   ],
       [  18.7  ,    8.   ,  360.   ,  175.   ,    3.15 ,    3.44 ,
          17.02 ,    0.   ,    0.   ,    3.   ,    2.   ],
       [  18.1  ,    6.   ,  225.   ,  105.   ,    2.76 ,    3.46 ,
          20.22 ,    1.   ,    0.   ,    3.   ,    1.   ],
       [  14.3  ,    8.   ,  360.   ,  245.   ,    3.21 ,    3.57 ,
          15.84 ,    0.   ,    0.   ,    3.   ,    4.   ],
       [  24.4  ,    4.   ,  146.7  ,   62.   ,    3.69 ,    3.19 ,
          20.   ,    1.   ,    0.   ,    4.   ,    2.   ],
       [  22.8  ,    4.   ,  140.8  ,   95.   ,    3.92 ,    3.15 ,
          22.9  ,    1.   ,    0.   ,    4.   ,    2.   ],
       [  19.2  ,    6.   ,  167.6  ,  123.   ,    3.92 ,    3.44 ,
          18.3  ,    1.   ,    0.   ,    4.   ,    4.   ],
       [  17.8  ,    6.   ,  167.6  ,  123.   ,    3.92 ,    3.44 ,
          18.9  ,    1.   ,    0.   ,    4.   ,    4.   ],
       [  16.4  ,    8.   ,  275.8  ,  180.   ,    3.07 ,    4.07 ,
          17.4  ,    0.   ,    0.   ,    3.   ,    3.   ],
       [  17.3  ,    8.   ,  275.8  ,  180.   ,    3.07 ,    3.73 ,
          17.6  ,    0.   ,    0.   ,    3.   ,    3.   ],
       [  15.2  ,    8.   ,  275.8  ,  180.   ,    3.07 ,    3.78 ,
          18.   ,    0.   ,    0.   ,    3.   ,    3.   ],
       [  10.4  ,    8.   ,  472.   ,  205.   ,    2.93 ,    5.25 ,
          17.98 ,    0.   ,    0.   ,    3.   ,    4.   ],
       [  10.4  ,    8.   ,  460.   ,  215.   ,    3.   ,    5.424,
          17.82 ,    0.   ,    0.   ,    3.   ,    4.   ],
       [  14.7  ,    8.   ,  440.   ,  230.   ,    3.23 ,    5.345,
          17.42 ,    0.   ,    0.   ,    3.   ,    4.   ],
       [  32.4  ,    4.   ,   78.7  ,   66.   ,    4.08 ,    2.2  ,
          19.47 ,    1.   ,    1.   ,    4.   ,    1.   ],
       [  30.4  ,    4.   ,   75.7  ,   52.   ,    4.93 ,    1.615,
          18.52 ,    1.   ,    1.   ,    4.   ,    2.   ],
       [  33.9  ,    4.   ,   71.1  ,   65.   ,    4.22 ,    1.835,
          19.9  ,    1.   ,    1.   ,    4.   ,    1.   ],
       [  21.5  ,    4.   ,  120.1  ,   97.   ,    3.7  ,    2.465,
          20.01 ,    1.   ,    0.   ,    3.   ,    1.   ],
       [  15.5  ,    8.   ,  318.   ,  150.   ,    2.76 ,    3.52 ,
          16.87 ,    0.   ,    0.   ,    3.   ,    2.   ],
       [  15.2  ,    8.   ,  304.   ,  150.   ,    3.15 ,    3.435,
          17.3  ,    0.   ,    0.   ,    3.   ,    2.   ],
       [  13.3  ,    8.   ,  350.   ,  245.   ,    3.73 ,    3.84 ,
          15.41 ,    0.   ,    0.   ,    3.   ,    4.   ],
       [  19.2  ,    8.   ,  400.   ,  175.   ,    3.08 ,    3.845,
          17.05 ,    0.   ,    0.   ,    3.   ,    2.   ],
       [  27.3  ,    4.   ,   79.   ,   66.   ,    4.08 ,    1.935,
          18.9  ,    1.   ,    1.   ,    4.   ,    1.   ],
       [  26.   ,    4.   ,  120.3  ,   91.   ,    4.43 ,    2.14 ,
          16.7  ,    0.   ,    1.   ,    5.   ,    2.   ],
       [  30.4  ,    4.   ,   95.1  ,  113.   ,    3.77 ,    1.513,
          16.9  ,    1.   ,    1.   ,    5.   ,    2.   ],
       [  15.8  ,    8.   ,  351.   ,  264.   ,    4.22 ,    3.17 ,
          14.5  ,    0.   ,    1.   ,    5.   ,    4.   ],
       [  19.7  ,    6.   ,  145.   ,  175.   ,    3.62 ,    2.77 ,
          15.5  ,    0.   ,    1.   ,    5.   ,    6.   ],
       [  15.   ,    8.   ,  301.   ,  335.   ,    3.54 ,    3.57 ,
          14.6  ,    0.   ,    1.   ,    5.   ,    8.   ],
       [  21.4  ,    4.   ,  121.   ,  109.   ,    4.11 ,    2.78 ,
          18.6  ,    1.   ,    1.   ,    4.   ,    2.   ]])

In [39]:

mtcars[25:] # rows 25 to end of data frame

Out[39]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.9	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.7	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.5	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.5	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.6	0	1	5	8
Volvo 142E	21.4	4	121.0	109	4.11	2.780	18.6	1	1	4	2

In [40]:

# return index
mtcars.index

Out[40]:

Index([u'Mazda RX4', u'Mazda RX4 Wag', u'Datsun 710', u'Hornet 4 Drive', u'Hornet Sportabout', u'Valiant', u'Duster 360', u'Merc 240D', u'Merc 230', u'Merc 280', u'Merc 280C', u'Merc 450SE', u'Merc 450SL', u'Merc 450SLC', u'Cadillac Fleetwood', u'Lincoln Continental', u'Chrysler Imperial', u'Fiat 128', u'Honda Civic', u'Toyota Corolla', u'Toyota Corona', u'Dodge Challenger', u'AMC Javelin', u'Camaro Z28', u'Pontiac Firebird', u'Fiat X1-9', u'Porsche 914-2', u'Lotus Europa', u'Ford Pantera L', u'Ferrari Dino', u'Maserati Bora', u'Volvo 142E'], dtype='object')

In [41]:

mtcars.loc['Maserati Bora'] # access a row by its index label

Out[41]:

mpg      15.00
cyl       8.00
disp    301.00
hp      335.00
drat      3.54
wt        3.57
qsec     14.60
vs        0.00
am        1.00
gear      5.00
carb      8.00
Name: Maserati Bora, dtype: float64
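
Rows can be selected either by label or by integer position. The sketch below uses a three-row frame whose values are copied from the mtcars output above (the name cars is made up for this example) so that it runs without downloading the file.

```python
import pandas as pd

# Three rows copied from the mtcars table shown earlier
cars = pd.DataFrame(
    {'mpg': [21.0, 22.8, 15.0], 'cyl': [6, 4, 8], 'hp': [110, 93, 335]},
    index=['Mazda RX4', 'Datsun 710', 'Maserati Bora'],
    columns=['mpg', 'cyl', 'hp'])

by_label = cars.loc['Maserati Bora']        # row selected by index label
by_position = cars.iloc[2]                  # same row selected by integer position
one_value = cars.loc['Datsun 710', 'mpg']   # single cell: row label, column label

print(by_label)
print(one_value)
```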

In [42]:

# What other methods are available when working with pandas DataFrames?
# type 'mtcars.' and then click <TAB>
# mtcars.<TAB>

# try it here

Exploratory Data Analysis (EDA)¶

Even though they may look like continuous variables, cyl, vs, am, gear and carb are integer or categorical variables. First, let's look at some summary statistics of the mtcars data set.

In [43]:

mtcars.describe()

Out[43]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
count	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.000000	32.0000
mean	20.090625	6.187500	230.721875	146.687500	3.596563	3.217250	17.848750	0.437500	0.406250	3.687500	2.8125
std	6.026948	1.785922	123.938694	68.562868	0.534679	0.978457	1.786943	0.504016	0.498991	0.737804	1.6152
min	10.400000	4.000000	71.100000	52.000000	2.760000	1.513000	14.500000	0.000000	0.000000	3.000000	1.0000
25%	15.425000	4.000000	120.825000	96.500000	3.080000	2.581250	16.892500	0.000000	0.000000	3.000000	2.0000
50%	19.200000	6.000000	196.300000	123.000000	3.695000	3.325000	17.710000	0.000000	0.000000	4.000000	2.0000
75%	22.800000	8.000000	326.000000	180.000000	3.920000	3.610000	18.900000	1.000000	1.000000	4.000000	4.0000
max	33.900000	8.000000	472.000000	335.000000	4.930000	5.424000	22.900000	1.000000	1.000000	5.000000	8.0000
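
For the categorical columns, counts and group-wise summaries are usually more informative than means and quartiles. This sketch uses the first five rows of mtcars, typed in by hand from the output above, together with value_counts() and groupby(); the name cars is an assumption for this example.

```python
import pandas as pd

# First five rows of mtcars (mpg and cyl only), copied from the output above
cars = pd.DataFrame(
    {'mpg': [21.0, 21.0, 22.8, 21.4, 18.7], 'cyl': [6, 6, 4, 6, 8]},
    index=['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710',
           'Hornet 4 Drive', 'Hornet Sportabout'],
    columns=['mpg', 'cyl'])

counts = cars['cyl'].value_counts()            # number of cars per cylinder count
mean_mpg = cars.groupby('cyl')['mpg'].mean()   # average mpg within each cylinder group

print(counts)
print(mean_mpg)
```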

Using conditional statements¶

To check if any or all elements in an array meet a certain criterion, use any() and all().

In [44]:

(mtcars.mpg >= 20).any()

Out[44]:

True

In [45]:

(mtcars > 0).all()

Out[45]:

mpg      True
cyl      True
disp     True
hp       True
drat     True
wt       True
qsec     True
vs      False
am      False
gear     True
carb     True
dtype: bool
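
The same boolean comparisons can be used as a mask to select rows. The sketch below works on four rows copied from the mtcars output above (the frame name cars is made up for this example); conditions are combined with & and each one needs its own parentheses.

```python
import pandas as pd

# Four rows of mtcars (mpg and am only), copied from the output above
cars = pd.DataFrame(
    {'mpg': [21.0, 22.8, 18.7, 15.0], 'am': [1, 1, 0, 1]},
    index=['Mazda RX4', 'Datsun 710', 'Hornet Sportabout', 'Maserati Bora'],
    columns=['mpg', 'am'])

efficient = cars[cars.mpg >= 20]                              # rows with mpg of at least 20
manual_efficient = cars[(cars.mpg >= 20) & (cars.am == 1)]    # both conditions must hold
n_efficient = (cars.mpg >= 20).sum()                          # each True counts as 1

print(efficient)
print(n_efficient)
```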

Let's look at the distribution of mpg using a histogram.

In [46]:

mtcars['mpg'].hist()
plt.title('Distribution of MPG')
plt.xlabel('Miles Per Gallon')

Out[46]:

<matplotlib.text.Text at 0x108a22550>

In [47]:

# Relationship between cyl and mpg
plt.plot(mtcars.cyl, mtcars.mpg, 'o')
plt.xlim(3, 9)
plt.xlabel('Cylinders')
plt.ylabel('MPG')
plt.title('Relationship between cylinders and MPG')

Out[47]:

<matplotlib.text.Text at 0x10969b5d0>

In [48]:

# Relationship between horsepower and mpg
plt.plot(mtcars.hp, mtcars.mpg, 'o')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Relationship between horsepower and MPG')

Out[48]:

<matplotlib.text.Text at 0x1097bc150>

In [49]:

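# scatter_matrix lives in pandas.plotting in current pandas releases
# (older releases had it under pandas.tools.plotting)
from pandas.plotting import scatter_matrix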
scatter_matrix(mtcars[['mpg', 'hp', 'cyl']], 
               figsize = (10, 6), alpha = 1, diagonal='kde')

Out[49]:

array([[<matplotlib.axes.AxesSubplot object at 0x109811350>,
        <matplotlib.axes.AxesSubplot object at 0x10990a910>,
        <matplotlib.axes.AxesSubplot object at 0x1099899d0>],
       [<matplotlib.axes.AxesSubplot object at 0x1099eacd0>,
        <matplotlib.axes.AxesSubplot object at 0x109c6cc90>,
        <matplotlib.axes.AxesSubplot object at 0x109cd0b90>],
       [<matplotlib.axes.AxesSubplot object at 0x109d4dc50>,
        <matplotlib.axes.AxesSubplot object at 0x109d89d90>,
        <matplotlib.axes.AxesSubplot object at 0x109f3ca50>]], dtype=object)

Working on the command line¶

Now we will discuss working on the command line. For this section and the next section on git and GitHub we will use slides from the Data Science Specialization course on Coursera. These slides are available from

Command line interface

Introduction to git and GitHub¶

Next we introduce git and GitHub. For this section we will also use slides from the Data Science Specialization course on Coursera. These slides are available from

Other useful resources for learning git and github:

Your turn¶

If you don't have a GitHub account yet, register for one
Use git clone to clone the CS109 2014 course repository on GitHub
Use git clone to clone the CS109 2014 data repository on GitHub
