In God we trust, all others bring data. - The Elements of Statistical Learning
Big Data, Data Analytics, Data Science: these are the common buzzwords of the data world, so much so that data has been called the "new oil". There are excellent data-specific programming tools such as SAS, R, and Hadoop. Still, a more general scripting language like Python is attractive for data analysis because it allows you to combine data tasks with scientific programming.
One major issue for statistical programmers using Python has, in the past, been the lack of libraries implementing standard models and of a cohesive framework for specifying them. Pandas, the data analysis library which has been in development since 2008, aims to bridge this gap.
Pandas derives its name from panel data, a term commonly used in statistics and econometrics for multi-dimensional datasets.
Data analysis is only as good as its visualization. Today we will work through a number of datasets together with Python's plotting library, matplotlib, to illustrate what we learn. We begin with the standard imports:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
#IPython magic command for inline plotting
%matplotlib inline
#a better plot shape for IPython
mpl.rcParams['figure.figsize']=[15,3]
Matplotlib is the primary plotting library in Python; a separate notebook in a subsequent session will be dedicated to its features. For plotting with pandas today, we only touch on the very basics of matplotlib.
x = np.linspace(0, 1, 10001)[1:]  #skip x = 0 so that np.pi/x below does not divide by zero
y = np.cos(np.pi/x) * np.exp(-x**2)
plt.plot(x, y)
plt.show()
x=np.linspace(-1, 2, 10001)
y = x**2*np.exp(-x)
plt.plot(x, y)
plt.show()
The pandas data analysis module provides data structures and tools for data analysis. It focuses on data handling and manipulation, as well as linear and panel regression. It is designed to let you carry out your entire data workflow in Python without having to switch to a domain-specific language such as R. Although largely compatible with NumPy/SciPy, pandas differs in some important ways in indexing, data organization, and features. The basic pandas data types are not ndarray but Series and DataFrame, which allow you to index data and align axes efficiently.
A Series object is a one-dimensional array which can hold any data type. Like a dictionary, it has a set of indices for access (like keys); unlike a dictionary, it is ordered. Data alignment is intrinsic and will not be broken unless you break it explicitly. A Series is very similar to a NumPy ndarray, and an arbitrary list of values or axis labels can be used as the index (so it can also act something like a dict).
s = pd.Series([1,5,float('NaN'),7.5,2.1,3])
print(s)
0    1.0
1    5.0
2    NaN
3    7.5
4    2.1
5    3.0
dtype: float64
dates = pd.date_range('20140201', periods=s.size)
s.index = dates
print(s)
2014-02-01    1.0
2014-02-02    5.0
2014-02-03    NaN
2014-02-04    7.5
2014-02-05    2.1
2014-02-06    3.0
Freq: D, dtype: float64
letters = ['A', 'B', 'Ch', '#', '#', '---']
s.index = letters
print(s)
print('\nAccess is like a dictionary key:\ns[\'---\'] = '+str(s['---']))
print('\nRepeat labels are possible:\ns[\'#\']=\n'+str(s['#']))
A      1.0
B      5.0
Ch     NaN
#      7.5
#      2.1
---    3.0
dtype: float64

Access is like a dictionary key:
s['---'] = 3.0

Repeat labels are possible:
s['#']=
#    7.5
#    2.1
dtype: float64
NumPy functions expecting an ndarray often do just fine with Series as well.
t = np.exp(s)
print(t)
A         2.718282
B       148.413159
Ch             NaN
#      1808.042414
#         8.166170
---      20.085537
dtype: float64
Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.upper()
0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
s.str.len()
0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
print s2
0    a_b_c
1    c_d_e
2      NaN
3    f_g_h
dtype: object
s2.str.split('_')
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object
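The elements of the resulting lists can be pulled out with str.get (described in the table below); a quick sketch, grabbing the middle token from each row:

s2.str.split('_').str.get(1)

0      b
1      d
2    NaN
3      g
dtype: object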
Method | Description |
---|---|
cat | Concatenate strings |
split | Split strings on delimiter |
get | Index into each element (retrieve i-th element) |
join | Join strings in each element of the Series with passed separator |
contains | Return boolean array indicating whether each string contains pattern/regex |
replace | Replace occurrences of pattern/regex with some other string |
repeat | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
pad | Add whitespace to left, right, or both sides of strings |
center | Equivalent to pad(side='both') |
wrap | Split long strings into lines with length less than a given width |
slice | Slice each string in the Series |
slice_replace | Replace slice in each string with passed value |
count | Count occurrences of pattern |
startswith | Equivalent to str.startswith(pat) for each element |
endswith | Equivalent to str.endswith(pat) for each element |
findall | Compute list of all occurrences of pattern/regex for each string |
match | Call re.match on each element, returning matched groups as list |
extract | Call re.match on each element, as match does, but return matched groups as strings for convenience. |
len | Compute string lengths |
strip | Equivalent to str.strip |
rstrip | Equivalent to str.rstrip |
lstrip | Equivalent to str.lstrip |
lower | Equivalent to str.lower |
upper | Equivalent to str.upper |
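A few of these in action (a minimal sketch; s3 is a new throwaway Series introduced here for illustration):

s3 = pd.Series(['  cat', 'dog  ', np.nan, 'catfish'])
print(s3.str.strip())             #trim whitespace from both sides
print(s3.str.contains('cat'))     #True/False per element; NaN stays NaN
print(s3.str.replace('cat', 'hat'))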
In most data scenarios, you will receive a comma-separated values (csv) file on which you need to perform your analysis. Reading a csv file into Python can be achieved with the read_csv function. We will use the data from this website, about how many people were on 7 different bike paths in Montreal each day. Let's use the data from 2012.
broken_df = pd.read_csv('2012.csv')
#Look at the first 4 rows
broken_df[:4]
Date | Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 01/01/2012 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
1 | 02/01/2012 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
2 | 03/01/2012 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
3 | 04/01/2012 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
fixed_df = pd.read_csv('2012.csv', index_col='Date')
fixed_df[:3]
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
01/01/2012 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
02/01/2012 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
03/01/2012 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
When we read the csv file into broken_df, we created a two-dimensional data structure called a DataFrame. The DataFrame object is similar to a table or a spreadsheet in Excel, i.e. a 2D matrix-like object.
s = pd.Series([1,5,float('NaN'),7.5,2.1,3])
df = pd.DataFrame(s, columns=['x'])
print(df)
     x
0  1.0
1  5.0
2  NaN
3  7.5
4  2.1
5  3.0
t=np.exp(s)
df['exp(x)'] = t
df['exp(exp(x))'] = np.exp(t)
print(df)
     x       exp(x)   exp(exp(x))
0  1.0     2.718282  1.515426e+01
1  5.0   148.413159  2.851124e+64
2  NaN          NaN           NaN
3  7.5  1808.042414           inf
4  2.1     8.166170  3.519837e+03
5  3.0    20.085537  5.284913e+08
There are a number of ways to access the elements of a DataFrame.
print(df['x'], '\n') #column; in Python 2 (without importing print_function) this prints a tuple, as seen in the output below
#letters = ['A', 'B', 'Ch', '#', '#', '---']
#df.index=letters
#print(df.loc['#'], '\n') #row by label
#print(df.iloc[3], '\n') #row by number (note the transposition in output!)
print(df[1:4]) #row by slice
(0    1.0
1    5.0
2    NaN
3    7.5
4    2.1
5    3.0
Name: x, dtype: float64, '\n')
     x       exp(x)   exp(exp(x))
1  5.0   148.413159  2.851124e+64
2  NaN          NaN           NaN
3  7.5  1808.042414           inf
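For label-based and positional row access, .loc and .iloc behave as in the commented-out lines above; with the default integer index the two coincide. A quick sketch:

print(df.loc[2])   #row whose index label is 2
print(df.iloc[2])  #third row by position; the same row here, since the index is 0..5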
df1=pd.DataFrame(np.random.randn(dates.size,4),index=dates,columns=list('ABCD'))
print df1
                   A         B         C         D
2014-02-01 -1.088830 -0.843649  0.923378 -0.857428
2014-02-02 -0.170466 -0.381519  0.437727 -0.146664
2014-02-03 -1.127080  0.098241  3.094603  0.250264
2014-02-04 -0.475779  0.803705 -0.216043  0.305970
2014-02-05  0.163283  1.145404  1.486144  0.894497
2014-02-06 -0.329314  0.235182 -0.552184 -0.436983
Using the DataFrame df1 created above, perform the following operations:
1. df1.head() and df1.tail()
2. df1.describe()
3. df1.T
4. df1.sort(columns='B')
5. df1.columns, df1.index, df1.values
df1.sort(columns=list('B'))
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.088830 | -0.843649 | 0.923378 | -0.857428 |
2014-02-02 | -0.170466 | -0.381519 | 0.437727 | -0.146664 |
2014-02-03 | -1.127080 | 0.098241 | 3.094603 | 0.250264 |
2014-02-06 | -0.329314 | 0.235182 | -0.552184 | -0.436983 |
2014-02-04 | -0.475779 | 0.803705 | -0.216043 | 0.305970 |
2014-02-05 | 0.163283 | 1.145404 | 1.486144 | 0.894497 |
Now let us look at the cyclist DataFrame we created. To extract a column from the DataFrame:
fixed_df['Berri 1']
Date
01/01/2012      35
02/01/2012      83
03/01/2012     135
04/01/2012     144
05/01/2012     197
06/01/2012     146
07/01/2012      98
08/01/2012      95
09/01/2012     244
10/01/2012     397
11/01/2012     273
12/01/2012     157
13/01/2012      75
14/01/2012      32
15/01/2012      54
...
22/10/2012    3650
23/10/2012    4177
24/10/2012    3744
25/10/2012    3735
26/10/2012    4290
27/10/2012    1857
28/10/2012    1310
29/10/2012    2919
30/10/2012    2887
31/10/2012    2634
01/11/2012    2405
02/11/2012    1582
03/11/2012     844
04/11/2012     966
05/11/2012    2247
Name: Berri 1, Length: 310, dtype: int64
We can use Boolean indexing on columns to extract the rows satisfying our desired conditions. For example, to extract all data from the cyclist data set where the value in the column Berri 1 is greater than 1000:
fixed_df[fixed_df['Berri 1'] > 1000]
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
18/03/2012 | 1940 | NaN | 856 | 1036 | 1923 | 1021 | 1128 | 2477 | NaN |
19/03/2012 | 1821 | NaN | 1024 | 1278 | 2581 | 1609 | 506 | 2058 | NaN |
20/03/2012 | 2481 | NaN | 1261 | 1709 | 3130 | 1955 | 762 | 2609 | NaN |
21/03/2012 | 2829 | NaN | 1558 | 1893 | 3510 | 2225 | 993 | 2846 | NaN |
22/03/2012 | 2195 | NaN | 1030 | 1640 | 2654 | 1958 | 548 | 2254 | NaN |
23/03/2012 | 2115 | NaN | 1143 | 1512 | 2955 | 1791 | 663 | 2325 | NaN |
27/03/2012 | 1049 | NaN | 517 | 774 | 1576 | 972 | 163 | 1207 | NaN |
30/03/2012 | 1157 | NaN | 529 | 910 | 1596 | 957 | 196 | 1288 | NaN |
02/04/2012 | 1937 | NaN | 967 | 1537 | 2853 | 1614 | 394 | 2122 | NaN |
03/04/2012 | 2416 | NaN | 1078 | 1791 | 3556 | 1880 | 513 | 2450 | NaN |
04/04/2012 | 2211 | NaN | 933 | 1674 | 2956 | 1666 | 274 | 2242 | NaN |
05/04/2012 | 2424 | NaN | 1036 | 1823 | 3273 | 1699 | 355 | 2463 | NaN |
06/04/2012 | 1633 | NaN | 650 | 1045 | 1913 | 975 | 621 | 2138 | NaN |
07/04/2012 | 1208 | NaN | 494 | 739 | 1445 | 709 | 598 | 1566 | NaN |
08/04/2012 | 1164 | NaN | 560 | 621 | 1333 | 704 | 792 | 1533 | NaN |
10/04/2012 | 2183 | NaN | 909 | 1588 | 2932 | 1736 | 252 | 2108 | NaN |
11/04/2012 | 2328 | NaN | 1049 | 1765 | 3122 | 1843 | 330 | 2311 | NaN |
12/04/2012 | 3064 | NaN | 1483 | 2306 | 4076 | 2280 | 590 | 3213 | NaN |
13/04/2012 | 3341 | NaN | 1505 | 2565 | 4465 | 2358 | 922 | 3728 | NaN |
14/04/2012 | 2890 | NaN | 1072 | 1639 | 2994 | 1594 | 1284 | 3428 | NaN |
15/04/2012 | 2554 | NaN | 1210 | 1637 | 2954 | 1559 | 1846 | 3604 | NaN |
16/04/2012 | 3643 | NaN | 1841 | 2723 | 4830 | 2677 | 1061 | 3616 | NaN |
17/04/2012 | 3539 | NaN | 1616 | 2636 | 4592 | 2450 | 544 | 3333 | NaN |
18/04/2012 | 3570 | NaN | 1751 | 2759 | 4655 | 2534 | 706 | 3542 | NaN |
19/04/2012 | 4231 | NaN | 2010 | 3235 | 5311 | 2877 | 1206 | 3929 | NaN |
20/04/2012 | 2087 | NaN | 800 | 1529 | 2922 | 1531 | 170 | 2065 | NaN |
22/04/2012 | 1853 | NaN | 487 | 1224 | 1331 | 654 | 198 | 1779 | NaN |
24/04/2012 | 1810 | NaN | 720 | 1355 | 2379 | 1286 | 188 | 1753 | NaN |
25/04/2012 | 2966 | NaN | 1023 | 2228 | 3444 | 1800 | 445 | 2454 | NaN |
26/04/2012 | 2751 | NaN | 1069 | 2196 | 3546 | 1789 | 381 | 2438 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
04/10/2012 | 4034 | NaN | 2025 | 2705 | 4850 | 3066 | 555 | 3418 | NaN |
05/10/2012 | 4151 | NaN | 1977 | 2799 | 4688 | 2844 | 1035 | 4088 | NaN |
06/10/2012 | 1304 | NaN | 469 | 933 | 1589 | 776 | 236 | 1775 | NaN |
07/10/2012 | 1580 | NaN | 660 | 922 | 1629 | 860 | 695 | 2052 | NaN |
08/10/2012 | 1854 | NaN | 880 | 987 | 1818 | 1040 | 1115 | 2502 | NaN |
09/10/2012 | 4787 | NaN | 2210 | 3026 | 5138 | 3418 | 927 | 4078 | NaN |
10/10/2012 | 3115 | NaN | 1537 | 2081 | 3681 | 2608 | 560 | 2703 | NaN |
11/10/2012 | 3746 | NaN | 1857 | 2569 | 4694 | 3034 | 558 | 3457 | NaN |
12/10/2012 | 3169 | NaN | 1460 | 2261 | 4045 | 2564 | 448 | 3224 | NaN |
13/10/2012 | 1783 | NaN | 802 | 1205 | 2113 | 1183 | 681 | 2309 | NaN |
15/10/2012 | 3292 | NaN | 1678 | 2165 | 4197 | 2754 | 560 | 3183 | NaN |
16/10/2012 | 3739 | NaN | 1858 | 2684 | 4681 | 2997 | 554 | 3593 | NaN |
17/10/2012 | 4098 | NaN | 1964 | 2645 | 4836 | 3063 | 728 | 3834 | NaN |
18/10/2012 | 4671 | NaN | 2292 | 3129 | 5542 | 3477 | 1108 | 4245 | NaN |
19/10/2012 | 1313 | NaN | 597 | 885 | 1668 | 1209 | 111 | 1486 | NaN |
20/10/2012 | 2011 | NaN | 748 | 1323 | 2266 | 1213 | 797 | 2243 | NaN |
21/10/2012 | 1277 | NaN | 609 | 869 | 1777 | 898 | 242 | 1648 | NaN |
22/10/2012 | 3650 | NaN | 1819 | 2495 | 4800 | 3023 | 757 | 3721 | NaN |
23/10/2012 | 4177 | NaN | 1997 | 2795 | 5216 | 3233 | 795 | 3554 | NaN |
24/10/2012 | 3744 | NaN | 1868 | 2625 | 4900 | 3035 | 649 | 3622 | NaN |
25/10/2012 | 3735 | NaN | 1815 | 2528 | 5010 | 3017 | 631 | 3767 | NaN |
26/10/2012 | 4290 | NaN | 1987 | 2754 | 5246 | 3000 | 1456 | 4578 | NaN |
27/10/2012 | 1857 | NaN | 792 | 1244 | 2461 | 1193 | 618 | 2471 | NaN |
28/10/2012 | 1310 | NaN | 697 | 910 | 1776 | 955 | 387 | 1876 | NaN |
29/10/2012 | 2919 | NaN | 1458 | 2071 | 3768 | 2440 | 411 | 2795 | NaN |
30/10/2012 | 2887 | NaN | 1251 | 2007 | 3516 | 2255 | 338 | 2790 | NaN |
31/10/2012 | 2634 | NaN | 1294 | 1835 | 3453 | 2220 | 245 | 2570 | NaN |
01/11/2012 | 2405 | NaN | 1208 | 1701 | 3082 | 2076 | 165 | 2461 | NaN |
02/11/2012 | 1582 | NaN | 737 | 1109 | 2277 | 1392 | 97 | 1888 | NaN |
05/11/2012 | 2247 | NaN | 1170 | 1705 | 3221 | 2143 | 179 | 2430 | NaN |
218 rows × 9 columns
df = pd.DataFrame(np.random.randn(dates.size, 4), index=dates, columns=list('ABCD'))
print df
                   A         B         C         D
2014-02-01 -1.054596  1.121003 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
df[df.B>0]
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.054596 | 1.121003 | -0.320041 | -0.692536 |
2014-02-03 | -0.009586 | 0.361285 | 1.257356 | 2.206935 |
2014-02-04 | 0.280065 | 0.011517 | 0.602386 | 0.275055 |
2014-02-05 | 2.100321 | 1.131649 | -0.251465 | 0.250192 |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | NaN | 1.121003 | NaN | NaN |
2014-02-02 | 0.714781 | NaN | 1.067904 | NaN |
2014-02-03 | NaN | 0.361285 | 1.257356 | 2.206935 |
2014-02-04 | 0.280065 | 0.011517 | 0.602386 | 0.275055 |
2014-02-05 | 2.100321 | 1.131649 | NaN | 0.250192 |
2014-02-06 | 0.808438 | NaN | NaN | NaN |
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print df2
                   A         B         C         D      E
2014-02-01 -1.054596  1.121003 -0.320041 -0.692536    one
2014-02-02  0.714781 -0.604180  1.067904 -1.194036    one
2014-02-03 -0.009586  0.361285  1.257356  2.206935    two
2014-02-04  0.280065  0.011517  0.602386  0.275055  three
2014-02-05  2.100321  1.131649 -0.251465  0.250192   four
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994  three
df2[df2['E'].isin(['one'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.054596 | 1.121003 | -0.320041 | -0.692536 | one |
2014-02-02 | 0.714781 | -0.604180 | 1.067904 | -1.194036 | one |
df.at[dates[0],'A'] = 0
print df
                   A         B         C         D
2014-02-01  0.000000  1.121003 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
df.iat[0,1] = 0
print df
                   A         B         C         D
2014-02-01  0.000000  0.000000 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],
'B': [randint(1, 9)*10 for x in xrange(10)],
'C': [randint(1, 9)*100 for x in xrange(10)]})
print df
   A   B    C
0  8  30  600
1  8  10  300
2  8  80  600
3  6  80  300
4  3  10  600
5  4  90  700
6  2  50  300
7  4  40  700
8  8  40  100
9  7  70  100
Find the entries in A for which the corresponding values in B are greater than 50 and those in C are equal to 900.
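A possible solution sketch: combine the two conditions with & (element-wise AND). Note that each comparison needs its own parentheses, and the result may well be empty for a given random draw.

print(df[(df['B'] > 50) & (df['C'] == 900)]['A'])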
fixed_df['Berri 1'].plot()
<matplotlib.axes.AxesSubplot at 0x10c685110>
Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. For example, in a collection of financial time series, some of the time series might start on different dates. Thus, values prior to the start date would generally be marked as missing.
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing-value marker (for reasons of computational speed and convenience), we need to be able to detect missing values easily in data of different types: floating point, integer, boolean, and general object. In many cases, however, Python's None will arise, and we wish to consider that "missing" or "null" as well.
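For instance, both markers are picked up by pd.isnull (a quick check on a new throwaway Series):

print(pd.isnull(pd.Series([1.0, np.nan, None])))
#0    False
#1     True
#2     True
#dtype: bool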
df= pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print df
                   A         B         C         D
2014-02-01 -1.059138 -0.196474  0.239179  1.191028
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749
2014-02-04 -0.340011  1.304989  0.717388 -0.268375
2014-02-05 -0.431061 -0.681776  0.147624  0.209896
2014-02-06 -0.326568  0.577446  0.682139  0.210614
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
print df1
                   A         B         C         D   E
2014-02-01 -1.059138 -0.196474  0.239179  1.191028 NaN
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219 NaN
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749 NaN
2014-02-04 -0.340011  1.304989  0.717388 -0.268375 NaN
df1.loc[dates[0]:dates[1],'E'] = 1
print df1
                   A         B         C         D    E
2014-02-01 -1.059138 -0.196474  0.239179  1.191028    1
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219    1
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749  NaN
2014-02-04 -0.340011  1.304989  0.717388 -0.268375  NaN
df1.dropna(how='all') #any
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.059138 | -0.196474 | 0.239179 | 1.191028 | 1 |
2014-02-02 | -0.641067 | 1.734050 | -0.359996 | -0.126219 | 1 |
2014-02-03 | 0.357383 | -0.664820 | -0.601961 | -0.964749 | NaN |
2014-02-04 | -0.340011 | 1.304989 | 0.717388 | -0.268375 | NaN |
df1.fillna(value=15)
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.059138 | -0.196474 | 0.239179 | 1.191028 | 1 |
2014-02-02 | -0.641067 | 1.734050 | -0.359996 | -0.126219 | 1 |
2014-02-03 | 0.357383 | -0.664820 | -0.601961 | -0.964749 | 15 |
2014-02-04 | -0.340011 | 1.304989 | 0.717388 | -0.268375 | 15 |
pd.isnull(df1)
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | False | False | False | False | False |
2014-02-02 | False | False | False | False | False |
2014-02-03 | False | False | False | False | True |
2014-02-04 | False | False | False | False | True |
Missing values propagate through arithmetic operations.
df2=pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print df2
df2.loc[dates[0]:dates[2],'B']=float('NaN')
print df2
print df1+df2
                   A         B         C         D
2014-02-01 -1.654418  0.697809  0.991626  0.511788
2014-02-02 -1.163772  0.239813 -0.945453  0.192696
2014-02-03 -0.048510 -1.188224  1.718062  1.868163
2014-02-04 -0.277427 -1.509794  0.360021  2.071887
2014-02-05  0.008847 -2.179037 -0.074886  0.649411
2014-02-06 -0.019938  0.121653  1.180238 -1.312769
                   A         B         C         D
2014-02-01 -1.654418       NaN  0.991626  0.511788
2014-02-02 -1.163772       NaN -0.945453  0.192696
2014-02-03 -0.048510       NaN  1.718062  1.868163
2014-02-04 -0.277427 -1.509794  0.360021  2.071887
2014-02-05  0.008847 -2.179037 -0.074886  0.649411
2014-02-06 -0.019938  0.121653  1.180238 -1.312769
                   A         B         C         D   E
2014-02-01 -2.713556       NaN  1.230805  1.702816 NaN
2014-02-02 -1.804838       NaN -1.305449  0.066476 NaN
2014-02-03  0.308873       NaN  1.116101  0.903414 NaN
2014-02-04 -0.617438 -0.204804  1.077410  1.803513 NaN
2014-02-05        NaN       NaN       NaN       NaN NaN
2014-02-06        NaN       NaN       NaN       NaN NaN
But this can be avoided by using the built-in methods that exclude missing values.
df1['A'].sum()
-1.6828330115428896
df1.mean(1)
2014-02-01    0.234919
2014-02-02    0.321353
2014-02-03   -0.468537
2014-02-04    0.353498
Freq: D, dtype: float64
df2.cumsum()
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.654418 | NaN | 0.991626 | 0.511788 |
2014-02-02 | -2.818190 | NaN | 0.046173 | 0.704484 |
2014-02-03 | -2.866700 | NaN | 1.764235 | 2.572647 |
2014-02-04 | -3.144127 | -1.509794 | 2.124256 | 4.644534 |
2014-02-05 | -3.135281 | -3.688830 | 2.049370 | 5.293945 |
2014-02-06 | -3.155219 | -3.567178 | 3.229609 | 3.981177 |
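Most of these reductions also take a skipna flag if you would rather let the NaN propagate; a quick sketch using the df2 defined above (column B still holds NaN in its first three rows):

print(df2['B'].sum())              #NaNs skipped by default
print(df2['B'].sum(skipna=False))  #NaN propagates: the result is NaN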
#Gaussian numbers histogram
from numpy.random import normal
n = 1000
x = pd.Series(normal(size=n))
#print x
avg = x.mean()
std = x.std()
x_avg = pd.Series(np.ones(n)* avg)
x_stdl = pd.Series(np.ones(n)*(avg-std))
x_stdh = pd.Series(np.ones(n)*(avg+std))
df_gauss=pd.DataFrame({'A':x_stdl,'B':x_stdh,'x':x})
df_gauss.plot(style=['rx','rx','bx'])
plt.figure()
df_gauss['x'].diff().hist(color='g', bins=50)
<matplotlib.axes.AxesSubplot at 0x10c893b10>
df=pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
print df
          A         B         C         D         E
0 -1.353547 -0.059735 -0.597045 -0.299746  1.335253
1 -0.621872 -0.592243  0.060789 -0.366381  0.186925
2  0.382292  0.201983  0.828402 -0.869741 -0.448232
3 -2.099593 -0.471666 -0.422174 -1.474813 -0.173611
4  1.540184 -0.721423 -0.135882 -0.793001 -0.629852
Try the following with df as defined above:
1. df.mean()
2. df.apply(np.cumsum)
3. df.apply(lambda x: x.max() - x.min())
4. Plot a histogram
df.apply(lambda x: x.max() - x.min())
#What does lambda do?
A    3.639777
B    0.923406
C    1.425447
D    1.175067
E    1.965105
dtype: float64
def f(x):
    return x*2

g = lambda x: x*2
print g(3)
6
from pandas import read_csv
from urllib import urlopen
page = urlopen("http://econpy.pythonanywhere.com/ex/NFL_1979.csv")
df = read_csv(page)
print df[:3]
         Date          Visitor  Visitor Score             Home Team  \
0  09/01/1979    Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979  Atlanta Falcons             40    New Orleans Saints
2  09/02/1979  Baltimore Colts              0    Kansas City Chiefs

   Home Score  Line  Total Line
0          31     3          30
1          34     5          32
2          14     1          37
df1=df[0:10]
print df1
         Date             Visitor  Visitor Score             Home Team  \
0  09/01/1979       Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979     Atlanta Falcons             40    New Orleans Saints
2  09/02/1979     Baltimore Colts              0    Kansas City Chiefs
3  09/02/1979  Cincinnati Bengals              0        Denver Broncos
4  09/02/1979    Cleveland Browns             25         New York Jets
5  09/02/1979      Dallas Cowboys             22    St Louis Cardinals
6  09/02/1979   Green Bay Packers              3         Chicago Bears
7  09/02/1979      Houston Oilers             29   Washington Redskins
8  09/02/1979      Miami Dolphins              9         Buffalo Bills
9  09/02/1979     New York Giants             17   Philadelphia Eagles

   Home Score  Line  Total Line
0          31     3        30.0
1          34     5        32.0
2          14     1        37.0
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
A=df1[:3]
B=df1[3:7]
C=df1[7:10]
print A,B,C
         Date          Visitor  Visitor Score             Home Team  \
0  09/01/1979    Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979  Atlanta Falcons             40    New Orleans Saints
2  09/02/1979  Baltimore Colts              0    Kansas City Chiefs

   Home Score  Line  Total Line
0          31     3          30
1          34     5          32
2          14     1          37
         Date             Visitor  Visitor Score           Home Team  \
3  09/02/1979  Cincinnati Bengals              0      Denver Broncos
4  09/02/1979    Cleveland Browns             25       New York Jets
5  09/02/1979      Dallas Cowboys             22  St Louis Cardinals
6  09/02/1979   Green Bay Packers              3       Chicago Bears

   Home Score  Line  Total Line
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
         Date          Visitor  Visitor Score            Home Team  \
7  09/02/1979   Houston Oilers             29  Washington Redskins
8  09/02/1979   Miami Dolphins              9        Buffalo Bills
9  09/02/1979  New York Giants             17  Philadelphia Eagles

   Home Score  Line  Total Line
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
parts=[A,B,C]
df2=pd.concat(parts)
print df2
         Date             Visitor  Visitor Score             Home Team  \
0  09/01/1979       Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979     Atlanta Falcons             40    New Orleans Saints
2  09/02/1979     Baltimore Colts              0    Kansas City Chiefs
3  09/02/1979  Cincinnati Bengals              0        Denver Broncos
4  09/02/1979    Cleveland Browns             25         New York Jets
5  09/02/1979      Dallas Cowboys             22    St Louis Cardinals
6  09/02/1979   Green Bay Packers              3         Chicago Bears
7  09/02/1979      Houston Oilers             29   Washington Redskins
8  09/02/1979      Miami Dolphins              9         Buffalo Bills
9  09/02/1979     New York Giants             17   Philadelphia Eagles

   Home Score  Line  Total Line
0          31     3        30.0
1          34     5        32.0
2          14     1        37.0
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right= pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print left
print right
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
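By default merge keeps only keys present in both frames (an inner join); the how parameter changes this. A small sketch with new throwaway frames left2 and right2:

left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
print(pd.merge(left2, right2, on='key', how='outer'))  #unmatched keys kept, gaps filled with NaN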
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
rowadd=df.iloc[3]
print rowadd,df
A    0.029758
B   -1.297145
C    0.322048
D   -2.251859
Name: 3, dtype: float64
          A         B         C         D
0  0.096170 -0.100691 -0.083139  0.157201
1 -1.167492 -1.671596 -1.636372  0.895365
2  0.704329 -0.995257 -0.027827  1.043940
3  0.029758 -1.297145  0.322048 -2.251859
4 -0.086081 -0.144922  0.673781 -0.468736
5 -0.196530 -1.329766 -0.142091  0.161405
6  0.870045  1.492161 -1.159585  0.299559
7  1.599507  0.218105  0.400768 -0.442777
df.append(rowadd,ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | 0.096170 | -0.100691 | -0.083139 | 0.157201 |
1 | -1.167492 | -1.671596 | -1.636372 | 0.895365 |
2 | 0.704329 | -0.995257 | -0.027827 | 1.043940 |
3 | 0.029758 | -1.297145 | 0.322048 | -2.251859 |
4 | -0.086081 | -0.144922 | 0.673781 | -0.468736 |
5 | -0.196530 | -1.329766 | -0.142091 | 0.161405 |
6 | 0.870045 | 1.492161 | -1.159585 | 0.299559 |
7 | 1.599507 | 0.218105 | 0.400768 | -0.442777 |
8 | 0.029758 | -1.297145 | 0.322048 | -2.251859 |
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print df
     A      B         C         D
0  foo    one -0.508353 -1.725045
1  bar    one -0.014625  1.706129
2  foo    two  0.041269  1.253173
3  bar  three  1.302055 -1.497082
4  foo    two  0.116896  0.007921
5  bar    two -0.009417 -0.083856
6  foo    one -1.478390 -0.921723
7  foo  three  0.451666 -0.119239
df.groupby('A').sum()
C | D | |
---|---|---|
A | ||
bar | 1.278014 | 0.125191 |
foo | -1.376912 | -1.504914 |
df.groupby(['A','B']).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | -0.014625 | 1.706129 |
three | 1.302055 | -1.497082 | |
two | -0.009417 | -0.083856 | |
foo | one | -1.986743 | -2.646768 |
three | 0.451666 | -0.119239 | |
two | 0.158165 | 1.261093 |
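A groupby can also apply several reductions to one column at once through agg; a quick sketch on the same frame:

print(df.groupby('A')['C'].agg(['mean', 'count']))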
pandas allows for using some built-in statistical methods to compare, fit, or interpolate data.
Regression analysis refers to the process of estimating relationships between variables. Linear regression is equivalent to fitting a line to a set of data points (x, y):
$$y_i = a_0 + a_1 x_i$$
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
df = pd.read_csv(url)
#print df
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
Lottery | Literacy | Wealth | Region | |
---|---|---|---|---|
0 | 41 | 37 | 73 | E |
1 | 38 | 51 | 22 | N |
2 | 66 | 13 | 61 | C |
3 | 80 | 46 | 76 | E |
4 | 79 | 69 | 83 | E |
mod = sm.ols(formula='Lottery ~ Literacy ', data=df)
res = mod.fit()
print res.summary()
intercept, slope =res.params
                            OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.146
Model:                            OLS   Adj. R-squared:                  0.135
Method:                 Least Squares   F-statistic:                     14.16
Date:                Mon, 09 Feb 2015   Prob (F-statistic):           0.000312
Time:                        14:29:19   Log-Likelihood:                -386.13
No. Observations:                  85   AIC:                             776.3
Df Residuals:                      83   BIC:                             781.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     64.2389      6.163     10.423      0.000        51.981    76.497
Literacy      -0.5417      0.144     -3.763      0.000        -0.828    -0.255
==============================================================================
Omnibus:                        7.455   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.024   Jarque-Bera (JB):                2.936
Skew:                           0.061   Prob(JB):                        0.230
Kurtosis:                       2.098   Cond. No.                         106.
==============================================================================
xtest=np.linspace(1,100,100)
ytest=intercept+slope*xtest
plt.plot(df['Literacy'],df['Lottery'],'kx')
plt.plot(xtest,ytest,'r')
plt.show()
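The formula interface extends naturally to multiple predictors, and a categorical variable such as Region can be wrapped in C() to get dummy coding. A sketch, mirroring the example in the statsmodels documentation for this dataset:

mod2 = sm.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df)
res2 = mod2.fit()
print(res2.params)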
The t-test assesses whether the means of two groups are statistically different from each other.
town1_heights = pd.Series([5, 6, 7, 6, 7.1, 6, 4])
town2_heights = pd.Series([5.5, 6.5, 7, 6, 7.1, 6])
town1_mean = town1_heights.mean()
town2_mean = town2_heights.mean()
print "Town 1 avg. height", town1_mean
print "Town 2 avg. height", town2_mean
print "Effect size: ", abs(town1_mean - town2_mean)
df=pd.DataFrame({'T1':town1_heights,'T2':town2_heights})
b=df.boxplot()
Town 1 avg. height 5.87142857143
Town 2 avg. height 6.35
Effect size:  0.478571428571
from scipy import stats
print "Town 1 Shapiro-Wilks p-value", stats.shapiro(town1_heights)[1]
print " T-Test p-value:", stats.ttest_ind(town1_heights, town2_heights,equal_var = False)[1]
Town 1 Shapiro-Wilk p-value 0.380458295345
 T-Test p-value: 0.347028503558
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals.
rng = pd.date_range('1/1/2012', periods=100, freq='S')
print rng
<class 'pandas.tseries.index.DatetimeIndex'> [2012-01-01 00:00:00, ..., 2012-01-01 00:01:39] Length: 100, Freq: S, Timezone: None
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.plot()
<matplotlib.axes.AxesSubplot at 0x10f5b9ed0>
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts=ts.cumsum()
ts.plot()
<matplotlib.axes.AxesSubplot at 0x10f7e3490>
#Bar plot
ts = pd.DataFrame(np.random.randn(1000,5), index=pd.date_range('1/1/2000', periods=1000))
ts=ts.cumsum()
print ts.ix[5]  #row 5; .ix mixes label- and position-based lookup (later deprecated in favor of .loc/.iloc)
ts.ix[5].plot(kind='bar'); plt.axhline(0, color='k')
0   -4.948548
1   -1.926755
2    2.653564
3    0.093642
4   -3.095961
Name: 2000-01-06 00:00:00, dtype: float64
<matplotlib.lines.Line2D at 0x10f7aee10>
Imagine yourself as a sales analyst at an apparel company. Your boss asks you to look at weather data from the past year to understand how the weather varies over the months, so that you can have the right apparel on display at the appropriate time.
You can get the data from here (so your company is Canadian). The template for downloading the data is:
url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
Usually in data tasks there are no precisely specified objectives; one needs to play around with the data in order to derive inferences. While this might seem like a vague and daunting task, it simply requires a start: once you get familiar with the data, you will find patterns and be able to draw an initial set of conclusions.
Let's start with the data for March 2012 (there seems to be less data for the more recent years).
url = url_template.format(month=3, year=2012)
weather_mar2012 = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True, encoding='latin1')
weather_mar2012
Year | Month | Day | Time | Data Quality | Temp (°C) | Temp Flag | Dew Point Temp (°C) | Dew Point Temp Flag | Rel Hum (%) | ... | Wind Spd Flag | Visibility (km) | Visibility Flag | Stn Press (kPa) | Stn Press Flag | Hmdx | Hmdx Flag | Wind Chill | Wind Chill Flag | Weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2012-03-01 00:00:00 | 2012 | 3 | 1 | 00:00 | -5.5 | NaN | -9.7 | NaN | 72 | ... | NaN | 4.0 | NaN | 100.97 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 01:00:00 | 2012 | 3 | 1 | 01:00 | -5.7 | NaN | -8.7 | NaN | 79 | ... | NaN | 2.4 | NaN | 100.87 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 02:00:00 | 2012 | 3 | 1 | 02:00 | -5.4 | NaN | -8.3 | NaN | 80 | ... | NaN | 4.8 | NaN | 100.80 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 03:00:00 | 2012 | 3 | 1 | 03:00 | -4.7 | NaN | -7.7 | NaN | 79 | ... | NaN | 4.0 | NaN | 100.69 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 04:00:00 | 2012 | 3 | 1 | 04:00 | -5.4 | NaN | -7.8 | NaN | 83 | ... | NaN | 1.6 | NaN | 100.62 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 05:00:00 | 2012 | 3 | 1 | 05:00 | -5.3 | NaN | -7.9 | NaN | 82 | ... | NaN | 2.4 | NaN | 100.58 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 06:00:00 | 2012 | 3 | 1 | 06:00 | -5.2 | NaN | -7.8 | NaN | 82 | ... | NaN | 4.0 | NaN | 100.57 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 07:00:00 | 2012 | 3 | 1 | 07:00 | -4.9 | NaN | -7.4 | NaN | 83 | ... | NaN | 1.6 | NaN | 100.59 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 08:00:00 | 2012 | 3 | 1 | 08:00 | -5.0 | NaN | -7.5 | NaN | 83 | ... | NaN | 1.2 | NaN | 100.59 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 09:00:00 | 2012 | 3 | 1 | 09:00 | -4.9 | NaN | -7.5 | NaN | 82 | ... | NaN | 1.6 | NaN | 100.60 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 10:00:00 | 2012 | 3 | 1 | 10:00 | -4.7 | NaN | -7.3 | NaN | 82 | ... | NaN | 1.2 | NaN | 100.62 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 11:00:00 | 2012 | 3 | 1 | 11:00 | -4.4 | NaN | -6.8 | NaN | 83 | ... | NaN | 1.0 | NaN | 100.66 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 12:00:00 | 2012 | 3 | 1 | 12:00 | -4.3 | NaN | -6.8 | NaN | 83 | ... | NaN | 1.2 | NaN | 100.66 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 13:00:00 | 2012 | 3 | 1 | 13:00 | -4.3 | NaN | -6.9 | NaN | 82 | ... | NaN | 1.2 | NaN | 100.65 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 14:00:00 | 2012 | 3 | 1 | 14:00 | -3.9 | NaN | -6.6 | NaN | 81 | ... | NaN | 1.2 | NaN | 100.67 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-01 15:00:00 | 2012 | 3 | 1 | 15:00 | -3.3 | NaN | -6.2 | NaN | 80 | ... | NaN | 1.6 | NaN | 100.71 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 16:00:00 | 2012 | 3 | 1 | 16:00 | -2.7 | NaN | -5.7 | NaN | 80 | ... | NaN | 2.4 | NaN | 100.74 | NaN | NaN | NaN | -8 | NaN | Snow | |
2012-03-01 17:00:00 | 2012 | 3 | 1 | 17:00 | -2.9 | NaN | -5.9 | NaN | 80 | ... | NaN | 4.0 | NaN | 100.80 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 18:00:00 | 2012 | 3 | 1 | 18:00 | -3.0 | NaN | -6.0 | NaN | 80 | ... | NaN | 4.0 | NaN | 100.87 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 19:00:00 | 2012 | 3 | 1 | 19:00 | -3.6 | NaN | -6.4 | NaN | 81 | ... | NaN | 3.2 | NaN | 100.93 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 20:00:00 | 2012 | 3 | 1 | 20:00 | -3.7 | NaN | -6.4 | NaN | 81 | ... | NaN | 4.8 | NaN | 100.95 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 21:00:00 | 2012 | 3 | 1 | 21:00 | -3.9 | NaN | -6.7 | NaN | 81 | ... | NaN | 6.4 | NaN | 100.98 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 22:00:00 | 2012 | 3 | 1 | 22:00 | -4.3 | NaN | -6.9 | NaN | 82 | ... | NaN | 2.4 | NaN | 101.00 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-01 23:00:00 | 2012 | 3 | 1 | 23:00 | -4.3 | NaN | -7.1 | NaN | 81 | ... | NaN | 4.8 | NaN | 101.04 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-02 00:00:00 | 2012 | 3 | 2 | 00:00 | -4.8 | NaN | -7.3 | NaN | 83 | ... | NaN | 3.2 | NaN | 101.04 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 01:00:00 | 2012 | 3 | 2 | 01:00 | -5.3 | NaN | -7.9 | NaN | 82 | ... | NaN | 4.8 | NaN | 101.09 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 02:00:00 | 2012 | 3 | 2 | 02:00 | -5.2 | NaN | -7.8 | NaN | 82 | ... | NaN | 6.4 | NaN | 101.11 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 03:00:00 | 2012 | 3 | 2 | 03:00 | -5.5 | NaN | -7.9 | NaN | 83 | ... | NaN | 4.8 | NaN | 101.15 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 04:00:00 | 2012 | 3 | 2 | 04:00 | -5.6 | NaN | -8.2 | NaN | 82 | ... | NaN | 6.4 | NaN | 101.15 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-02 05:00:00 | 2012 | 3 | 2 | 05:00 | -5.5 | NaN | -8.3 | NaN | 81 | ... | NaN | 12.9 | NaN | 101.15 | NaN | NaN | NaN | -12 | NaN | Snow | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-03-30 18:00:00 | 2012 | 3 | 30 | 18:00 | 3.9 | NaN | -7.9 | NaN | 42 | ... | NaN | 24.1 | NaN | 101.26 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 19:00:00 | 2012 | 3 | 30 | 19:00 | 3.1 | NaN | -6.7 | NaN | 49 | ... | NaN | 25.0 | NaN | 101.29 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 20:00:00 | 2012 | 3 | 30 | 20:00 | 3.0 | NaN | -8.4 | NaN | 43 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 21:00:00 | 2012 | 3 | 30 | 21:00 | 1.7 | NaN | -9.0 | NaN | 45 | ... | NaN | 25.0 | NaN | 101.32 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-30 22:00:00 | 2012 | 3 | 30 | 22:00 | 0.4 | NaN | -8.1 | NaN | 53 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 23:00:00 | 2012 | 3 | 30 | 23:00 | 1.4 | NaN | -7.7 | NaN | 51 | ... | NaN | 25.0 | NaN | 101.34 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 00:00:00 | 2012 | 3 | 31 | 00:00 | 1.5 | NaN | -8.6 | NaN | 47 | ... | NaN | 25.0 | NaN | 101.33 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 01:00:00 | 2012 | 3 | 31 | 01:00 | 1.3 | NaN | -9.6 | NaN | 44 | ... | NaN | 25.0 | NaN | 101.31 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 02:00:00 | 2012 | 3 | 31 | 02:00 | 1.3 | NaN | -9.7 | NaN | 44 | ... | NaN | 25.0 | NaN | 101.29 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 03:00:00 | 2012 | 3 | 31 | 03:00 | 0.7 | NaN | -8.8 | NaN | 49 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 04:00:00 | 2012 | 3 | 31 | 04:00 | -0.9 | NaN | -8.5 | NaN | 56 | ... | NaN | 25.0 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 05:00:00 | 2012 | 3 | 31 | 05:00 | -0.6 | NaN | -9.2 | NaN | 52 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 06:00:00 | 2012 | 3 | 31 | 06:00 | -0.5 | NaN | -9.2 | NaN | 52 | ... | NaN | 48.3 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 07:00:00 | 2012 | 3 | 31 | 07:00 | -0.3 | NaN | -9.2 | NaN | 51 | ... | NaN | 48.3 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 08:00:00 | 2012 | 3 | 31 | 08:00 | 0.7 | NaN | -8.5 | NaN | 50 | ... | NaN | 48.3 | NaN | 101.33 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 09:00:00 | 2012 | 3 | 31 | 09:00 | 1.5 | NaN | -7.8 | NaN | 50 | ... | NaN | 48.3 | NaN | 101.34 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 10:00:00 | 2012 | 3 | 31 | 10:00 | 2.9 | NaN | -8.1 | NaN | 44 | ... | NaN | 48.3 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 11:00:00 | 2012 | 3 | 31 | 11:00 | 4.6 | NaN | -9.7 | NaN | 35 | ... | NaN | 48.3 | NaN | 101.24 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 12:00:00 | 2012 | 3 | 31 | 12:00 | 6.4 | NaN | -7.1 | NaN | 37 | ... | NaN | 48.3 | NaN | 101.16 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 13:00:00 | 2012 | 3 | 31 | 13:00 | 6.5 | NaN | -9.7 | NaN | 30 | ... | NaN | 48.3 | NaN | 101.08 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 14:00:00 | 2012 | 3 | 31 | 14:00 | 7.7 | NaN | -8.5 | NaN | 31 | ... | NaN | 48.3 | NaN | 101.01 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 15:00:00 | 2012 | 3 | 31 | 15:00 | 7.7 | NaN | -8.6 | NaN | 30 | ... | NaN | 48.3 | NaN | 100.94 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 16:00:00 | 2012 | 3 | 31 | 16:00 | 8.4 | NaN | -7.7 | NaN | 31 | ... | NaN | 48.3 | NaN | 100.89 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 17:00:00 | 2012 | 3 | 31 | 17:00 | 7.9 | NaN | -8.1 | NaN | 31 | ... | NaN | 48.3 | NaN | 100.88 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 18:00:00 | 2012 | 3 | 31 | 18:00 | 7.0 | NaN | -8.2 | NaN | 33 | ... | NaN | 48.3 | NaN | 100.87 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 19:00:00 | 2012 | 3 | 31 | 19:00 | 5.9 | NaN | -8.0 | NaN | 36 | ... | NaN | 25.0 | NaN | 100.88 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 20:00:00 | 2012 | 3 | 31 | 20:00 | 4.4 | NaN | -7.2 | NaN | 43 | ... | NaN | 25.0 | NaN | 100.85 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 21:00:00 | 2012 | 3 | 31 | 21:00 | 2.6 | NaN | -6.3 | NaN | 52 | ... | NaN | 25.0 | NaN | 100.86 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 22:00:00 | 2012 | 3 | 31 | 22:00 | 2.7 | NaN | -6.7 | NaN | 50 | ... | NaN | 25.0 | NaN | 100.82 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 23:00:00 | 2012 | 3 | 31 | 23:00 | 1.5 | NaN | -6.9 | NaN | 54 | ... | NaN | 25.0 | NaN | 100.79 | NaN | NaN | NaN | NaN | NaN | Clear |
744 rows × 24 columns
We are only interested in the temperatures, so let's plot that column for March 2012. But the Temp column name contains special characters (a degree symbol), which might be painful to reuse, so let's fix that first!
weather_mar2012.columns = [
u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)',
u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag',
u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag',
u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill',
u'Wind Chill Flag', u'Weather']
weather_mar2012[u'Temp (C)'].plot(figsize=(15, 5))
<matplotlib.axes.AxesSubplot at 0x10fe28250>
There are also many columns consisting largely of NA values. They are of no use in our analyses, so we can go ahead and drop every column that contains any missing values.
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')
weather_mar2012
Year | Month | Day | Time | Data Quality | Temp (C) | Dew Point Temp (C) | Rel Hum (%) | Wind Spd (km/h) | Visibility (km) | Stn Press (kPa) | Weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | ||||||||||||
2012-03-01 00:00:00 | 2012 | 3 | 1 | 00:00 | -5.5 | -9.7 | 72 | 24 | 4.0 | 100.97 | Snow | |
2012-03-01 01:00:00 | 2012 | 3 | 1 | 01:00 | -5.7 | -8.7 | 79 | 26 | 2.4 | 100.87 | Snow | |
2012-03-01 02:00:00 | 2012 | 3 | 1 | 02:00 | -5.4 | -8.3 | 80 | 28 | 4.8 | 100.80 | Snow | |
2012-03-01 03:00:00 | 2012 | 3 | 1 | 03:00 | -4.7 | -7.7 | 79 | 28 | 4.0 | 100.69 | Snow | |
2012-03-01 04:00:00 | 2012 | 3 | 1 | 04:00 | -5.4 | -7.8 | 83 | 35 | 1.6 | 100.62 | Snow | |
2012-03-01 05:00:00 | 2012 | 3 | 1 | 05:00 | -5.3 | -7.9 | 82 | 33 | 2.4 | 100.58 | Snow | |
2012-03-01 06:00:00 | 2012 | 3 | 1 | 06:00 | -5.2 | -7.8 | 82 | 33 | 4.0 | 100.57 | Snow | |
2012-03-01 07:00:00 | 2012 | 3 | 1 | 07:00 | -4.9 | -7.4 | 83 | 30 | 1.6 | 100.59 | Snow | |
2012-03-01 08:00:00 | 2012 | 3 | 1 | 08:00 | -5.0 | -7.5 | 83 | 32 | 1.2 | 100.59 | Snow | |
2012-03-01 09:00:00 | 2012 | 3 | 1 | 09:00 | -4.9 | -7.5 | 82 | 32 | 1.6 | 100.60 | Snow | |
2012-03-01 10:00:00 | 2012 | 3 | 1 | 10:00 | -4.7 | -7.3 | 82 | 32 | 1.2 | 100.62 | Snow | |
2012-03-01 11:00:00 | 2012 | 3 | 1 | 11:00 | -4.4 | -6.8 | 83 | 28 | 1.0 | 100.66 | Snow | |
2012-03-01 12:00:00 | 2012 | 3 | 1 | 12:00 | -4.3 | -6.8 | 83 | 30 | 1.2 | 100.66 | Snow | |
2012-03-01 13:00:00 | 2012 | 3 | 1 | 13:00 | -4.3 | -6.9 | 82 | 28 | 1.2 | 100.65 | Snow | |
2012-03-01 14:00:00 | 2012 | 3 | 1 | 14:00 | -3.9 | -6.6 | 81 | 28 | 1.2 | 100.67 | Snow | |
2012-03-01 15:00:00 | 2012 | 3 | 1 | 15:00 | -3.3 | -6.2 | 80 | 24 | 1.6 | 100.71 | Snow | |
2012-03-01 16:00:00 | 2012 | 3 | 1 | 16:00 | -2.7 | -5.7 | 80 | 19 | 2.4 | 100.74 | Snow | |
2012-03-01 17:00:00 | 2012 | 3 | 1 | 17:00 | -2.9 | -5.9 | 80 | 20 | 4.0 | 100.80 | Snow | |
2012-03-01 18:00:00 | 2012 | 3 | 1 | 18:00 | -3.0 | -6.0 | 80 | 19 | 4.0 | 100.87 | Snow | |
2012-03-01 19:00:00 | 2012 | 3 | 1 | 19:00 | -3.6 | -6.4 | 81 | 17 | 3.2 | 100.93 | Snow | |
2012-03-01 20:00:00 | 2012 | 3 | 1 | 20:00 | -3.7 | -6.4 | 81 | 20 | 4.8 | 100.95 | Snow | |
2012-03-01 21:00:00 | 2012 | 3 | 1 | 21:00 | -3.9 | -6.7 | 81 | 22 | 6.4 | 100.98 | Snow | |
2012-03-01 22:00:00 | 2012 | 3 | 1 | 22:00 | -4.3 | -6.9 | 82 | 22 | 2.4 | 101.00 | Snow | |
2012-03-01 23:00:00 | 2012 | 3 | 1 | 23:00 | -4.3 | -7.1 | 81 | 22 | 4.8 | 101.04 | Snow | |
2012-03-02 00:00:00 | 2012 | 3 | 2 | 00:00 | -4.8 | -7.3 | 83 | 22 | 3.2 | 101.04 | Snow | |
2012-03-02 01:00:00 | 2012 | 3 | 2 | 01:00 | -5.3 | -7.9 | 82 | 20 | 4.8 | 101.09 | Snow | |
2012-03-02 02:00:00 | 2012 | 3 | 2 | 02:00 | -5.2 | -7.8 | 82 | 19 | 6.4 | 101.11 | Snow | |
2012-03-02 03:00:00 | 2012 | 3 | 2 | 03:00 | -5.5 | -7.9 | 83 | 19 | 4.8 | 101.15 | Snow | |
2012-03-02 04:00:00 | 2012 | 3 | 2 | 04:00 | -5.6 | -8.2 | 82 | 24 | 6.4 | 101.15 | Snow | |
2012-03-02 05:00:00 | 2012 | 3 | 2 | 05:00 | -5.5 | -8.3 | 81 | 19 | 12.9 | 101.15 | Snow | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-03-30 18:00:00 | 2012 | 3 | 30 | 18:00 | 3.9 | -7.9 | 42 | 11 | 24.1 | 101.26 | Mostly Cloudy | |
2012-03-30 19:00:00 | 2012 | 3 | 30 | 19:00 | 3.1 | -6.7 | 49 | 7 | 25.0 | 101.29 | Mostly Cloudy | |
2012-03-30 20:00:00 | 2012 | 3 | 30 | 20:00 | 3.0 | -8.4 | 43 | 7 | 25.0 | 101.30 | Mostly Cloudy | |
2012-03-30 21:00:00 | 2012 | 3 | 30 | 21:00 | 1.7 | -9.0 | 45 | 4 | 25.0 | 101.32 | Cloudy | |
2012-03-30 22:00:00 | 2012 | 3 | 30 | 22:00 | 0.4 | -8.1 | 53 | 0 | 25.0 | 101.30 | Mostly Cloudy | |
2012-03-30 23:00:00 | 2012 | 3 | 30 | 23:00 | 1.4 | -7.7 | 51 | 6 | 25.0 | 101.34 | Mainly Clear | |
2012-03-31 00:00:00 | 2012 | 3 | 31 | 00:00 | 1.5 | -8.6 | 47 | 13 | 25.0 | 101.33 | Mostly Cloudy | |
2012-03-31 01:00:00 | 2012 | 3 | 31 | 01:00 | 1.3 | -9.6 | 44 | 13 | 25.0 | 101.31 | Mostly Cloudy | |
2012-03-31 02:00:00 | 2012 | 3 | 31 | 02:00 | 1.3 | -9.7 | 44 | 11 | 25.0 | 101.29 | Cloudy | |
2012-03-31 03:00:00 | 2012 | 3 | 31 | 03:00 | 0.7 | -8.8 | 49 | 13 | 25.0 | 101.30 | Cloudy | |
2012-03-31 04:00:00 | 2012 | 3 | 31 | 04:00 | -0.9 | -8.5 | 56 | 13 | 25.0 | 101.32 | Cloudy | |
2012-03-31 05:00:00 | 2012 | 3 | 31 | 05:00 | -0.6 | -9.2 | 52 | 13 | 25.0 | 101.30 | Cloudy | |
2012-03-31 06:00:00 | 2012 | 3 | 31 | 06:00 | -0.5 | -9.2 | 52 | 15 | 48.3 | 101.32 | Cloudy | |
2012-03-31 07:00:00 | 2012 | 3 | 31 | 07:00 | -0.3 | -9.2 | 51 | 19 | 48.3 | 101.32 | Cloudy | |
2012-03-31 08:00:00 | 2012 | 3 | 31 | 08:00 | 0.7 | -8.5 | 50 | 17 | 48.3 | 101.33 | Cloudy | |
2012-03-31 09:00:00 | 2012 | 3 | 31 | 09:00 | 1.5 | -7.8 | 50 | 17 | 48.3 | 101.34 | Mostly Cloudy | |
2012-03-31 10:00:00 | 2012 | 3 | 31 | 10:00 | 2.9 | -8.1 | 44 | 15 | 48.3 | 101.30 | Mainly Clear | |
2012-03-31 11:00:00 | 2012 | 3 | 31 | 11:00 | 4.6 | -9.7 | 35 | 7 | 48.3 | 101.24 | Clear | |
2012-03-31 12:00:00 | 2012 | 3 | 31 | 12:00 | 6.4 | -7.1 | 37 | 11 | 48.3 | 101.16 | Clear | |
2012-03-31 13:00:00 | 2012 | 3 | 31 | 13:00 | 6.5 | -9.7 | 30 | 9 | 48.3 | 101.08 | Clear | |
2012-03-31 14:00:00 | 2012 | 3 | 31 | 14:00 | 7.7 | -8.5 | 31 | 4 | 48.3 | 101.01 | Mainly Clear | |
2012-03-31 15:00:00 | 2012 | 3 | 31 | 15:00 | 7.7 | -8.6 | 30 | 6 | 48.3 | 100.94 | Mainly Clear | |
2012-03-31 16:00:00 | 2012 | 3 | 31 | 16:00 | 8.4 | -7.7 | 31 | 4 | 48.3 | 100.89 | Mainly Clear | |
2012-03-31 17:00:00 | 2012 | 3 | 31 | 17:00 | 7.9 | -8.1 | 31 | 6 | 48.3 | 100.88 | Mainly Clear | |
2012-03-31 18:00:00 | 2012 | 3 | 31 | 18:00 | 7.0 | -8.2 | 33 | 7 | 48.3 | 100.87 | Mainly Clear | |
2012-03-31 19:00:00 | 2012 | 3 | 31 | 19:00 | 5.9 | -8.0 | 36 | 4 | 25.0 | 100.88 | Clear | |
2012-03-31 20:00:00 | 2012 | 3 | 31 | 20:00 | 4.4 | -7.2 | 43 | 9 | 25.0 | 100.85 | Clear | |
2012-03-31 21:00:00 | 2012 | 3 | 31 | 21:00 | 2.6 | -6.3 | 52 | 7 | 25.0 | 100.86 | Clear | |
2012-03-31 22:00:00 | 2012 | 3 | 31 | 22:00 | 2.7 | -6.7 | 50 | 0 | 25.0 | 100.82 | Clear | |
2012-03-31 23:00:00 | 2012 | 3 | 31 | 23:00 | 1.5 | -6.9 | 54 | 0 | 25.0 | 100.79 | Clear |
744 rows × 12 columns
We managed to clean up some of the data for March 2012. That's great, but we are also interested in the entire year's data.
## Pandas cookbook
def download_weather_month(year, month):
    if month == 1:
        #January data is filed under the following year (note the 2013-01-31 entry in the monthly output below)
        year += 1
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True)
    weather_data = weather_data.dropna(axis=1)
    #strip the latin-1 degree symbol from column names
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
    return weather_data
data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]
#Saving to a csv
weather_2012 = pd.concat(data_by_month)
weather_2012.to_csv('weather_2012.csv')
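As a sanity check (hypothetical, not part of the original run), the saved file can be read straight back with the same index settings:

weather_2012 = pd.read_csv('weather_2012.csv', index_col='Date/Time', parse_dates=True)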
pandas provides vectorized string functions to make it easy to operate on columns containing text.
weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
is_snowing.plot()
<matplotlib.axes.AxesSubplot at 0x1101e9850>
Let's now try to find the month in which it snowed the most, so that your company can stock extra down jackets for that month.
weather_2012['Temp (C)'].resample('M', how=np.median).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x110c540d0>
is_snowing.astype(float).resample('M', how=np.mean)
Date/Time
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
2013-01-31    0.197581
Freq: M, Name: Weather, dtype: float64
is_snowing.astype(float).resample('M', how=np.mean).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x110bb17d0>
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print df
#df.plot(kind='bar')
#df.plot(kind='bar', stacked=True)
#df.plot(kind='barh', stacked=True)
#print pd.__version__
          a         b         c         d
0  0.025943  0.770927  0.248177  0.814551
1  0.288738  0.097600  0.760169  0.612650
2  0.573980  0.047479  0.563015  0.187549
3  0.772887  0.152365  0.956512  0.201353
4  0.205687  0.157111  0.307448  0.514710
5  0.387677  0.512868  0.302177  0.391439
6  0.106364  0.453050  0.390067  0.932892
7  0.588174  0.528449  0.281689  0.425967
8  0.645613  0.104521  0.660080  0.838605
9  0.200220  0.986742  0.184672  0.943926
from pandas.tools.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df, figsize=(7, 7), diagonal='kde')
array([[<matplotlib.axes.AxesSubplot object at 0x11147a0d0>, <matplotlib.axes.AxesSubplot object at 0x1114af6d0>, <matplotlib.axes.AxesSubplot object at 0x111620510>, <matplotlib.axes.AxesSubplot object at 0x11163dd90>], [<matplotlib.axes.AxesSubplot object at 0x1117a3610>, <matplotlib.axes.AxesSubplot object at 0x111648450>, <matplotlib.axes.AxesSubplot object at 0x111666690>, <matplotlib.axes.AxesSubplot object at 0x111689d10>], [<matplotlib.axes.AxesSubplot object at 0x11164c890>, <matplotlib.axes.AxesSubplot object at 0x112420290>, <matplotlib.axes.AxesSubplot object at 0x1124419d0>, <matplotlib.axes.AxesSubplot object at 0x11245e650>], [<matplotlib.axes.AxesSubplot object at 0x112485210>, <matplotlib.axes.AxesSubplot object at 0x11246abd0>, <matplotlib.axes.AxesSubplot object at 0x1124c7bd0>, <matplotlib.axes.AxesSubplot object at 0x112d1d0d0>]], dtype=object)
from pandas import read_csv
from urllib import urlopen
from pandas.tools.plotting import andrews_curves
page = urlopen("https://raw.githubusercontent.com/pydata/pandas/master/pandas/tests/data/iris.csv")
df = read_csv(page)
andrews_curves(df, 'Name')
from pandas.tools.plotting import parallel_coordinates
#parallel_coordinates(df,'Name')
from pandas.tools.plotting import lag_plot
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))
lag_plot(data)
<matplotlib.axes.AxesSubplot at 0x113ab0d10>
Neal Davis and Lakshmi Rao developed these materials for Computational Science and Engineering at the University of Illinois at Urbana–Champaign.
This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).