In God we trust, all others bring data. - The Elements of Statistical Learning
Big Data, Data Analytics, Data Science: these are the common buzzwords of the data world, so much so that data has been called the "new oil". There are excellent data-specific programming tools such as SAS, R, and Hadoop. Still, a more general scripting language like Python is attractive for data analysis because it allows you to combine data tasks with scientific programming.
One major issue for statistical programmers using Python has, in the past, been the lack of libraries implementing standard models and of a cohesive framework for specifying them. Pandas, the data analysis library which has been in development since 2008, aims to bridge this gap.
Pandas derives its name from panel data, a term commonly used in statistics and econometrics for multi-dimensional datasets.
Data analysis is only as good as its visualization. Today we will work through a number of datasets together with Python's plotting library, matplotlib, to illustrate what we learn. We begin with the standard imports:
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
pd.set_option('display.mpl_style', 'default')
#IPython magic command for inline plotting
%matplotlib inline
#a better plot shape for IPython
mpl.rcParams['figure.figsize']=[15,3]
Matplotlib is the primary plotting library in Python; a separate notebook in a subsequent session will be dedicated to its features. For plotting with pandas today, we only touch on the very basics of matplotlib.
x = np.linspace(0, 1, 10001)[1:]  #skip x = 0 so that np.pi/x below does not divide by zero
y = np.cos(np.pi/x) * np.exp(-x**2)
plt.plot(x, y)
plt.show()
x=np.linspace(-1, 2, 10001)
y = x**2*np.exp(-x)
plt.plot(x, y)
plt.show()
The pandas data analysis module provides data structures and tools for data analysis. It focuses on data handling and manipulation, as well as linear and panel regression. It is designed to let you carry out your entire data workflow in Python without having to switch to a domain-specific language such as R. Although largely compatible with NumPy/SciPy, pandas differs in some important ways in indexing, data organization, and features. The basic pandas data types are not ndarray but Series and DataFrame, which allow you to index data and align axes efficiently.
A Series object is a one-dimensional array which can hold any data type. Like a dictionary, it has a set of indices for access (like keys); unlike a dictionary, it is ordered. Data alignment is intrinsic and will not be broken unless you break it explicitly. A Series is very similar to a NumPy ndarray, and an arbitrary list of values or axis labels can be used as the index (so it can also act something like a dict).
s = pd.Series([1,5,float('NaN'),7.5,2.1,3])
print(s)
0    1.0
1    5.0
2    NaN
3    7.5
4    2.1
5    3.0
dtype: float64
dates = pd.date_range('20140201', periods=s.size)
s.index = dates
print(s)
2014-02-01    1.0
2014-02-02    5.0
2014-02-03    NaN
2014-02-04    7.5
2014-02-05    2.1
2014-02-06    3.0
Freq: D, dtype: float64
letters = ['A', 'B', 'Ch', '#', '#', '---']
s.index = letters
print(s)
print('\nAccess is like a dictionary key:\ns[\'---\'] = '+str(s['---']))
print('\nRepeat labels are possible:\ns[\'#\']=\n'+str(s['#']))
A      1.0
B      5.0
Ch     NaN
#      7.5
#      2.1
---    3.0
dtype: float64

Access is like a dictionary key:
s['---'] = 3.0

Repeat labels are possible:
s['#']=
#    7.5
#    2.1
dtype: float64
NumPy functions expecting an ndarray often do just fine with Series as well.
t = np.exp(s)
print(t)
A         2.718282
B       148.413159
Ch             NaN
#      1808.042414
#         8.166170
---      20.085537
dtype: float64
Series is equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.upper()
0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
s.str.len()
0     1
1     1
2     1
3     4
4     4
5   NaN
6     4
7     3
8     3
dtype: float64
s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
print s2
0    a_b_c
1    c_d_e
2      NaN
3    f_g_h
dtype: object
s2.str.split('_')
0    [a, b, c]
1    [c, d, e]
2          NaN
3    [f, g, h]
dtype: object
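The elements of the resulting lists can be pulled out with str.get (described in the table below); a quick sketch, grabbing the middle token from each row:

s2.str.split('_').str.get(1)

0      b
1      d
2    NaN
3      g
dtype: object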
Method | Description |
---|---|
cat | Concatenate strings |
split | Split strings on delimiter |
get | Index into each element (retrieve i-th element) |
join | Join strings in each element of the Series with passed separator |
contains | Return boolean array indicating whether each string contains pattern/regex |
replace | Replace occurrences of pattern/regex with some other string |
repeat | Duplicate values (s.str.repeat(3) equivalent to x * 3) |
pad | Add whitespace to left, right, or both sides of strings |
center | Equivalent to pad(side='both') |
wrap | Split long strings into lines with length less than a given width |
slice | Slice each string in the Series |
slice_replace | Replace slice in each string with passed value |
count | Count occurrences of pattern |
startswith | Equivalent to str.startswith(pat) for each element |
endswith | Equivalent to str.endswith(pat) for each element |
findall | Compute list of all occurrences of pattern/regex for each string |
match | Call re.match on each element, returning matched groups as list |
extract | Call re.match on each element, as match does, but return matched groups as strings for convenience. |
len | Compute string lengths |
strip | Equivalent to str.strip |
rstrip | Equivalent to str.rstrip |
lstrip | Equivalent to str.lstrip |
lower | Equivalent to str.lower |
upper | Equivalent to str.upper |
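A few of these in action (a minimal sketch; s3 is a new throwaway Series introduced here for illustration):

s3 = pd.Series(['  cat', 'dog  ', np.nan, 'catfish'])
print(s3.str.strip())             #trim whitespace from both sides
print(s3.str.contains('cat'))     #True/False per element; NaN stays NaN
print(s3.str.replace('cat', 'hat'))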
In most data scenarios, you will receive a comma-separated values (csv) file on which you need to perform your analysis. Reading a csv file into Python can be achieved with the read_csv function. We will use the data from this website, about how many people were on 7 different bike paths in Montreal each day. Let's use the data from 2012.
broken_df = pd.read_csv('2012.csv')
#Look at the first 4 rows
broken_df[:4]
Date | Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 01/01/2012 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
1 | 02/01/2012 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
2 | 03/01/2012 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
3 | 04/01/2012 | 144 | NaN | 1 | 116 | 318 | 111 | 8 | 61 | NaN |
fixed_df = pd.read_csv('2012.csv', index_col='Date')
fixed_df[:3]
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
01/01/2012 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
02/01/2012 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
03/01/2012 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
When we read the csv file into broken_df, we created a two-dimensional data structure called a DataFrame. The DataFrame object is similar to a table or a spreadsheet in Excel, i.e. a 2D matrix-like object.
s = pd.Series([1,5,float('NaN'),7.5,2.1,3])
df = pd.DataFrame(s, columns=['x'])
print(df)
     x
0  1.0
1  5.0
2  NaN
3  7.5
4  2.1
5  3.0
t=np.exp(s)
df['exp(x)'] = t
df['exp(exp(x))'] = np.exp(t)
print(df)
     x       exp(x)   exp(exp(x))
0  1.0     2.718282  1.515426e+01
1  5.0   148.413159  2.851124e+64
2  NaN          NaN           NaN
3  7.5  1808.042414           inf
4  2.1     8.166170  3.519837e+03
5  3.0    20.085537  5.284913e+08
There are a number of ways to access the elements of a DataFrame.
print(df['x'], '\n') #column; in Python 2 (without importing print_function) this prints a tuple, as seen in the output below
#letters = ['A', 'B', 'Ch', '#', '#', '---']
#df.index=letters
#print(df.loc['#'], '\n') #row by label
#print(df.iloc[3], '\n') #row by number (note the transposition in output!)
print(df[1:4]) #row by slice
(0    1.0
1    5.0
2    NaN
3    7.5
4    2.1
5    3.0
Name: x, dtype: float64, '\n')
     x       exp(x)   exp(exp(x))
1  5.0   148.413159  2.851124e+64
2  NaN          NaN           NaN
3  7.5  1808.042414           inf
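For label-based and positional row access, .loc and .iloc behave as in the commented-out lines above; with the default integer index the two coincide. A quick sketch:

print(df.loc[2])   #row whose index label is 2
print(df.iloc[2])  #third row by position; the same row here, since the index is 0..5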
df1=pd.DataFrame(np.random.randn(dates.size,4),index=dates,columns=list('ABCD'))
print df1
                   A         B         C         D
2014-02-01 -1.088830 -0.843649  0.923378 -0.857428
2014-02-02 -0.170466 -0.381519  0.437727 -0.146664
2014-02-03 -1.127080  0.098241  3.094603  0.250264
2014-02-04 -0.475779  0.803705 -0.216043  0.305970
2014-02-05  0.163283  1.145404  1.486144  0.894497
2014-02-06 -0.329314  0.235182 -0.552184 -0.436983
Using the DataFrame df1 created above, perform the following operations:
1. df1.head() and df1.tail()
2. df1.describe()
3. df1.T
4. df1.sort(columns='B')
5. df1.columns, df1.index, df1.values
df1.sort(columns=list('B'))
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.088830 | -0.843649 | 0.923378 | -0.857428 |
2014-02-02 | -0.170466 | -0.381519 | 0.437727 | -0.146664 |
2014-02-03 | -1.127080 | 0.098241 | 3.094603 | 0.250264 |
2014-02-06 | -0.329314 | 0.235182 | -0.552184 | -0.436983 |
2014-02-04 | -0.475779 | 0.803705 | -0.216043 | 0.305970 |
2014-02-05 | 0.163283 | 1.145404 | 1.486144 | 0.894497 |
Now let us look at the cyclist DataFrame we created. To extract a column from the DataFrame:
fixed_df['Berri 1']
Date
01/01/2012      35
02/01/2012      83
03/01/2012     135
04/01/2012     144
05/01/2012     197
06/01/2012     146
07/01/2012      98
08/01/2012      95
09/01/2012     244
10/01/2012     397
11/01/2012     273
12/01/2012     157
13/01/2012      75
14/01/2012      32
15/01/2012      54
...
22/10/2012    3650
23/10/2012    4177
24/10/2012    3744
25/10/2012    3735
26/10/2012    4290
27/10/2012    1857
28/10/2012    1310
29/10/2012    2919
30/10/2012    2887
31/10/2012    2634
01/11/2012    2405
02/11/2012    1582
03/11/2012     844
04/11/2012     966
05/11/2012    2247
Name: Berri 1, Length: 310, dtype: int64
We can use Boolean indexing on columns to extract the rows satisfying our desired conditions. For example, to extract all data from the cyclist data set where the value in the column Berri 1 is greater than 1000:
fixed_df[fixed_df['Berri 1'] > 1000]
Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) | |
---|---|---|---|---|---|---|---|---|---|
Date | |||||||||
18/03/2012 | 1940 | NaN | 856 | 1036 | 1923 | 1021 | 1128 | 2477 | NaN |
19/03/2012 | 1821 | NaN | 1024 | 1278 | 2581 | 1609 | 506 | 2058 | NaN |
20/03/2012 | 2481 | NaN | 1261 | 1709 | 3130 | 1955 | 762 | 2609 | NaN |
21/03/2012 | 2829 | NaN | 1558 | 1893 | 3510 | 2225 | 993 | 2846 | NaN |
22/03/2012 | 2195 | NaN | 1030 | 1640 | 2654 | 1958 | 548 | 2254 | NaN |
23/03/2012 | 2115 | NaN | 1143 | 1512 | 2955 | 1791 | 663 | 2325 | NaN |
27/03/2012 | 1049 | NaN | 517 | 774 | 1576 | 972 | 163 | 1207 | NaN |
30/03/2012 | 1157 | NaN | 529 | 910 | 1596 | 957 | 196 | 1288 | NaN |
02/04/2012 | 1937 | NaN | 967 | 1537 | 2853 | 1614 | 394 | 2122 | NaN |
03/04/2012 | 2416 | NaN | 1078 | 1791 | 3556 | 1880 | 513 | 2450 | NaN |
04/04/2012 | 2211 | NaN | 933 | 1674 | 2956 | 1666 | 274 | 2242 | NaN |
05/04/2012 | 2424 | NaN | 1036 | 1823 | 3273 | 1699 | 355 | 2463 | NaN |
06/04/2012 | 1633 | NaN | 650 | 1045 | 1913 | 975 | 621 | 2138 | NaN |
07/04/2012 | 1208 | NaN | 494 | 739 | 1445 | 709 | 598 | 1566 | NaN |
08/04/2012 | 1164 | NaN | 560 | 621 | 1333 | 704 | 792 | 1533 | NaN |
10/04/2012 | 2183 | NaN | 909 | 1588 | 2932 | 1736 | 252 | 2108 | NaN |
11/04/2012 | 2328 | NaN | 1049 | 1765 | 3122 | 1843 | 330 | 2311 | NaN |
12/04/2012 | 3064 | NaN | 1483 | 2306 | 4076 | 2280 | 590 | 3213 | NaN |
13/04/2012 | 3341 | NaN | 1505 | 2565 | 4465 | 2358 | 922 | 3728 | NaN |
14/04/2012 | 2890 | NaN | 1072 | 1639 | 2994 | 1594 | 1284 | 3428 | NaN |
15/04/2012 | 2554 | NaN | 1210 | 1637 | 2954 | 1559 | 1846 | 3604 | NaN |
16/04/2012 | 3643 | NaN | 1841 | 2723 | 4830 | 2677 | 1061 | 3616 | NaN |
17/04/2012 | 3539 | NaN | 1616 | 2636 | 4592 | 2450 | 544 | 3333 | NaN |
18/04/2012 | 3570 | NaN | 1751 | 2759 | 4655 | 2534 | 706 | 3542 | NaN |
19/04/2012 | 4231 | NaN | 2010 | 3235 | 5311 | 2877 | 1206 | 3929 | NaN |
20/04/2012 | 2087 | NaN | 800 | 1529 | 2922 | 1531 | 170 | 2065 | NaN |
22/04/2012 | 1853 | NaN | 487 | 1224 | 1331 | 654 | 198 | 1779 | NaN |
24/04/2012 | 1810 | NaN | 720 | 1355 | 2379 | 1286 | 188 | 1753 | NaN |
25/04/2012 | 2966 | NaN | 1023 | 2228 | 3444 | 1800 | 445 | 2454 | NaN |
26/04/2012 | 2751 | NaN | 1069 | 2196 | 3546 | 1789 | 381 | 2438 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
04/10/2012 | 4034 | NaN | 2025 | 2705 | 4850 | 3066 | 555 | 3418 | NaN |
05/10/2012 | 4151 | NaN | 1977 | 2799 | 4688 | 2844 | 1035 | 4088 | NaN |
06/10/2012 | 1304 | NaN | 469 | 933 | 1589 | 776 | 236 | 1775 | NaN |
07/10/2012 | 1580 | NaN | 660 | 922 | 1629 | 860 | 695 | 2052 | NaN |
08/10/2012 | 1854 | NaN | 880 | 987 | 1818 | 1040 | 1115 | 2502 | NaN |
09/10/2012 | 4787 | NaN | 2210 | 3026 | 5138 | 3418 | 927 | 4078 | NaN |
10/10/2012 | 3115 | NaN | 1537 | 2081 | 3681 | 2608 | 560 | 2703 | NaN |
11/10/2012 | 3746 | NaN | 1857 | 2569 | 4694 | 3034 | 558 | 3457 | NaN |
12/10/2012 | 3169 | NaN | 1460 | 2261 | 4045 | 2564 | 448 | 3224 | NaN |
13/10/2012 | 1783 | NaN | 802 | 1205 | 2113 | 1183 | 681 | 2309 | NaN |
15/10/2012 | 3292 | NaN | 1678 | 2165 | 4197 | 2754 | 560 | 3183 | NaN |
16/10/2012 | 3739 | NaN | 1858 | 2684 | 4681 | 2997 | 554 | 3593 | NaN |
17/10/2012 | 4098 | NaN | 1964 | 2645 | 4836 | 3063 | 728 | 3834 | NaN |
18/10/2012 | 4671 | NaN | 2292 | 3129 | 5542 | 3477 | 1108 | 4245 | NaN |
19/10/2012 | 1313 | NaN | 597 | 885 | 1668 | 1209 | 111 | 1486 | NaN |
20/10/2012 | 2011 | NaN | 748 | 1323 | 2266 | 1213 | 797 | 2243 | NaN |
21/10/2012 | 1277 | NaN | 609 | 869 | 1777 | 898 | 242 | 1648 | NaN |
22/10/2012 | 3650 | NaN | 1819 | 2495 | 4800 | 3023 | 757 | 3721 | NaN |
23/10/2012 | 4177 | NaN | 1997 | 2795 | 5216 | 3233 | 795 | 3554 | NaN |
24/10/2012 | 3744 | NaN | 1868 | 2625 | 4900 | 3035 | 649 | 3622 | NaN |
25/10/2012 | 3735 | NaN | 1815 | 2528 | 5010 | 3017 | 631 | 3767 | NaN |
26/10/2012 | 4290 | NaN | 1987 | 2754 | 5246 | 3000 | 1456 | 4578 | NaN |
27/10/2012 | 1857 | NaN | 792 | 1244 | 2461 | 1193 | 618 | 2471 | NaN |
28/10/2012 | 1310 | NaN | 697 | 910 | 1776 | 955 | 387 | 1876 | NaN |
29/10/2012 | 2919 | NaN | 1458 | 2071 | 3768 | 2440 | 411 | 2795 | NaN |
30/10/2012 | 2887 | NaN | 1251 | 2007 | 3516 | 2255 | 338 | 2790 | NaN |
31/10/2012 | 2634 | NaN | 1294 | 1835 | 3453 | 2220 | 245 | 2570 | NaN |
01/11/2012 | 2405 | NaN | 1208 | 1701 | 3082 | 2076 | 165 | 2461 | NaN |
02/11/2012 | 1582 | NaN | 737 | 1109 | 2277 | 1392 | 97 | 1888 | NaN |
05/11/2012 | 2247 | NaN | 1170 | 1705 | 3221 | 2143 | 179 | 2430 | NaN |
218 rows × 9 columns
df = pd.DataFrame(np.random.randn(dates.size, 4), index=dates, columns=list('ABCD'))
print df
                   A         B         C         D
2014-02-01 -1.054596  1.121003 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
df[df.B>0]
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.054596 | 1.121003 | -0.320041 | -0.692536 |
2014-02-03 | -0.009586 | 0.361285 | 1.257356 | 2.206935 |
2014-02-04 | 0.280065 | 0.011517 | 0.602386 | 0.275055 |
2014-02-05 | 2.100321 | 1.131649 | -0.251465 | 0.250192 |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | NaN | 1.121003 | NaN | NaN |
2014-02-02 | 0.714781 | NaN | 1.067904 | NaN |
2014-02-03 | NaN | 0.361285 | 1.257356 | 2.206935 |
2014-02-04 | 0.280065 | 0.011517 | 0.602386 | 0.275055 |
2014-02-05 | 2.100321 | 1.131649 | NaN | 0.250192 |
2014-02-06 | 0.808438 | NaN | NaN | NaN |
df2 = df.copy()
df2['E']=['one', 'one','two','three','four','three']
print df2
                   A         B         C         D      E
2014-02-01 -1.054596  1.121003 -0.320041 -0.692536    one
2014-02-02  0.714781 -0.604180  1.067904 -1.194036    one
2014-02-03 -0.009586  0.361285  1.257356  2.206935    two
2014-02-04  0.280065  0.011517  0.602386  0.275055  three
2014-02-05  2.100321  1.131649 -0.251465  0.250192   four
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994  three
df2[df2['E'].isin(['one'])]
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.054596 | 1.121003 | -0.320041 | -0.692536 | one |
2014-02-02 | 0.714781 | -0.604180 | 1.067904 | -1.194036 | one |
df.at[dates[0],'A'] = 0
print df
                   A         B         C         D
2014-02-01  0.000000  1.121003 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
df.iat[0,1] = 0
print df
                   A         B         C         D
2014-02-01  0.000000  0.000000 -0.320041 -0.692536
2014-02-02  0.714781 -0.604180  1.067904 -1.194036
2014-02-03 -0.009586  0.361285  1.257356  2.206935
2014-02-04  0.280065  0.011517  0.602386  0.275055
2014-02-05  2.100321  1.131649 -0.251465  0.250192
2014-02-06  0.808438 -0.122169 -2.123913 -1.208994
from random import randint
df = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],
'B': [randint(1, 9)*10 for x in xrange(10)],
'C': [randint(1, 9)*100 for x in xrange(10)]})
print df
   A   B    C
0  8  30  600
1  8  10  300
2  8  80  600
3  6  80  300
4  3  10  600
5  4  90  700
6  2  50  300
7  4  40  700
8  8  40  100
9  7  70  100
Find the entries in A for which the corresponding values in B are greater than 50 and those in C are equal to 900.
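A possible solution sketch: combine the two conditions with & (element-wise AND). Note that each comparison needs its own parentheses, and the result may well be empty for a given random draw.

print(df[(df['B'] > 50) & (df['C'] == 900)]['A'])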
fixed_df['Berri 1'].plot()
<matplotlib.axes.AxesSubplot at 0x10c685110>
Many data sets simply arrive with missing data, either because it exists and was not collected or it never existed. For example, in a collection of financial time series, some of the time series might start on different dates. Thus, values prior to the start date would generally be marked as missing.
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While NaN is the default missing-value marker (for reasons of computational speed and convenience), we need to be able to detect missing values easily in data of different types: floating point, integer, boolean, and general object. In many cases, however, Python's None will arise, and we wish to consider that "missing" or "null" as well.
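For instance, both markers are picked up by pd.isnull (a quick check on a new throwaway Series):

print(pd.isnull(pd.Series([1.0, np.nan, None])))
#0    False
#1     True
#2     True
#dtype: bool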
df= pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print df
                   A         B         C         D
2014-02-01 -1.059138 -0.196474  0.239179  1.191028
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749
2014-02-04 -0.340011  1.304989  0.717388 -0.268375
2014-02-05 -0.431061 -0.681776  0.147624  0.209896
2014-02-06 -0.326568  0.577446  0.682139  0.210614
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
print df1
                   A         B         C         D   E
2014-02-01 -1.059138 -0.196474  0.239179  1.191028 NaN
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219 NaN
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749 NaN
2014-02-04 -0.340011  1.304989  0.717388 -0.268375 NaN
df1.loc[dates[0]:dates[1],'E'] = 1
print df1
                   A         B         C         D    E
2014-02-01 -1.059138 -0.196474  0.239179  1.191028    1
2014-02-02 -0.641067  1.734050 -0.359996 -0.126219    1
2014-02-03  0.357383 -0.664820 -0.601961 -0.964749  NaN
2014-02-04 -0.340011  1.304989  0.717388 -0.268375  NaN
df1.dropna(how='all') #any
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.059138 | -0.196474 | 0.239179 | 1.191028 | 1 |
2014-02-02 | -0.641067 | 1.734050 | -0.359996 | -0.126219 | 1 |
2014-02-03 | 0.357383 | -0.664820 | -0.601961 | -0.964749 | NaN |
2014-02-04 | -0.340011 | 1.304989 | 0.717388 | -0.268375 | NaN |
df1.fillna(value=15)
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | -1.059138 | -0.196474 | 0.239179 | 1.191028 | 1 |
2014-02-02 | -0.641067 | 1.734050 | -0.359996 | -0.126219 | 1 |
2014-02-03 | 0.357383 | -0.664820 | -0.601961 | -0.964749 | 15 |
2014-02-04 | -0.340011 | 1.304989 | 0.717388 | -0.268375 | 15 |
pd.isnull(df1)
A | B | C | D | E | |
---|---|---|---|---|---|
2014-02-01 | False | False | False | False | False |
2014-02-02 | False | False | False | False | False |
2014-02-03 | False | False | False | False | True |
2014-02-04 | False | False | False | False | True |
Missing values propagate through arithmetic operations.
df2=pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
print df2
df2.loc[dates[0]:dates[2],'B']=float('NaN')
print df2
print df1+df2
                   A         B         C         D
2014-02-01 -1.654418  0.697809  0.991626  0.511788
2014-02-02 -1.163772  0.239813 -0.945453  0.192696
2014-02-03 -0.048510 -1.188224  1.718062  1.868163
2014-02-04 -0.277427 -1.509794  0.360021  2.071887
2014-02-05  0.008847 -2.179037 -0.074886  0.649411
2014-02-06 -0.019938  0.121653  1.180238 -1.312769
                   A         B         C         D
2014-02-01 -1.654418       NaN  0.991626  0.511788
2014-02-02 -1.163772       NaN -0.945453  0.192696
2014-02-03 -0.048510       NaN  1.718062  1.868163
2014-02-04 -0.277427 -1.509794  0.360021  2.071887
2014-02-05  0.008847 -2.179037 -0.074886  0.649411
2014-02-06 -0.019938  0.121653  1.180238 -1.312769
                   A         B         C         D   E
2014-02-01 -2.713556       NaN  1.230805  1.702816 NaN
2014-02-02 -1.804838       NaN -1.305449  0.066476 NaN
2014-02-03  0.308873       NaN  1.116101  0.903414 NaN
2014-02-04 -0.617438 -0.204804  1.077410  1.803513 NaN
2014-02-05        NaN       NaN       NaN       NaN NaN
2014-02-06        NaN       NaN       NaN       NaN NaN
But this can be avoided by using the built-in methods that exclude missing values.
df1['A'].sum()
-1.6828330115428896
df1.mean(1)
2014-02-01    0.234919
2014-02-02    0.321353
2014-02-03   -0.468537
2014-02-04    0.353498
Freq: D, dtype: float64
df2.cumsum()
A | B | C | D | |
---|---|---|---|---|
2014-02-01 | -1.654418 | NaN | 0.991626 | 0.511788 |
2014-02-02 | -2.818190 | NaN | 0.046173 | 0.704484 |
2014-02-03 | -2.866700 | NaN | 1.764235 | 2.572647 |
2014-02-04 | -3.144127 | -1.509794 | 2.124256 | 4.644534 |
2014-02-05 | -3.135281 | -3.688830 | 2.049370 | 5.293945 |
2014-02-06 | -3.155219 | -3.567178 | 3.229609 | 3.981177 |
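Most of these reductions also take a skipna flag if you would rather let the NaN propagate; a quick sketch using the df2 defined above (column B still holds NaN in its first three rows):

print(df2['B'].sum())              #NaNs skipped by default
print(df2['B'].sum(skipna=False))  #NaN propagates: the result is NaN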
#Gaussian numbers histogram
from numpy.random import normal
n = 1000
x = pd.Series(normal(size=n))
#print x
avg = x.mean()
std = x.std()
x_avg = pd.Series(np.ones(n)* avg)
x_stdl = pd.Series(np.ones(n)*(avg-std))
x_stdh = pd.Series(np.ones(n)*(avg+std))
df_gauss=pd.DataFrame({'A':x_stdl,'B':x_stdh,'x':x})
df_gauss.plot(style=['rx','rx','bx'])
plt.figure()
df_gauss['x'].diff().hist(color='g', bins=50)
<matplotlib.axes.AxesSubplot at 0x10c893b10>
df=pd.DataFrame(np.random.randn(5,5), columns=list('ABCDE'))
print df
          A         B         C         D         E
0 -1.353547 -0.059735 -0.597045 -0.299746  1.335253
1 -0.621872 -0.592243  0.060789 -0.366381  0.186925
2  0.382292  0.201983  0.828402 -0.869741 -0.448232
3 -2.099593 -0.471666 -0.422174 -1.474813 -0.173611
4  1.540184 -0.721423 -0.135882 -0.793001 -0.629852
Try the following with df as defined above:
1. df.mean()
2. df.apply(np.cumsum)
3. df.apply(lambda x: x.max() - x.min())
4. Plot a histogram
df.apply(lambda x: x.max() - x.min())
#What does lambda do?
A    3.639777
B    0.923406
C    1.425447
D    1.175067
E    1.965105
dtype: float64
def f(x):
    return x*2

g = lambda x: x*2
print g(3)
6
from pandas import read_csv
from urllib import urlopen
page = urlopen("http://econpy.pythonanywhere.com/ex/NFL_1979.csv")
df = read_csv(page)
print df[:3]
         Date          Visitor  Visitor Score             Home Team  \
0  09/01/1979    Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979  Atlanta Falcons             40    New Orleans Saints
2  09/02/1979  Baltimore Colts              0    Kansas City Chiefs

   Home Score  Line  Total Line
0          31     3          30
1          34     5          32
2          14     1          37
df1=df[0:10]
print df1
         Date             Visitor  Visitor Score             Home Team  \
0  09/01/1979       Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979     Atlanta Falcons             40    New Orleans Saints
2  09/02/1979     Baltimore Colts              0    Kansas City Chiefs
3  09/02/1979  Cincinnati Bengals              0        Denver Broncos
4  09/02/1979    Cleveland Browns             25         New York Jets
5  09/02/1979      Dallas Cowboys             22    St Louis Cardinals
6  09/02/1979   Green Bay Packers              3         Chicago Bears
7  09/02/1979      Houston Oilers             29   Washington Redskins
8  09/02/1979      Miami Dolphins              9         Buffalo Bills
9  09/02/1979     New York Giants             17   Philadelphia Eagles

   Home Score  Line  Total Line
0          31     3        30.0
1          34     5        32.0
2          14     1        37.0
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
A=df1[:3]
B=df1[3:7]
C=df1[7:10]
print A,B,C
         Date          Visitor  Visitor Score             Home Team  \
0  09/01/1979    Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979  Atlanta Falcons             40    New Orleans Saints
2  09/02/1979  Baltimore Colts              0    Kansas City Chiefs

   Home Score  Line  Total Line
0          31     3          30
1          34     5          32
2          14     1          37
         Date             Visitor  Visitor Score           Home Team  \
3  09/02/1979  Cincinnati Bengals              0      Denver Broncos
4  09/02/1979    Cleveland Browns             25       New York Jets
5  09/02/1979      Dallas Cowboys             22  St Louis Cardinals
6  09/02/1979   Green Bay Packers              3       Chicago Bears

   Home Score  Line  Total Line
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
         Date          Visitor  Visitor Score            Home Team  \
7  09/02/1979   Houston Oilers             29  Washington Redskins
8  09/02/1979   Miami Dolphins              9        Buffalo Bills
9  09/02/1979  New York Giants             17  Philadelphia Eagles

   Home Score  Line  Total Line
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
parts=[A,B,C]
df2=pd.concat(parts)
print df2
         Date             Visitor  Visitor Score             Home Team  \
0  09/01/1979       Detroit Lions             16  Tampa Bay Buccaneers
1  09/02/1979     Atlanta Falcons             40    New Orleans Saints
2  09/02/1979     Baltimore Colts              0    Kansas City Chiefs
3  09/02/1979  Cincinnati Bengals              0        Denver Broncos
4  09/02/1979    Cleveland Browns             25         New York Jets
5  09/02/1979      Dallas Cowboys             22    St Louis Cardinals
6  09/02/1979   Green Bay Packers              3         Chicago Bears
7  09/02/1979      Houston Oilers             29   Washington Redskins
8  09/02/1979      Miami Dolphins              9         Buffalo Bills
9  09/02/1979     New York Giants             17   Philadelphia Eagles

   Home Score  Line  Total Line
0          31     3        30.0
1          34     5        32.0
2          14     1        37.0
3          10     3        31.5
4          22     2        41.0
5          21    -4        37.0
6           6     3        31.0
7          27    -4        33.0
8           7    -5        39.0
9          23     7        31.5
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right= pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
print left
print right
   key  lval
0  foo     1
1  foo     2
   key  rval
0  foo     4
1  foo     5
pd.merge(left, right, on='key')
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
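By default merge keeps only keys present in both frames (an inner join); the how parameter changes this. A small sketch with new throwaway frames left2 and right2:

left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
print(pd.merge(left2, right2, on='key', how='outer'))  #unmatched keys kept, gaps filled with NaN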
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
rowadd=df.iloc[3]
print rowadd,df
A    0.029758
B   -1.297145
C    0.322048
D   -2.251859
Name: 3, dtype: float64
          A         B         C         D
0  0.096170 -0.100691 -0.083139  0.157201
1 -1.167492 -1.671596 -1.636372  0.895365
2  0.704329 -0.995257 -0.027827  1.043940
3  0.029758 -1.297145  0.322048 -2.251859
4 -0.086081 -0.144922  0.673781 -0.468736
5 -0.196530 -1.329766 -0.142091  0.161405
6  0.870045  1.492161 -1.159585  0.299559
7  1.599507  0.218105  0.400768 -0.442777
df.append(rowadd,ignore_index=True)
A | B | C | D | |
---|---|---|---|---|
0 | 0.096170 | -0.100691 | -0.083139 | 0.157201 |
1 | -1.167492 | -1.671596 | -1.636372 | 0.895365 |
2 | 0.704329 | -0.995257 | -0.027827 | 1.043940 |
3 | 0.029758 | -1.297145 | 0.322048 | -2.251859 |
4 | -0.086081 | -0.144922 | 0.673781 | -0.468736 |
5 | -0.196530 | -1.329766 | -0.142091 | 0.161405 |
6 | 0.870045 | 1.492161 | -1.159585 | 0.299559 |
7 | 1.599507 | 0.218105 | 0.400768 | -0.442777 |
8 | 0.029758 | -1.297145 | 0.322048 | -2.251859 |
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
print df
     A      B         C         D
0  foo    one -0.508353 -1.725045
1  bar    one -0.014625  1.706129
2  foo    two  0.041269  1.253173
3  bar  three  1.302055 -1.497082
4  foo    two  0.116896  0.007921
5  bar    two -0.009417 -0.083856
6  foo    one -1.478390 -0.921723
7  foo  three  0.451666 -0.119239
df.groupby('A').sum()
C | D | |
---|---|---|
A | ||
bar | 1.278014 | 0.125191 |
foo | -1.376912 | -1.504914 |
df.groupby(['A','B']).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | -0.014625 | 1.706129 |
three | 1.302055 | -1.497082 | |
two | -0.009417 | -0.083856 | |
foo | one | -1.986743 | -2.646768 |
three | 0.451666 | -0.119239 | |
two | 0.158165 | 1.261093 |
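A groupby can also apply several reductions to one column at once through agg; a quick sketch on the same frame:

print(df.groupby('A')['C'].agg(['mean', 'count']))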
pandas allows for using some built-in statistical methods to compare, fit, or interpolate data.
Regression analysis refers to the process of estimating relationships between variables. Linear regression is equivalent to fitting a line to a set of data points (x, y):
$$y_i = a_0 + a_1 x_i$$
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
url = "http://vincentarelbundock.github.com/Rdatasets/csv/HistData/Guerry.csv"
df = pd.read_csv(url)
#print df
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
Lottery | Literacy | Wealth | Region | |
---|---|---|---|---|
0 | 41 | 37 | 73 | E |
1 | 38 | 51 | 22 | N |
2 | 66 | 13 | 61 | C |
3 | 80 | 46 | 76 | E |
4 | 79 | 69 | 83 | E |
mod = sm.ols(formula='Lottery ~ Literacy ', data=df)
res = mod.fit()
print res.summary()
intercept, slope =res.params
                            OLS Regression Results
==============================================================================
Dep. Variable:                Lottery   R-squared:                       0.146
Model:                            OLS   Adj. R-squared:                  0.135
Method:                 Least Squares   F-statistic:                     14.16
Date:                Mon, 09 Feb 2015   Prob (F-statistic):           0.000312
Time:                        14:29:19   Log-Likelihood:                -386.13
No. Observations:                  85   AIC:                             776.3
Df Residuals:                      83   BIC:                             781.2
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     64.2389      6.163     10.423      0.000        51.981    76.497
Literacy      -0.5417      0.144     -3.763      0.000        -0.828    -0.255
==============================================================================
Omnibus:                        7.455   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.024   Jarque-Bera (JB):                2.936
Skew:                           0.061   Prob(JB):                        0.230
Kurtosis:                       2.098   Cond. No.                         106.
==============================================================================
xtest=np.linspace(1,100,100)
ytest=intercept+slope*xtest
plt.plot(df['Literacy'],df['Lottery'],'kx')
plt.plot(xtest,ytest,'r')
plt.show()
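The formula interface extends naturally to multiple predictors, and a categorical variable such as Region can be wrapped in C() to get dummy coding. A sketch, mirroring the example in the statsmodels documentation for this dataset:

mod2 = sm.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df)
res2 = mod2.fit()
print(res2.params)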
The t-test assesses whether the means of two groups are statistically different from each other.
town1_heights = pd.Series([5, 6, 7, 6, 7.1, 6, 4])
town2_heights = pd.Series([5.5, 6.5, 7, 6, 7.1, 6])
town1_mean = town1_heights.mean()
town2_mean = town2_heights.mean()
print "Town 1 avg. height", town1_mean
print "Town 2 avg. height", town2_mean
print "Effect size: ", abs(town1_mean - town2_mean)
df=pd.DataFrame({'T1':town1_heights,'T2':town2_heights})
b=df.boxplot()
Town 1 avg. height 5.87142857143
Town 2 avg. height 6.35
Effect size:  0.478571428571
from scipy import stats
print "Town 1 Shapiro-Wilks p-value", stats.shapiro(town1_heights)[1]
print " T-Test p-value:", stats.ttest_ind(town1_heights, town2_heights,equal_var = False)[1]
Town 1 Shapiro-Wilk p-value 0.380458295345
 T-Test p-value: 0.347028503558
A time series is a sequence of data points, measured typically at successive points in time spaced at uniform time intervals.
rng = pd.date_range('1/1/2012', periods=100, freq='S')
print rng
<class 'pandas.tseries.index.DatetimeIndex'> [2012-01-01 00:00:00, ..., 2012-01-01 00:01:39] Length: 100, Freq: S, Timezone: None
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.plot()
<matplotlib.axes.AxesSubplot at 0x10f5b9ed0>
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts=ts.cumsum()
ts.plot()
<matplotlib.axes.AxesSubplot at 0x10f7e3490>
#Bar plot
ts = pd.DataFrame(np.random.randn(1000,5), index=pd.date_range('1/1/2000', periods=1000))
ts=ts.cumsum()
print ts.ix[5]  #row 5; .ix mixes label- and position-based lookup (later deprecated in favor of .loc/.iloc)
ts.ix[5].plot(kind='bar'); plt.axhline(0, color='k')
0   -4.948548
1   -1.926755
2    2.653564
3    0.093642
4   -3.095961
Name: 2000-01-06 00:00:00, dtype: float64
<matplotlib.lines.Line2D at 0x10f7aee10>
Imagine yourself as a sales analyst at an apparel company. Your boss asks you to look at weather data from the past year to understand how the weather varies over the months, so that you can have the right apparel on display at the appropriate time.
You can get the data from here (so your company is Canadian). The template for downloading the data is:
url_template = "http://climate.weather.gc.ca/climateData/bulkdata_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
Usually in data tasks there are no precisely specified objectives; one needs to play around with the data in order to derive inferences. While this might seem like a vague and daunting task, it simply requires a start: once you get familiar with the data, you will find patterns and be able to draw an initial set of conclusions.
Let's start with the data for March 2012 (there seems to be less data for the more recent years).
url = url_template.format(month=3, year=2012)
weather_mar2012 = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True, encoding='latin1')
weather_mar2012
Year | Month | Day | Time | Data Quality | Temp (°C) | Temp Flag | Dew Point Temp (°C) | Dew Point Temp Flag | Rel Hum (%) | ... | Wind Spd Flag | Visibility (km) | Visibility Flag | Stn Press (kPa) | Stn Press Flag | Hmdx | Hmdx Flag | Wind Chill | Wind Chill Flag | Weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | |||||||||||||||||||||
2012-03-01 00:00:00 | 2012 | 3 | 1 | 00:00 | -5.5 | NaN | -9.7 | NaN | 72 | ... | NaN | 4.0 | NaN | 100.97 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 01:00:00 | 2012 | 3 | 1 | 01:00 | -5.7 | NaN | -8.7 | NaN | 79 | ... | NaN | 2.4 | NaN | 100.87 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 02:00:00 | 2012 | 3 | 1 | 02:00 | -5.4 | NaN | -8.3 | NaN | 80 | ... | NaN | 4.8 | NaN | 100.80 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 03:00:00 | 2012 | 3 | 1 | 03:00 | -4.7 | NaN | -7.7 | NaN | 79 | ... | NaN | 4.0 | NaN | 100.69 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 04:00:00 | 2012 | 3 | 1 | 04:00 | -5.4 | NaN | -7.8 | NaN | 83 | ... | NaN | 1.6 | NaN | 100.62 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 05:00:00 | 2012 | 3 | 1 | 05:00 | -5.3 | NaN | -7.9 | NaN | 82 | ... | NaN | 2.4 | NaN | 100.58 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 06:00:00 | 2012 | 3 | 1 | 06:00 | -5.2 | NaN | -7.8 | NaN | 82 | ... | NaN | 4.0 | NaN | 100.57 | NaN | NaN | NaN | -14 | NaN | Snow | |
2012-03-01 07:00:00 | 2012 | 3 | 1 | 07:00 | -4.9 | NaN | -7.4 | NaN | 83 | ... | NaN | 1.6 | NaN | 100.59 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 08:00:00 | 2012 | 3 | 1 | 08:00 | -5.0 | NaN | -7.5 | NaN | 83 | ... | NaN | 1.2 | NaN | 100.59 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 09:00:00 | 2012 | 3 | 1 | 09:00 | -4.9 | NaN | -7.5 | NaN | 82 | ... | NaN | 1.6 | NaN | 100.60 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 10:00:00 | 2012 | 3 | 1 | 10:00 | -4.7 | NaN | -7.3 | NaN | 82 | ... | NaN | 1.2 | NaN | 100.62 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-01 11:00:00 | 2012 | 3 | 1 | 11:00 | -4.4 | NaN | -6.8 | NaN | 83 | ... | NaN | 1.0 | NaN | 100.66 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 12:00:00 | 2012 | 3 | 1 | 12:00 | -4.3 | NaN | -6.8 | NaN | 83 | ... | NaN | 1.2 | NaN | 100.66 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 13:00:00 | 2012 | 3 | 1 | 13:00 | -4.3 | NaN | -6.9 | NaN | 82 | ... | NaN | 1.2 | NaN | 100.65 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-01 14:00:00 | 2012 | 3 | 1 | 14:00 | -3.9 | NaN | -6.6 | NaN | 81 | ... | NaN | 1.2 | NaN | 100.67 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-01 15:00:00 | 2012 | 3 | 1 | 15:00 | -3.3 | NaN | -6.2 | NaN | 80 | ... | NaN | 1.6 | NaN | 100.71 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 16:00:00 | 2012 | 3 | 1 | 16:00 | -2.7 | NaN | -5.7 | NaN | 80 | ... | NaN | 2.4 | NaN | 100.74 | NaN | NaN | NaN | -8 | NaN | Snow | |
2012-03-01 17:00:00 | 2012 | 3 | 1 | 17:00 | -2.9 | NaN | -5.9 | NaN | 80 | ... | NaN | 4.0 | NaN | 100.80 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 18:00:00 | 2012 | 3 | 1 | 18:00 | -3.0 | NaN | -6.0 | NaN | 80 | ... | NaN | 4.0 | NaN | 100.87 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 19:00:00 | 2012 | 3 | 1 | 19:00 | -3.6 | NaN | -6.4 | NaN | 81 | ... | NaN | 3.2 | NaN | 100.93 | NaN | NaN | NaN | -9 | NaN | Snow | |
2012-03-01 20:00:00 | 2012 | 3 | 1 | 20:00 | -3.7 | NaN | -6.4 | NaN | 81 | ... | NaN | 4.8 | NaN | 100.95 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 21:00:00 | 2012 | 3 | 1 | 21:00 | -3.9 | NaN | -6.7 | NaN | 81 | ... | NaN | 6.4 | NaN | 100.98 | NaN | NaN | NaN | -10 | NaN | Snow | |
2012-03-01 22:00:00 | 2012 | 3 | 1 | 22:00 | -4.3 | NaN | -6.9 | NaN | 82 | ... | NaN | 2.4 | NaN | 101.00 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-01 23:00:00 | 2012 | 3 | 1 | 23:00 | -4.3 | NaN | -7.1 | NaN | 81 | ... | NaN | 4.8 | NaN | 101.04 | NaN | NaN | NaN | -11 | NaN | Snow | |
2012-03-02 00:00:00 | 2012 | 3 | 2 | 00:00 | -4.8 | NaN | -7.3 | NaN | 83 | ... | NaN | 3.2 | NaN | 101.04 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 01:00:00 | 2012 | 3 | 2 | 01:00 | -5.3 | NaN | -7.9 | NaN | 82 | ... | NaN | 4.8 | NaN | 101.09 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 02:00:00 | 2012 | 3 | 2 | 02:00 | -5.2 | NaN | -7.8 | NaN | 82 | ... | NaN | 6.4 | NaN | 101.11 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 03:00:00 | 2012 | 3 | 2 | 03:00 | -5.5 | NaN | -7.9 | NaN | 83 | ... | NaN | 4.8 | NaN | 101.15 | NaN | NaN | NaN | -12 | NaN | Snow | |
2012-03-02 04:00:00 | 2012 | 3 | 2 | 04:00 | -5.6 | NaN | -8.2 | NaN | 82 | ... | NaN | 6.4 | NaN | 101.15 | NaN | NaN | NaN | -13 | NaN | Snow | |
2012-03-02 05:00:00 | 2012 | 3 | 2 | 05:00 | -5.5 | NaN | -8.3 | NaN | 81 | ... | NaN | 12.9 | NaN | 101.15 | NaN | NaN | NaN | -12 | NaN | Snow | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-03-30 18:00:00 | 2012 | 3 | 30 | 18:00 | 3.9 | NaN | -7.9 | NaN | 42 | ... | NaN | 24.1 | NaN | 101.26 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 19:00:00 | 2012 | 3 | 30 | 19:00 | 3.1 | NaN | -6.7 | NaN | 49 | ... | NaN | 25.0 | NaN | 101.29 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 20:00:00 | 2012 | 3 | 30 | 20:00 | 3.0 | NaN | -8.4 | NaN | 43 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 21:00:00 | 2012 | 3 | 30 | 21:00 | 1.7 | NaN | -9.0 | NaN | 45 | ... | NaN | 25.0 | NaN | 101.32 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-30 22:00:00 | 2012 | 3 | 30 | 22:00 | 0.4 | NaN | -8.1 | NaN | 53 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-30 23:00:00 | 2012 | 3 | 30 | 23:00 | 1.4 | NaN | -7.7 | NaN | 51 | ... | NaN | 25.0 | NaN | 101.34 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 00:00:00 | 2012 | 3 | 31 | 00:00 | 1.5 | NaN | -8.6 | NaN | 47 | ... | NaN | 25.0 | NaN | 101.33 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 01:00:00 | 2012 | 3 | 31 | 01:00 | 1.3 | NaN | -9.6 | NaN | 44 | ... | NaN | 25.0 | NaN | 101.31 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 02:00:00 | 2012 | 3 | 31 | 02:00 | 1.3 | NaN | -9.7 | NaN | 44 | ... | NaN | 25.0 | NaN | 101.29 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 03:00:00 | 2012 | 3 | 31 | 03:00 | 0.7 | NaN | -8.8 | NaN | 49 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 04:00:00 | 2012 | 3 | 31 | 04:00 | -0.9 | NaN | -8.5 | NaN | 56 | ... | NaN | 25.0 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 05:00:00 | 2012 | 3 | 31 | 05:00 | -0.6 | NaN | -9.2 | NaN | 52 | ... | NaN | 25.0 | NaN | 101.30 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 06:00:00 | 2012 | 3 | 31 | 06:00 | -0.5 | NaN | -9.2 | NaN | 52 | ... | NaN | 48.3 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 07:00:00 | 2012 | 3 | 31 | 07:00 | -0.3 | NaN | -9.2 | NaN | 51 | ... | NaN | 48.3 | NaN | 101.32 | NaN | NaN | NaN | -5 | NaN | Cloudy | |
2012-03-31 08:00:00 | 2012 | 3 | 31 | 08:00 | 0.7 | NaN | -8.5 | NaN | 50 | ... | NaN | 48.3 | NaN | 101.33 | NaN | NaN | NaN | NaN | NaN | Cloudy | |
2012-03-31 09:00:00 | 2012 | 3 | 31 | 09:00 | 1.5 | NaN | -7.8 | NaN | 50 | ... | NaN | 48.3 | NaN | 101.34 | NaN | NaN | NaN | NaN | NaN | Mostly Cloudy | |
2012-03-31 10:00:00 | 2012 | 3 | 31 | 10:00 | 2.9 | NaN | -8.1 | NaN | 44 | ... | NaN | 48.3 | NaN | 101.30 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 11:00:00 | 2012 | 3 | 31 | 11:00 | 4.6 | NaN | -9.7 | NaN | 35 | ... | NaN | 48.3 | NaN | 101.24 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 12:00:00 | 2012 | 3 | 31 | 12:00 | 6.4 | NaN | -7.1 | NaN | 37 | ... | NaN | 48.3 | NaN | 101.16 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 13:00:00 | 2012 | 3 | 31 | 13:00 | 6.5 | NaN | -9.7 | NaN | 30 | ... | NaN | 48.3 | NaN | 101.08 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 14:00:00 | 2012 | 3 | 31 | 14:00 | 7.7 | NaN | -8.5 | NaN | 31 | ... | NaN | 48.3 | NaN | 101.01 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 15:00:00 | 2012 | 3 | 31 | 15:00 | 7.7 | NaN | -8.6 | NaN | 30 | ... | NaN | 48.3 | NaN | 100.94 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 16:00:00 | 2012 | 3 | 31 | 16:00 | 8.4 | NaN | -7.7 | NaN | 31 | ... | NaN | 48.3 | NaN | 100.89 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 17:00:00 | 2012 | 3 | 31 | 17:00 | 7.9 | NaN | -8.1 | NaN | 31 | ... | NaN | 48.3 | NaN | 100.88 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 18:00:00 | 2012 | 3 | 31 | 18:00 | 7.0 | NaN | -8.2 | NaN | 33 | ... | NaN | 48.3 | NaN | 100.87 | NaN | NaN | NaN | NaN | NaN | Mainly Clear | |
2012-03-31 19:00:00 | 2012 | 3 | 31 | 19:00 | 5.9 | NaN | -8.0 | NaN | 36 | ... | NaN | 25.0 | NaN | 100.88 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 20:00:00 | 2012 | 3 | 31 | 20:00 | 4.4 | NaN | -7.2 | NaN | 43 | ... | NaN | 25.0 | NaN | 100.85 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 21:00:00 | 2012 | 3 | 31 | 21:00 | 2.6 | NaN | -6.3 | NaN | 52 | ... | NaN | 25.0 | NaN | 100.86 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 22:00:00 | 2012 | 3 | 31 | 22:00 | 2.7 | NaN | -6.7 | NaN | 50 | ... | NaN | 25.0 | NaN | 100.82 | NaN | NaN | NaN | NaN | NaN | Clear | |
2012-03-31 23:00:00 | 2012 | 3 | 31 | 23:00 | 1.5 | NaN | -6.9 | NaN | 54 | ... | NaN | 25.0 | NaN | 100.79 | NaN | NaN | NaN | NaN | NaN | Clear |
744 rows × 24 columns
We are only interested in the temperatures, so let's plot that column for March 2012. But the Temp column name contains special characters (a degree symbol), which might be painful to reuse, so let's fix that first!
weather_mar2012.columns = [
u'Year', u'Month', u'Day', u'Time', u'Data Quality', u'Temp (C)',
u'Temp Flag', u'Dew Point Temp (C)', u'Dew Point Temp Flag',
u'Rel Hum (%)', u'Rel Hum Flag', u'Wind Dir (10s deg)', u'Wind Dir Flag',
u'Wind Spd (km/h)', u'Wind Spd Flag', u'Visibility (km)', u'Visibility Flag',
u'Stn Press (kPa)', u'Stn Press Flag', u'Hmdx', u'Hmdx Flag', u'Wind Chill',
u'Wind Chill Flag', u'Weather']
weather_mar2012[u'Temp (C)'].plot(figsize=(15, 5))
<matplotlib.axes.AxesSubplot at 0x10fe28250>
There are also many columns consisting largely of NA values. They are of no use in our analyses, so we can go ahead and drop every column that contains any missing values.
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')
weather_mar2012
Year | Month | Day | Time | Data Quality | Temp (C) | Dew Point Temp (C) | Rel Hum (%) | Wind Spd (km/h) | Visibility (km) | Stn Press (kPa) | Weather | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Date/Time | ||||||||||||
2012-03-01 00:00:00 | 2012 | 3 | 1 | 00:00 | -5.5 | -9.7 | 72 | 24 | 4.0 | 100.97 | Snow | |
2012-03-01 01:00:00 | 2012 | 3 | 1 | 01:00 | -5.7 | -8.7 | 79 | 26 | 2.4 | 100.87 | Snow | |
2012-03-01 02:00:00 | 2012 | 3 | 1 | 02:00 | -5.4 | -8.3 | 80 | 28 | 4.8 | 100.80 | Snow | |
2012-03-01 03:00:00 | 2012 | 3 | 1 | 03:00 | -4.7 | -7.7 | 79 | 28 | 4.0 | 100.69 | Snow | |
2012-03-01 04:00:00 | 2012 | 3 | 1 | 04:00 | -5.4 | -7.8 | 83 | 35 | 1.6 | 100.62 | Snow | |
2012-03-01 05:00:00 | 2012 | 3 | 1 | 05:00 | -5.3 | -7.9 | 82 | 33 | 2.4 | 100.58 | Snow | |
2012-03-01 06:00:00 | 2012 | 3 | 1 | 06:00 | -5.2 | -7.8 | 82 | 33 | 4.0 | 100.57 | Snow | |
2012-03-01 07:00:00 | 2012 | 3 | 1 | 07:00 | -4.9 | -7.4 | 83 | 30 | 1.6 | 100.59 | Snow | |
2012-03-01 08:00:00 | 2012 | 3 | 1 | 08:00 | -5.0 | -7.5 | 83 | 32 | 1.2 | 100.59 | Snow | |
2012-03-01 09:00:00 | 2012 | 3 | 1 | 09:00 | -4.9 | -7.5 | 82 | 32 | 1.6 | 100.60 | Snow | |
2012-03-01 10:00:00 | 2012 | 3 | 1 | 10:00 | -4.7 | -7.3 | 82 | 32 | 1.2 | 100.62 | Snow | |
2012-03-01 11:00:00 | 2012 | 3 | 1 | 11:00 | -4.4 | -6.8 | 83 | 28 | 1.0 | 100.66 | Snow | |
2012-03-01 12:00:00 | 2012 | 3 | 1 | 12:00 | -4.3 | -6.8 | 83 | 30 | 1.2 | 100.66 | Snow | |
2012-03-01 13:00:00 | 2012 | 3 | 1 | 13:00 | -4.3 | -6.9 | 82 | 28 | 1.2 | 100.65 | Snow | |
2012-03-01 14:00:00 | 2012 | 3 | 1 | 14:00 | -3.9 | -6.6 | 81 | 28 | 1.2 | 100.67 | Snow | |
2012-03-01 15:00:00 | 2012 | 3 | 1 | 15:00 | -3.3 | -6.2 | 80 | 24 | 1.6 | 100.71 | Snow | |
2012-03-01 16:00:00 | 2012 | 3 | 1 | 16:00 | -2.7 | -5.7 | 80 | 19 | 2.4 | 100.74 | Snow | |
2012-03-01 17:00:00 | 2012 | 3 | 1 | 17:00 | -2.9 | -5.9 | 80 | 20 | 4.0 | 100.80 | Snow | |
2012-03-01 18:00:00 | 2012 | 3 | 1 | 18:00 | -3.0 | -6.0 | 80 | 19 | 4.0 | 100.87 | Snow | |
2012-03-01 19:00:00 | 2012 | 3 | 1 | 19:00 | -3.6 | -6.4 | 81 | 17 | 3.2 | 100.93 | Snow | |
2012-03-01 20:00:00 | 2012 | 3 | 1 | 20:00 | -3.7 | -6.4 | 81 | 20 | 4.8 | 100.95 | Snow | |
2012-03-01 21:00:00 | 2012 | 3 | 1 | 21:00 | -3.9 | -6.7 | 81 | 22 | 6.4 | 100.98 | Snow | |
2012-03-01 22:00:00 | 2012 | 3 | 1 | 22:00 | -4.3 | -6.9 | 82 | 22 | 2.4 | 101.00 | Snow | |
2012-03-01 23:00:00 | 2012 | 3 | 1 | 23:00 | -4.3 | -7.1 | 81 | 22 | 4.8 | 101.04 | Snow | |
2012-03-02 00:00:00 | 2012 | 3 | 2 | 00:00 | -4.8 | -7.3 | 83 | 22 | 3.2 | 101.04 | Snow | |
2012-03-02 01:00:00 | 2012 | 3 | 2 | 01:00 | -5.3 | -7.9 | 82 | 20 | 4.8 | 101.09 | Snow | |
2012-03-02 02:00:00 | 2012 | 3 | 2 | 02:00 | -5.2 | -7.8 | 82 | 19 | 6.4 | 101.11 | Snow | |
2012-03-02 03:00:00 | 2012 | 3 | 2 | 03:00 | -5.5 | -7.9 | 83 | 19 | 4.8 | 101.15 | Snow | |
2012-03-02 04:00:00 | 2012 | 3 | 2 | 04:00 | -5.6 | -8.2 | 82 | 24 | 6.4 | 101.15 | Snow | |
2012-03-02 05:00:00 | 2012 | 3 | 2 | 05:00 | -5.5 | -8.3 | 81 | 19 | 12.9 | 101.15 | Snow | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2012-03-30 18:00:00 | 2012 | 3 | 30 | 18:00 | 3.9 | -7.9 | 42 | 11 | 24.1 | 101.26 | Mostly Cloudy | |
2012-03-30 19:00:00 | 2012 | 3 | 30 | 19:00 | 3.1 | -6.7 | 49 | 7 | 25.0 | 101.29 | Mostly Cloudy | |
2012-03-30 20:00:00 | 2012 | 3 | 30 | 20:00 | 3.0 | -8.4 | 43 | 7 | 25.0 | 101.30 | Mostly Cloudy | |
2012-03-30 21:00:00 | 2012 | 3 | 30 | 21:00 | 1.7 | -9.0 | 45 | 4 | 25.0 | 101.32 | Cloudy | |
2012-03-30 22:00:00 | 2012 | 3 | 30 | 22:00 | 0.4 | -8.1 | 53 | 0 | 25.0 | 101.30 | Mostly Cloudy | |
2012-03-30 23:00:00 | 2012 | 3 | 30 | 23:00 | 1.4 | -7.7 | 51 | 6 | 25.0 | 101.34 | Mainly Clear | |
2012-03-31 00:00:00 | 2012 | 3 | 31 | 00:00 | 1.5 | -8.6 | 47 | 13 | 25.0 | 101.33 | Mostly Cloudy | |
2012-03-31 01:00:00 | 2012 | 3 | 31 | 01:00 | 1.3 | -9.6 | 44 | 13 | 25.0 | 101.31 | Mostly Cloudy | |
2012-03-31 02:00:00 | 2012 | 3 | 31 | 02:00 | 1.3 | -9.7 | 44 | 11 | 25.0 | 101.29 | Cloudy | |
2012-03-31 03:00:00 | 2012 | 3 | 31 | 03:00 | 0.7 | -8.8 | 49 | 13 | 25.0 | 101.30 | Cloudy | |
2012-03-31 04:00:00 | 2012 | 3 | 31 | 04:00 | -0.9 | -8.5 | 56 | 13 | 25.0 | 101.32 | Cloudy | |
2012-03-31 05:00:00 | 2012 | 3 | 31 | 05:00 | -0.6 | -9.2 | 52 | 13 | 25.0 | 101.30 | Cloudy | |
2012-03-31 06:00:00 | 2012 | 3 | 31 | 06:00 | -0.5 | -9.2 | 52 | 15 | 48.3 | 101.32 | Cloudy | |
2012-03-31 07:00:00 | 2012 | 3 | 31 | 07:00 | -0.3 | -9.2 | 51 | 19 | 48.3 | 101.32 | Cloudy | |
2012-03-31 08:00:00 | 2012 | 3 | 31 | 08:00 | 0.7 | -8.5 | 50 | 17 | 48.3 | 101.33 | Cloudy | |
2012-03-31 09:00:00 | 2012 | 3 | 31 | 09:00 | 1.5 | -7.8 | 50 | 17 | 48.3 | 101.34 | Mostly Cloudy | |
2012-03-31 10:00:00 | 2012 | 3 | 31 | 10:00 | 2.9 | -8.1 | 44 | 15 | 48.3 | 101.30 | Mainly Clear | |
2012-03-31 11:00:00 | 2012 | 3 | 31 | 11:00 | 4.6 | -9.7 | 35 | 7 | 48.3 | 101.24 | Clear | |
2012-03-31 12:00:00 | 2012 | 3 | 31 | 12:00 | 6.4 | -7.1 | 37 | 11 | 48.3 | 101.16 | Clear | |
2012-03-31 13:00:00 | 2012 | 3 | 31 | 13:00 | 6.5 | -9.7 | 30 | 9 | 48.3 | 101.08 | Clear | |
2012-03-31 14:00:00 | 2012 | 3 | 31 | 14:00 | 7.7 | -8.5 | 31 | 4 | 48.3 | 101.01 | Mainly Clear | |
2012-03-31 15:00:00 | 2012 | 3 | 31 | 15:00 | 7.7 | -8.6 | 30 | 6 | 48.3 | 100.94 | Mainly Clear | |
2012-03-31 16:00:00 | 2012 | 3 | 31 | 16:00 | 8.4 | -7.7 | 31 | 4 | 48.3 | 100.89 | Mainly Clear | |
2012-03-31 17:00:00 | 2012 | 3 | 31 | 17:00 | 7.9 | -8.1 | 31 | 6 | 48.3 | 100.88 | Mainly Clear | |
2012-03-31 18:00:00 | 2012 | 3 | 31 | 18:00 | 7.0 | -8.2 | 33 | 7 | 48.3 | 100.87 | Mainly Clear | |
2012-03-31 19:00:00 | 2012 | 3 | 31 | 19:00 | 5.9 | -8.0 | 36 | 4 | 25.0 | 100.88 | Clear | |
2012-03-31 20:00:00 | 2012 | 3 | 31 | 20:00 | 4.4 | -7.2 | 43 | 9 | 25.0 | 100.85 | Clear | |
2012-03-31 21:00:00 | 2012 | 3 | 31 | 21:00 | 2.6 | -6.3 | 52 | 7 | 25.0 | 100.86 | Clear | |
2012-03-31 22:00:00 | 2012 | 3 | 31 | 22:00 | 2.7 | -6.7 | 50 | 0 | 25.0 | 100.82 | Clear | |
2012-03-31 23:00:00 | 2012 | 3 | 31 | 23:00 | 1.5 | -6.9 | 54 | 0 | 25.0 | 100.79 | Clear |
744 rows × 12 columns
We managed to clean up some of the data for March 2012. That's great, but we are also interested in the entire year's data.
## Pandas cookbook
def download_weather_month(year, month):
    if month == 1:
        #January data is filed under the following year (note the 2013-01-31 entry in the monthly output below)
        year += 1
    url = url_template.format(year=year, month=month)
    weather_data = pd.read_csv(url, skiprows=16, index_col='Date/Time', parse_dates=True)
    weather_data = weather_data.dropna(axis=1)
    #strip the latin-1 degree symbol from column names
    weather_data.columns = [col.replace('\xb0', '') for col in weather_data.columns]
    weather_data = weather_data.drop(['Year', 'Day', 'Month', 'Time', 'Data Quality'], axis=1)
    return weather_data
data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]
#Saving to a csv
weather_2012 = pd.concat(data_by_month)
weather_2012.to_csv('weather_2012.csv')
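As a sanity check (hypothetical, not part of the original run), the saved file can be read straight back with the same index settings:

weather_2012 = pd.read_csv('weather_2012.csv', index_col='Date/Time', parse_dates=True)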
pandas provides vectorized string functions to make it easy to operate on columns containing text.
weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
is_snowing.plot()
<matplotlib.axes.AxesSubplot at 0x1101e9850>
Let's now try to find the month in which it snowed the most, so that your company can stock extra down jackets for that month.
weather_2012['Temp (C)'].resample('M', how=np.median).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x110c540d0>
is_snowing.astype(float).resample('M', how=np.mean)
Date/Time
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
2013-01-31    0.197581
Freq: M, Name: Weather, dtype: float64
is_snowing.astype(float).resample('M', how=np.mean).plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x110bb17d0>
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print df
#df.plot(kind='bar')
#df.plot(kind='bar', stacked=True)
#df.plot(kind='barh', stacked=True)
#print pd.__version__
          a         b         c         d
0  0.025943  0.770927  0.248177  0.814551
1  0.288738  0.097600  0.760169  0.612650
2  0.573980  0.047479  0.563015  0.187549
3  0.772887  0.152365  0.956512  0.201353
4  0.205687  0.157111  0.307448  0.514710
5  0.387677  0.512868  0.302177  0.391439
6  0.106364  0.453050  0.390067  0.932892
7  0.588174  0.528449  0.281689  0.425967
8  0.645613  0.104521  0.660080  0.838605
9  0.200220  0.986742  0.184672  0.943926
from pandas.tools.plotting import scatter_matrix
df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])
scatter_matrix(df, figsize=(7, 7), diagonal='kde')
array([[<matplotlib.axes.AxesSubplot object at 0x11147a0d0>, <matplotlib.axes.AxesSubplot object at 0x1114af6d0>, <matplotlib.axes.AxesSubplot object at 0x111620510>, <matplotlib.axes.AxesSubplot object at 0x11163dd90>], [<matplotlib.axes.AxesSubplot object at 0x1117a3610>, <matplotlib.axes.AxesSubplot object at 0x111648450>, <matplotlib.axes.AxesSubplot object at 0x111666690>, <matplotlib.axes.AxesSubplot object at 0x111689d10>], [<matplotlib.axes.AxesSubplot object at 0x11164c890>, <matplotlib.axes.AxesSubplot object at 0x112420290>, <matplotlib.axes.AxesSubplot object at 0x1124419d0>, <matplotlib.axes.AxesSubplot object at 0x11245e650>], [<matplotlib.axes.AxesSubplot object at 0x112485210>, <matplotlib.axes.AxesSubplot object at 0x11246abd0>, <matplotlib.axes.AxesSubplot object at 0x1124c7bd0>, <matplotlib.axes.AxesSubplot object at 0x112d1d0d0>]], dtype=object)
from pandas import read_csv
from urllib import urlopen
from pandas.tools.plotting import andrews_curves
page = urlopen("https://raw.githubusercontent.com/pydata/pandas/master/pandas/tests/data/iris.csv")
df = read_csv(page)
andrews_curves(df, 'Name')
from pandas.tools.plotting import parallel_coordinates
#parallel_coordinates(df,'Name')
from pandas.tools.plotting import lag_plot
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))
lag_plot(data)
<matplotlib.axes.AxesSubplot at 0x113ab0d10>
Neal Davis and Lakshmi Rao developed these materials for Computational Science and Engineering at the University of Illinois at Urbana–Champaign.
This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).