Tutorial Brief¶

Pandas is powerful and easy-to-use library for data analysis. Is has two main object to represents data: Series and DataFrame.

Finding Help:

	NumPy Base N-dimensional array package		SciPy Fundamental library for scientific computing		Matplotlib Comprehensive 2D Plotting
	IPython Enhanced Interactive Console		SymPy Symbolic mathematics		Pandas Data structures & analysis

Import libraries¶

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Working with Series¶

Series is an array like object.

pd.Series(self, data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)¶

In [2]:

x = pd.Series([1,2,3,4,5])
x

Out[2]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Notice that generated an index for your item

Basic Operation¶

In [3]:

x + 100

Out[3]:

0    101
1    102
2    103
3    104
4    105
dtype: int64

In [4]:

(x ** 2) + 100

Out[4]:

0    101
1    104
2    109
3    116
4    125
dtype: int64

In [5]:

x > 2

Out[5]:

0    False
1    False
2     True
3     True
4     True
dtype: bool

`any()` and `all()`¶

In [6]:

larger_than_2 = x > 2
larger_than_2

Out[6]:

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [7]:

larger_than_2.any()

Out[7]:

True

In [8]:

larger_than_2.all()

Out[8]:

False

`apply()`¶

In [9]:

def f(x):
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3

x.apply(f)

Out[9]:

0     3
1     4
2     9
3     8
4    15
dtype: int64

Avoid looping over your data

This is a %%timeit results from apply() and a for loop.

In [10]:

%%timeit

ds = pd.Series(range(10000))

for counter in range(len(ds)):
    ds[counter] = f(ds[counter])

1 loops, best of 3: 241 ms per loop

In [11]:

%%timeit

ds = pd.Series(range(10000))

ds = ds.apply(f)

10 loops, best of 3: 40 ms per loop

`astype()`¶

In [12]:

x.astype(np.float64)

Out[12]:

0    1
1    2
2    3
3    4
4    5
dtype: float64

`copy()`¶

In [13]:

y = x

In [14]:

y[0]

Out[14]:

In [15]:

y[0] = 100

In [16]:

Out[16]:

0    100
1      2
2      3
3      4
4      5
dtype: int64

In [17]:

Out[17]:

0    100
1      2
2      3
3      4
4      5
dtype: int64

Avoid using copy (is you can) to save memory

In [18]:

y = x.copy()

In [19]:

x[0]=1

In [20]:

Out[20]:

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [21]:

Out[21]:

0    100
1      2
2      3
3      4
4      5
dtype: int64

In [22]:

x.describe(percentile_width=50)

Out[22]:

count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64

DataFrame¶

pd.DataFrame(self, data=None, index=None, columns=None, dtype=None, copy=False)¶

In [23]:

data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])

In [24]:

df

Out[24]:

	x
0	1
1	2
2	3
3	4
4	5
5	6
6	7
7	8
8	9

9 rows × 1 columns

Selecting Data¶

In [25]:

df["x"]

Out[25]:

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64

In [26]:

df["x"][0]

Out[26]:

Adding extra columns¶

In [27]:

df["x_plus_2"] = df["x"] + 2
df

Out[27]:

	x	x_plus_2
0	1	3
1	2	4
2	3	5
3	4	6
4	5	7
5	6	8
6	7	9
7	8	10
8	9	11

9 rows × 2 columns

In [28]:

df["x_square"] = df["x"] ** 2
df["x_factorial"] = df["x"].apply(np.math.factorial)
df

Out[28]:

	x	x_plus_2	x_square	x_factorial
0	1	3	1	1
1	2	4	4	2
2	3	5	9	6
3	4	6	16	24
4	5	7	25	120
5	6	8	36	720
6	7	9	49	5040
7	8	10	64	40320
8	9	11	81	362880

9 rows × 4 columns

In [29]:

df["is_even"] = df["x"] % 2
df

Out[29]:

	x	x_plus_2	x_square	x_factorial	is_even
0	1	3	1	1	1
1	2	4	4	2	0
2	3	5	9	6	1
3	4	6	16	24	0
4	5	7	25	120	1
5	6	8	36	720	0
6	7	9	49	5040	1
7	8	10	64	40320	0
8	9	11	81	362880	1

9 rows × 5 columns

`map()`¶

In [30]:

df["odd_even"] = df["is_even"].map({1:"odd", 0:"even"})
df

Out[30]:

	x	x_plus_2	x_square	x_factorial	is_even	odd_even
0	1	3	1	1	1	odd
1	2	4	4	2	0	even
2	3	5	9	6	1	odd
3	4	6	16	24	0	even
4	5	7	25	120	1	odd
5	6	8	36	720	0	even
6	7	9	49	5040	1	odd
7	8	10	64	40320	0	even
8	9	11	81	362880	1	odd

9 rows × 6 columns

`drop()`¶

In [31]:

df = df.drop("is_even", 1)
df

Out[31]:

	x	x_plus_2	x_square	x_factorial	odd_even
0	1	3	1	1	odd
1	2	4	4	2	even
2	3	5	9	6	odd
3	4	6	16	24	even
4	5	7	25	120	odd
5	6	8	36	720	even
6	7	9	49	5040	odd
7	8	10	64	40320	even
8	9	11	81	362880	odd

9 rows × 5 columns

Multi Column Select¶

In [32]:

df[["x", "odd_even"]]

Out[32]:

	x	odd_even
0	1	odd
1	2	even
2	3	odd
3	4	even
4	5	odd
5	6	even
6	7	odd
7	8	even
8	9	odd

9 rows × 2 columns

Controlling display options¶

In [33]:

pd.options.display.max_columns= 60
pd.options.display.max_rows= 6
pd.options.display.notebook_repr_html = False
df

Out[33]:

   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
4  5         7        25          120      odd
5  6         8        36          720     even
  ..       ...       ...          ...      ...

[9 rows x 5 columns]

Filtering¶

In [34]:

df[df["odd_even"] == "odd"]

Out[34]:

   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
2  3         5         9            6      odd
4  5         7        25          120      odd
6  7         9        49         5040      odd
8  9        11        81       362880      odd

[5 rows x 5 columns]

In [35]:

df[df.odd_even == "even"]

Out[35]:

   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[4 rows x 5 columns]

Chaining Filters¶

`|` OR¶

In [36]:

df[(df.odd_even == "even") | (df.x_square < 20)]

Out[36]:

   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[6 rows x 5 columns]

`&` AND¶

In [37]:

df[(df.odd_even == "even") & (df.x_square < 20)]

Out[37]:

   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even

[2 rows x 5 columns]

Furter Chaining¶

In [38]:

df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]

Out[38]:

1    4
Name: x_plus_2, dtype: int64

`scatter_matrix()`¶

In [39]:

pd.scatter_matrix(df, diagonal="kde", figsize=(10,10));

In [40]:

df.describe()

Out[40]:

              x  x_plus_2   x_square    x_factorial
count  9.000000  9.000000   9.000000       9.000000
mean   5.000000  7.000000  31.666667   45457.000000
std    2.738613  2.738613  28.080242  119758.341137
min    1.000000  3.000000   1.000000       1.000000
25%    3.000000  5.000000   9.000000       6.000000
50%    5.000000  7.000000  25.000000     120.000000
            ...       ...        ...            ...

[8 rows x 4 columns]

Reading Data from CSV/TSV Files¶

In [41]:

url = "http://www.google.com/finance/historical?q=TADAWUL:TASI&output=csv"
stocks_data = pd.read_csv(url)

In [42]:

stocks_data

Out[42]:

       Date      Open      High       Low     Close     Volume
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538
         ...       ...       ...       ...       ...        ...

[241 rows x 6 columns]

In [43]:

stocks_data["change_amount"] = stocks_data["Close"] - stocks_data["Open"]
stocks_data["change_percentage"] = stocks_data["change_amount"] / stocks_data["Close"]
stocks_data

Out[43]:

       Date      Open      High       Low     Close     Volume  \
0  11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714   
1  10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735   
2   7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194   
3   6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941   
4   5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475   
5   4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538   
         ...       ...       ...       ...       ...        ...   

   change_amount  change_percentage  
0          17.43           0.001645  
1          26.64           0.002518  
2          74.14           0.007026  
3          27.82           0.002655  
4          44.71           0.004278  
5         102.93           0.009892  
             ...                ...  

[241 rows x 8 columns]

Tutorial Brief¶

NumPy

SciPy

Matplotlib

IPython

SymPy

Pandas