Pandas is a powerful and easy-to-use library for data analysis. It has two main objects to represent data: Series and DataFrame.
Finding Help:
NumPy: Base N-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D plotting
IPython: Enhanced interactive console
SymPy: Symbolic mathematics
Pandas: Data structures & analysis
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
A Series is an array-like object.
x = pd.Series([1,2,3,4,5])
x
0    1
1    2
2    3
3    4
4    5
dtype: int64
Notice that pandas generated an index for your items.
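The generated index is just the default. As a minimal sketch (the labels "a", "b", and "c" are made up for illustration and do not appear in the original), you can also pass your own index when creating a Series:

```python
import pandas as pd

# Default: pandas assigns the integer labels 0..n-1
default_indexed = pd.Series([1, 2, 3])

# Assumption for illustration: hypothetical string labels
labeled = pd.Series([1, 2, 3], index=["a", "b", "c"])

print(list(default_indexed.index))  # [0, 1, 2]
print(labeled["b"])                 # 2
```

Items are then looked up by label instead of by position.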
x + 100
0    101
1    102
2    103
3    104
4    105
dtype: int64
(x ** 2) + 100
0    101
1    104
2    109
3    116
4    125
dtype: int64
x > 2
0    False
1    False
2     True
3     True
4     True
dtype: bool
any() and all()
larger_than_2 = x > 2
larger_than_2
0    False
1    False
2     True
3     True
4     True
dtype: bool
larger_than_2.any()
True
larger_than_2.all()
False
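A boolean Series like this one can also be used to select items. The following is a small sketch of that idea; the variable name selected is introduced here for illustration only:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
larger_than_2 = x > 2

# Indexing with a boolean Series keeps the items where the mask is True
selected = x[larger_than_2]
print(selected.tolist())      # [3, 4, 5]

# sum() counts the True values, because True behaves like 1
print(larger_than_2.sum())    # 3
```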
apply()
def f(x):
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3
x.apply(f)
0     3
1     4
2     9
3     8
4    15
dtype: int64
Avoid looping over your data
These are %%timeit results for a for loop and for apply().
%%timeit
ds = pd.Series(range(10000))
for counter in range(len(ds)):
    ds[counter] = f(ds[counter])
1 loops, best of 3: 241 ms per loop
%%timeit
ds = pd.Series(range(10000))
ds = ds.apply(f)
10 loops, best of 3: 40 ms per loop
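When the rule can be written as whole-column arithmetic, you may be able to skip the per-item function call as well. The sketch below swaps in np.where as one possible vectorized alternative; it was not part of the timings above, so no speed figure is claimed here:

```python
import numpy as np
import pandas as pd

def f(x):
    # Same rule as above: double even numbers, triple odd ones
    if x % 2 == 0:
        return x * 2
    else:
        return x * 3

ds = pd.Series(range(10000))

# np.where evaluates the condition on the whole Series at once
vectorized = pd.Series(np.where(ds % 2 == 0, ds * 2, ds * 3), index=ds.index)

# Both approaches produce the same values
print(vectorized.tolist() == ds.apply(f).tolist())  # True
```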
astype()
x.astype(np.float64)
0    1
1    2
2    3
3    4
4    5
dtype: float64
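Note that astype() returns a new Series and leaves the original untouched. A minimal sketch (the names as_float and as_text are introduced here for illustration):

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
original_dtype = x.dtype

as_float = x.astype(np.float64)  # integers become floats
as_text = x.astype(str)          # integers become strings

print(as_float.dtype)            # float64
print(as_text[0] + "!")          # 1!
print(x.dtype == original_dtype) # True, x itself was not converted
```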
copy()
y = x
y[0]
1
y[0] = 100
y
0    100
1      2
2      3
3      4
4      5
dtype: int64
x
0    100
1      2
2      3
3      4
4      5
dtype: int64
Avoid using copy() (if you can) to save memory.
y = x.copy()
x[0]=1
x
0    1
1    2
2    3
3    4
4    5
dtype: int64
y
0    100
1      2
2      3
3      4
4      5
dtype: int64
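The difference between the two cases is that plain assignment only creates a second name for the same object, while copy() creates a new one. A short sketch that makes this visible with Python's is operator (the names alias and independent are chosen here for illustration):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

alias = x               # a second name for the same object
independent = x.copy()  # a new object with its own data

print(alias is x)        # True
print(independent is x)  # False

x[0] = 100
print(alias[0])          # 100
print(independent[0])    # 1
```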
# percentile_width is no longer accepted by current pandas; the default reports the 25%, 50%, and 75% percentiles
x.describe()
count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64
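If you need other percentiles, describe() accepts a percentiles argument. A minimal sketch, assuming you want the 10% and 90% marks (these two values are picked for illustration only):

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# percentiles are given as fractions between 0 and 1
summary = x.describe(percentiles=[0.1, 0.9])

print(summary["10%"])
print(summary["90%"])
print(summary["mean"])  # 3.0
```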
data = [1,2,3,4,5,6,7,8,9]
df = pd.DataFrame(data, columns=["x"])
df
| | x |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
4 | 5 |
5 | 6 |
6 | 7 |
7 | 8 |
8 | 9 |
9 rows × 1 columns
df["x"]
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
Name: x, dtype: int64
df["x"][0]
1
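The expression above first selects the column and then the item. As a sketch of an alternative, loc and iloc address the row and the column in a single step (the variable names are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9], columns=["x"])

by_label = df.loc[0, "x"]     # row label 0, column "x"
by_position = df.iloc[0, 0]   # first row, first column

print(by_label, by_position)  # 1 1
```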
df["x_plus_2"] = df["x"] + 2
df
| | x | x_plus_2 |
---|---|---|
0 | 1 | 3 |
1 | 2 | 4 |
2 | 3 | 5 |
3 | 4 | 6 |
4 | 5 | 7 |
5 | 6 | 8 |
6 | 7 | 9 |
7 | 8 | 10 |
8 | 9 | 11 |
9 rows × 2 columns
df["x_square"] = df["x"] ** 2
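import math  # np.math is not available in recent NumPy releases; the standard library works the same way here
df["x_factorial"] = df["x"].apply(math.factorial)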
df
| | x | x_plus_2 | x_square | x_factorial |
---|---|---|---|---|
0 | 1 | 3 | 1 | 1 |
1 | 2 | 4 | 4 | 2 |
2 | 3 | 5 | 9 | 6 |
3 | 4 | 6 | 16 | 24 |
4 | 5 | 7 | 25 | 120 |
5 | 6 | 8 | 36 | 720 |
6 | 7 | 9 | 49 | 5040 |
7 | 8 | 10 | 64 | 40320 |
8 | 9 | 11 | 81 | 362880 |
9 rows × 4 columns
df["is_even"] = df["x"] % 2
df
| | x | x_plus_2 | x_square | x_factorial | is_even |
---|---|---|---|---|---|
0 | 1 | 3 | 1 | 1 | 1 |
1 | 2 | 4 | 4 | 2 | 0 |
2 | 3 | 5 | 9 | 6 | 1 |
3 | 4 | 6 | 16 | 24 | 0 |
4 | 5 | 7 | 25 | 120 | 1 |
5 | 6 | 8 | 36 | 720 | 0 |
6 | 7 | 9 | 49 | 5040 | 1 |
7 | 8 | 10 | 64 | 40320 | 0 |
8 | 9 | 11 | 81 | 362880 | 1 |
9 rows × 5 columns
map()
df["odd_even"] = df["is_even"].map({1:"odd", 0:"even"})
df
| | x | x_plus_2 | x_square | x_factorial | is_even | odd_even |
---|---|---|---|---|---|---|
0 | 1 | 3 | 1 | 1 | 1 | odd |
1 | 2 | 4 | 4 | 2 | 0 | even |
2 | 3 | 5 | 9 | 6 | 1 | odd |
3 | 4 | 6 | 16 | 24 | 0 | even |
4 | 5 | 7 | 25 | 120 | 1 | odd |
5 | 6 | 8 | 36 | 720 | 0 | even |
6 | 7 | 9 | 49 | 5040 | 1 | odd |
7 | 8 | 10 | 64 | 40320 | 0 | even |
8 | 9 | 11 | 81 | 362880 | 1 | odd |
9 rows × 6 columns
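One thing to keep in mind when mapping with a dictionary: values that have no matching key come back as missing (NaN). A small sketch on a hypothetical Series of remainders:

```python
import pandas as pd

# Assumption for illustration: a short Series of remainders
remainders = pd.Series([1, 0, 1, 0])

complete = remainders.map({1: "odd", 0: "even"})
partial = remainders.map({1: "odd"})  # no entry for 0

print(complete.tolist())        # ['odd', 'even', 'odd', 'even']
print(partial.isna().tolist())  # [False, True, False, True]
```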
drop()
df = df.drop("is_even", axis=1)
df
| | x | x_plus_2 | x_square | x_factorial | odd_even |
---|---|---|---|---|---|
0 | 1 | 3 | 1 | 1 | odd |
1 | 2 | 4 | 4 | 2 | even |
2 | 3 | 5 | 9 | 6 | odd |
3 | 4 | 6 | 16 | 24 | even |
4 | 5 | 7 | 25 | 120 | odd |
5 | 6 | 8 | 36 | 720 | even |
6 | 7 | 9 | 49 | 5040 | odd |
7 | 8 | 10 | 64 | 40320 | even |
8 | 9 | 11 | 81 | 362880 | odd |
9 rows × 5 columns
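drop() returns a new DataFrame, which is why the result is assigned back to df above. The sketch below shows this on a small made-up DataFrame and uses the columns keyword, an equivalent way to name what should be dropped:

```python
import pandas as pd

# Assumption for illustration: a tiny DataFrame with the same column names
df = pd.DataFrame({"x": [1, 2, 3], "is_even": [1, 0, 1]})

trimmed = df.drop(columns=["is_even"])

print(list(trimmed.columns))  # ['x']
print(list(df.columns))       # ['x', 'is_even']
```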
df[["x", "odd_even"]]
| | x | odd_even |
---|---|---|
0 | 1 | odd |
1 | 2 | even |
2 | 3 | odd |
3 | 4 | even |
4 | 5 | odd |
5 | 6 | even |
6 | 7 | odd |
7 | 8 | even |
8 | 9 | odd |
9 rows × 2 columns
pd.options.display.max_columns = 60
pd.options.display.max_rows = 6
pd.options.display.notebook_repr_html = False
df
    x  x_plus_2  x_square  x_factorial odd_even
0   1         3         1            1      odd
1   2         4         4            2     even
2   3         5         9            6      odd
3   4         6        16           24     even
4   5         7        25          120      odd
5   6         8        36          720     even
..        ...       ...          ...      ...

[9 rows x 5 columns]
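Display options set through pd.options stay in effect for the rest of the session. If you only want them for one block of code, pd.option_context is one way to do it. A minimal sketch on a hypothetical 100-row DataFrame:

```python
import pandas as pd

# Assumption for illustration: a DataFrame long enough to be truncated
df = pd.DataFrame({"x": range(100)})

rows_before = pd.get_option("display.max_rows")

# The option applies only inside the with block
with pd.option_context("display.max_rows", 6):
    short_text = repr(df)

print(short_text)
print(pd.get_option("display.max_rows") == rows_before)  # True
```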
df[df["odd_even"] == "odd"]
   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
2  3         5         9            6      odd
4  5         7        25          120      odd
6  7         9        49         5040      odd
8  9        11        81       362880      odd

[5 rows x 5 columns]
df[df.odd_even == "even"]
   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[4 rows x 5 columns]
| OR
df[(df.odd_even == "even") | (df.x_square < 20)]
   x  x_plus_2  x_square  x_factorial odd_even
0  1         3         1            1      odd
1  2         4         4            2     even
2  3         5         9            6      odd
3  4         6        16           24     even
5  6         8        36          720     even
7  8        10        64        40320     even

[6 rows x 5 columns]
& AND
df[(df.odd_even == "even") & (df.x_square < 20)]
   x  x_plus_2  x_square  x_factorial odd_even
1  2         4         4            2     even
3  4         6        16           24     even

[2 rows x 5 columns]
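Besides | and &, a condition can be negated with ~. The sketch below rebuilds a reduced version of the DataFrame so it runs on its own; the names is_even and is_small are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_square"] = df["x"] ** 2
df["odd_even"] = (df["x"] % 2).map({1: "odd", 0: "even"})

# Naming the conditions keeps the combined expression readable
is_even = df.odd_even == "even"
is_small = df.x_square < 20

# ~ negates a condition: rows that are NOT even and have a small square
not_even_and_small = df[~is_even & is_small]
print(not_even_and_small["x"].tolist())  # [1, 3]
```

When the comparisons are written inline, keep the parentheses around each one, as in the examples above, because | and & bind more tightly than == and <.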
df[(df.odd_even == "even") & (df.x_square < 20)]["x_plus_2"][:1]
1    4
Name: x_plus_2, dtype: int64
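The same selection can be written with loc, which takes the row condition and the column name together. This is a sketch of an alternative spelling, not a replacement for the expression above; the DataFrame is rebuilt in reduced form so the example runs on its own:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1, 10)})
df["x_plus_2"] = df["x"] + 2
df["x_square"] = df["x"] ** 2
df["odd_even"] = (df["x"] % 2).map({1: "odd", 0: "even"})

mask = (df.odd_even == "even") & (df.x_square < 20)

# Rows where mask is True, column "x_plus_2", then the first match
first_match = df.loc[mask, "x_plus_2"].iloc[:1]
print(first_match.tolist())  # [4]
```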
scatter_matrix()
pd.plotting.scatter_matrix(df, diagonal="kde", figsize=(10,10));
df.describe()
              x  x_plus_2   x_square    x_factorial
count  9.000000  9.000000   9.000000       9.000000
mean   5.000000  7.000000  31.666667   45457.000000
std    2.738613  2.738613  28.080242  119758.341137
min    1.000000  3.000000   1.000000       1.000000
25%    3.000000  5.000000   9.000000       6.000000
50%    5.000000  7.000000  25.000000     120.000000
            ...       ...        ...            ...

[8 rows x 4 columns]
url = "http://www.google.com/finance/historical?q=TADAWUL:TASI&output=csv"
stocks_data = pd.read_csv(url)
stocks_data
         Date      Open      High       Low     Close     Volume
0   11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714
1   10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735
2    7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194
3    6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941
4    5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475
5    4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538
          ...       ...       ...       ...       ...

[241 rows x 6 columns]
stocks_data["change_amount"] = stocks_data["Close"] - stocks_data["Open"]
stocks_data["change_percentage"] = stocks_data["change_amount"] / stocks_data["Close"]
stocks_data
         Date      Open      High       Low     Close     Volume  \
0   11-Aug-14  10579.12  10603.30  10547.21  10596.55  197234714
1   10-Aug-14  10552.48  10614.11  10551.77  10579.12  199773735
2    7-Aug-14  10478.34  10585.38  10478.34  10552.48  202329194
3    6-Aug-14  10450.52  10494.12  10398.25  10478.34  192868941
4    5-Aug-14  10405.81  10501.38  10405.81  10450.52  287651475
5    4-Aug-14  10302.88  10409.47  10290.95  10405.81  223099538
          ...       ...       ...       ...       ...

    change_amount  change_percentage
0           17.43           0.001645
1           26.64           0.002518
2           74.14           0.007026
3           27.82           0.002655
4           44.71           0.004278
5          102.93           0.009892
              ...                ...

[241 rows x 8 columns]
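The Date column is read in as plain text, and the rows arrive newest first. The sketch below shows one way to parse the dates and sort them. It uses the first three rows of the output above, typed in as inline CSV so that it runs without a network connection; the name sample is introduced here for illustration:

```python
import io
import pandas as pd

# First three rows of the output above, as inline CSV
csv_text = """Date,Open,High,Low,Close,Volume
11-Aug-14,10579.12,10603.30,10547.21,10596.55,197234714
10-Aug-14,10552.48,10614.11,10551.77,10579.12,199773735
7-Aug-14,10478.34,10585.38,10478.34,10552.48,202329194
"""

sample = pd.read_csv(io.StringIO(csv_text))

# Turn strings such as "11-Aug-14" into real timestamps
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%b-%y")

# Oldest row first
sample = sample.sort_values("Date").reset_index(drop=True)

sample["change_amount"] = sample["Close"] - sample["Open"]
print(sample[["Date", "change_amount"]])
```

With real timestamps in place, the rows can be sorted and compared chronologically.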