If necessary, from our lab03 submission directory, type:
ipython notebook
From the IPython Dashboard, open a new notebook and change its title to "Numpy and Pandas".
from numpy import *  # Load every public NumPy name into the namespace
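`from numpy import *` copies every public NumPy name into the current namespace. That makes it hard to tell where a function comes from and can shadow names that are already defined, so avoid it in deployed code. We use it here as an example, and it's fine for learning purposes and for keeping short snippets legible.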
As we'll see later, the convention is to use:
import numpy as np
And then to call the needed functions explicitly through the np prefix, for example np.zeros(3) or np.arange(0, 11, 1).
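As a minimal sketch of that convention, the array-creation calls used in this lab can be written with the `np` prefix. The variable names simply mirror the ones used later in the notebook.

```python
import numpy as np

# Every call goes through the np prefix instead of a bare name.
a = np.zeros(3)            # 1-D array of three zeros
b = np.ones((2, 3))        # 2x3 array of ones
d = np.arange(0, 11, 1)    # integers 0 through 10 (the stop value 11 is excluded)

print(a.shape, b.shape, d.shape)
```

With the prefix, it is always clear which names come from NumPy and which are Python built-ins.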
An array object represents a multidimensional, homogeneous array of fixed-size items.
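A small illustration of "homogeneous" may help here: every element of an array shares one dtype, so mixed inputs are converted to a common type. The variable name `mixed` is only for illustration.

```python
import numpy as np

# An int and two floats: NumPy stores all three as one common dtype.
mixed = np.array([1, 2.5, 3.0])

print(mixed.dtype)   # the single dtype shared by every element
print(mixed.shape)   # the size along each dimension
print(mixed.ndim)    # the number of dimensions
```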
# Creating arrays
a = zeros((3))
b = ones((2,3))
c = random.randint(1,10,(2,3,4))
d = arange(0,11,1)
What are these functions?
arange?
# Note the way each array is printed:
a,b,c,d
## Arithmetic in arrays is element-wise
>>> a = array( [20,30,40,50] )
>>> b = arange( 4 )
>>> b
>>> c = a-b
>>> c
>>> b**2
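For reference, here is the same session as a plain script with the values each expression should produce noted in comments. It uses the `np` prefix rather than the star import; the name `squares` is introduced only for this sketch.

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.arange(4)       # array([0, 1, 2, 3])

c = a - b              # subtracts position by position: array([20, 29, 38, 47])
squares = b ** 2       # squares each element: array([0, 1, 4, 9])

print(c)
print(squares)
```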
# one-dimensional arrays work like lists:
a = arange(10)**2
a
a[2:5]
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0
b = random.randint(1,100,(4,4))
b
# Guess the output
print(b[2,3])
print(b[0,0])
b[0:3,1],b[:,1]
b[1:3,:]
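Because the array above is random, the answers change on every run. A sketch with a fixed array (here called `grid`, a name chosen for this example) makes the (row, column) indexing easier to check by hand.

```python
import numpy as np

# Rows are [0 1 2 3], [4 5 6 7], [8 9 10 11], [12 13 14 15].
grid = np.arange(16).reshape(4, 4)

print(grid[2, 3])     # row 2, column 3 -> 11
print(grid[0, 0])     # top-left element -> 0
print(grid[0:3, 1])   # column 1 of rows 0..2 -> [1 5 9]
print(grid[:, 1])     # all of column 1 -> [ 1  5  9 13]
print(grid[1:3, :])   # rows 1 and 2, every column
```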
Source: pandas.pydata.org
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = pd.date_range('20140101',periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
# Index, columns, underlying numpy data
df
df2 = pd.DataFrame({ 'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : 'foo' })
df2
# With specific dtypes
df2.dtypes
df.head()
df.tail()
df.index
df.describe()
df.sort_values(by='B')  # DataFrame.sort was removed from pandas; sort_values replaces it
df['A']
df[0:3]
# By label
df.loc[dates[0]]
# multi-axis by label
df.loc[:,['A','B']]
# Date Range
df.loc['20140102':'20140104',['B']]
# Fast access to scalar
df.at[dates[1],'B']
# iloc provides integer locations similar to np style
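A minimal sketch of position-based selection with `iloc`, using a small frame with known values instead of random data. The name `frame` and its contents are assumptions made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20140101', periods=6)
# Values 0..23 laid out row by row, so each result can be checked by hand.
frame = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

row3 = frame.iloc[3]          # fourth row as a Series: A=12, B=13, C=14, D=15
block = frame.iloc[3:5, 0:2]  # rows 3-4 and columns A-B, end points excluded
cell = frame.iloc[1, 1]       # single value at row 1, column 1 -> 5

print(row3)
print(block)
print(cell)
```

Unlike `loc`, which slices by label and includes both end points, `iloc` follows the usual Python and NumPy half-open slicing.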
df[df.A < 0] # Basically a 'where' operation
df_posA = df.copy() # Without copy(), the assignment below would modify df itself
df_posA[df_posA.A < 0] = -1*df_posA
df_posA
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))
s1
df['F'] = s1
df
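The alignment is easier to see with fixed numbers. In this sketch (the names `frame` and `s` are illustrative), the Series starts one day later than the frame, so the first row has no matching label and the Series' last value has nowhere to go.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20140101', periods=6)            # Jan 1 .. Jan 6
frame = pd.DataFrame({'A': np.arange(6.0)}, index=dates)

# The Series covers Jan 2 .. Jan 7, shifted by one day.
s = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20140102', periods=6))

frame['F'] = s   # values are matched on the date label, not on position
print(frame)     # Jan 1 gets NaN; the Jan 7 value is dropped
```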
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1
# find where values are null
pd.isnull(df1)
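Once the null mask is available, a common next step is to count, drop, or fill the missing values. The sketch below uses a small hand-made frame; the fill value 5 is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                      'E': [1.0, np.nan, np.nan]})

null_counts = frame.isnull().sum()   # missing values per column
dropped = frame.dropna(how='any')    # keep only rows with no missing values
filled = frame.fillna(value=5)       # replace every missing value with 5

print(null_counts)
print(dropped)
print(filled)
```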
df.describe()
df.mean(),df.mean(1) # Operation on two different axes
df
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
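To see what `apply` does, it helps to run both calls on values that can be checked by hand. This is a small sketch with an assumed two-column frame; the function is applied to each column in turn.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 40]})

running = frame.apply(np.cumsum)                    # running total down each column
spread = frame.apply(lambda x: x.max() - x.min())   # one number per column

print(running)   # A: 1, 3, 6   B: 10, 30, 70
print(spread)    # A: 2         B: 30
```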
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
np.random.randn(10,4)
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pd.concat(pieces)
# You can also join with pd.merge; DataFrame.append was removed in pandas 2.0, so use pd.concat to add rows
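A brief sketch of both operations on tiny frames. The column names `key`, `lval`, and `rval` and the frame names are made up for this example.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# SQL-style join on the shared 'key' column.
merged = pd.merge(left, right, on='key')

# Adding rows: concatenate and renumber the index.
extra = pd.DataFrame({'key': ['baz'], 'lval': [3]})
longer = pd.concat([left, extra], ignore_index=True)

print(merged)
print(longer)
```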
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df
df.groupby(['A','B']).sum()
# You can also stack or unstack levels
a = df.groupby(['A','B']).sum()
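As a sketch of unstacking, the grouped result below has a two-level (A, B) row index, and `unstack` moves the B level into the columns. The frame here is a reduced, non-random version of the one above, chosen so the values are easy to follow.

```python
import pandas as pd

frame = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                      'B': ['one', 'two', 'one', 'two'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

grouped = frame.groupby(['A', 'B']).sum()   # rows indexed by (A, B)
wide = grouped.unstack('B')                 # B values become a column level

print(grouped)
print(wide)
```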
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])  # rows= and cols= were renamed to index= and columns=
import pandas as pd
import numpy as np
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='s')  # recent pandas prefers lowercase 's' for seconds
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
# Built in resampling
ts.resample('1min').mean() # Downsample per-second data to 1-minute means (the how= argument was removed)
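With random data the resampled means are hard to verify, so here is a sketch on a predictable series: 120 seconds holding the values 0 to 119. The name `per_minute` is introduced for this example.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2014', periods=120, freq='s')
ts = pd.Series(np.arange(120), index=rng)

# Group the seconds into 1-minute bins and average each bin.
per_minute = ts.resample('1min').mean()

print(per_minute)   # first minute averages 0..59, second minute 60..119
```

Other aggregations follow the same pattern, for example `.sum()` or `.max()` in place of `.mean()`.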
# Many additional time series features
# Type "ts." in a cell and press Tab to browse the available methods
ts.plot()
def randwalk(startdate, points):
    # One random step per day, starting at startdate
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()  # The running sum of the steps is the walk
    ts.plot()
    return ts
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)
# Pandas plot function will print with labels as default
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot(); plt.legend(loc='best')  # df.plot() opens its own figure, so plt.figure() is not needed
I/O is straightforward with, for example, pd.read_csv or df.to_csv
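A minimal round-trip sketch follows. To stay self-contained it writes to an in-memory buffer rather than a file on disk; passing a path such as a hypothetical 'data.csv' to both calls works the same way.

```python
import io
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                      # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)
print(restored)
```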
Let's look under x's in plt modules
Recommended Resources
Name | Description |
---|---|
Official Pandas Tutorials | Wes & Company's selection of tutorials and lectures |
Julia Evans Pandas Cookbook | Great resource with examples from weather, bikes and 311 calls |
Learn Pandas Tutorials | A great series of Pandas tutorials from Dave Rojas |
Research Computing Python Data PYNBs | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas |