#!/usr/bin/env python
# coding: utf-8

# This notebook will cover the assumed knowledge of pandas.
# Here's a few questions to check if you already know the material in this notebook.
# 
# 1. Does a NumPy array have a single dtype or multiple dtypes?
# 2. Why is broadcasting useful?
# 3. How do you slice a DataFrame by row label?
# 4. How do you select a column of a DataFrame?
# 5. Is the Index a column in the DataFrame?
# 
# If you feel pretty comfortable with those, go ahead and skip this notebook.
# [Answers](#Answers) are at the end. We'll meet up at the next notebook.

# # Aside: IPython Notebook
# 
# - two modes command and edit
#   - command -> edit: `Enter`
#   - edit -> command: `Esc`
# - `h` : Keyboard Shortcuts:  (from command mode)
# - `j` / `k` : navigate cells
# - `shift+Enter` executes a cell

# Outline:
# 
# - [NumPy Foundation](#NumPy-Foundation)
# - [Pandas](#Pandas)
# - [Data Structures](#Data-Structures)
# 
# ## Numpy Foundation
# 
# pandas is built atop NumPy, historically and in the actual library.
# It's helpful to have a good understanding of some NumPyisms. [Speak the vernacular](https://www.youtube.com/watch?v=u2yvNw49AX4).
# 
# ### ndarray
# 
# The core of numpy is the `ndarray`, N-dimensional array. These are homogenously-typed, fixed-length data containers.
# NumPy also provides many convenient and fast methods implemented on the `ndarray`.

# In[1]:


from __future__ import print_function

import numpy as np
import pandas as pd

x = np.array([1, 2, 3])
x


# In[2]:


x.dtype


# In[3]:


y = np.array([[True, False], [False, True]])
y


# In[4]:


y.shape


# ### dtypes
# 
# Unlike python lists, NumPy arrays care about the type of data stored within.
# The full list of NumPy dtypes can be found in the [NumPy documentation](http://docs.scipy.org/doc/numpy/user/basics.types.html).
# 
# ![dtypes](http://docs.scipy.org/doc/numpy/_images/dtype-hierarchy.png)
# 
# We sacrifice the convinience of mixing bools and ints and floats within an array for much better performance.
# However, an unexpected `dtype` change will probably bite you at some point in the future.
# 
# The two biggest things to remember are
# 
# - Missing values (NaN) cast integer or boolean arrays to floats
# - the object dtype is the fallback
# 
# You'll want to avoid object dtypes. It's typically slow.

# ### Broadcasting
# 
# It's super cool and super useful. The one-line explanation is that when doing elementwise operations, things expand to the "correct" shape.

# In[5]:


# add a scalar to a 1-d array
x = np.arange(5)
print('x:  ', x)
print('x+1:', x + 1, end='\n\n')

y = np.random.uniform(size=(2, 5))
print('y:  ', y,  sep='\n')
print('y+1:', y + 1, sep='\n')


# Since `x` is shaped `(5,)` and `y` is shaped `(2,5)` we can do operations between them.

# In[6]:


x * y


# Without broadcasting we'd have to manually reshape our arrays, which quickly gets annoying.

# In[7]:


x.reshape(1, 5).repeat(2, axis=0) * y


# # Pandas
# 
# We'll breeze through the basics here, and get onto some interesting applications in a bit. I want to provide the *barest* of intuition so things stick down the road.
# 
# ## Why pandas
# 
# NumPy is great. But it lacks a few things that are conducive to doing statisitcal analysis. By building on top of NumPy, pandas provides
# 
# - labeled arrays
# - heterogenous data types within a table
# - better missing data handling
# - convenient methods 
# - more data types (Categorical, Datetime)
# 
# ## Data Structures
# 
# This is the typical starting point for any intro to pandas.
# We'll follow suit.
# 
# ### The DataFrame
# 
# Here we have the workhorse data structure for pandas.
# It's an in-memory table holding your data, and provides a few conviniences over lists of lists or NumPy arrays.

# In[8]:


import numpy as np
import pandas as pd


# In[9]:


# Many ways to construct a DataFrame
# We pass a dict of {column name: column values}
np.random.seed(42)
df = pd.DataFrame({'A': [1, 2, 3], 'B': [True, True, False],
                   'C': np.random.randn(3)},
                  index=['a', 'b', 'c'])  # also this weird index thing
df


# In[10]:


from IPython.display import Image

Image('dataframe.png')


# ### Selecting
# 
# Our first improvement over numpy arrays is labeled indexing. We can select subsets by column, row, or both. Column selection uses the regular python `__getitem__` machinery. Pass in a single column label `'A'` or a list of labels `['A', 'C']` to select subsets of the original `DataFrame`.

# In[11]:


# Single column, reduces to a Series
df['A']


# In[12]:


cols = ['A', 'C']
df[cols]


# For row-wise selection, use the special `.loc` accessor.

# In[13]:


df.loc[['a', 'b']]


# When your index labels are ordered, you can use ranges to select rows or columns.

# In[14]:


df.loc['a':'b']


# Notice that the slice is *inclusive* on both sides,  unlike your typical slicing of a list. Sometimes, you'd rather slice by *position* instead of label. `.iloc` has you covered:

# In[15]:


df.iloc[0:2]


# This follows the usual python slicing rules: closed on the left, open on the right.
# 
# As I mentioned, you can slice both rows and columns. Use `.loc` for label or `.iloc` for position indexing.

# In[16]:


df.loc['a', 'B']


# Pandas, like NumPy, will reduce dimensions when possible. Select a single column and you get back `Series` (see below). Select a single row and single column, you get a scalar.
# 
# You can get pretty fancy:

# In[17]:


df.loc['a':'b', ['A', 'C']]


# #### Summary
# 
# - Use `[]` for selecting columns
# - Use `.loc[row_lables, column_labels]` for label-based indexing
# - Use `.iloc[row_positions, column_positions]` for positional index
# 
# I've left out boolean and hierarchical indexing, which we'll see later.

# ## Series
# 
# You've already seen some `Series` up above. It's the 1-dimensional analog of the DataFrame. Each column in a `DataFrame` is in some sense a `Series`. You can select a `Series` from a DataFrame in a few ways:

# In[18]:


# __getitem__ like before
df['A']


# In[19]:


# .loc, like before
df.loc[:, 'A']


# In[20]:


# using `.` attribute lookup
df.A


# You'll have to be careful with the last one. It won't work if you're column name isn't a valid python identifier (say it has a space) or if it conflicts with one of the (many) methods on `DataFrame`. The `.` accessor is extremely convient for interactive use though.
# 
# You should never *assign* a column with `.` e.g. don't do
# 
# ```python
# # bad
# df.A = [1, 2, 3]
# ```
# 
# It's unclear whether your attaching the list `[1, 2, 3]` as an attirbute of `df`, or whether you want it as a column. It's better to just say
# 
# ```python
# df['A'] = [1, 2, 3]
# # or
# df.loc[:, 'A'] = [1, 2, 3]
# ```
# 
# `Series` share many of the same methods as `DataFrame`s.

# ## Index
# 
# `Index`es are something of a peculiarity to pandas.
# First off, they are not the kind of indexes you'll find in SQL, which are used to help the engine speed up certain queries.
# In pandas, `Index`es are about lables. This helps with selection (like we did above) and automatic alignment when performing operations between two `DataFrame`s or `Series`.
# 
# R does have row labels, but they're nowhere near as powerful (or complicated) as in pandas. You can access the index of a `DataFrame` or `Series` with the `.index` attribute.

# In[21]:


df.index


# There are special kinds of `Index`es that you'll come across. Some of these are
# 
# - `MultiIndex` for multidimensional (Hierarchical) labels
# - `DatetimeIndex` for datetimes
# - `Float64Index` for floats
# - `CategoricalIndex` for, you guessed it, `Categorical`s
# 
# We'll talk *a lot* more about indexes. They're a complex topic and can introduce headaches.
# 
# <blockquote class="twitter-tweet" lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/gjreda">@gjreda</a> <a href="https://twitter.com/treycausey">@treycausey</a> in some cases row indexes are the best thing since sliced bread, in others they simply get in the way. Hard problem</p>&mdash; Wes McKinney (@wesmckinn) <a href="https://twitter.com/wesmckinn/status/547177248768659457">December 22, 2014</a></blockquote>
# 
# Pandas, for better or for worse, does usually provide ways around row indexes being obstacles. The problem is knowing *when* they are just getting in the way, which mostly comes by experience. Sorry.

# # Answers
# 
# 1. Does a NumPy array have a single dtype or multiple dtypes?
#   - NumPy arrays are homogenous: they only have a single dtype (unlike DataFrames).
#   You can have an array that holds mixed types, e.g. `np.array(['a', 1])`, but the
#   dtype of that array is `object`, which you probably want to avoid.
# 2. Why is broadcasting useful?
#   - It lets you perform operations between arrays that are compatable, but not nescessarily identical,
#   in shape. This makes your code cleaner.
# 3. How do you slice a DataFrame by row label?
#   - Use `.loc[label]`. For position based use `.iloc[integer]`.
# 4. How do you select a column of a DataFrame?
#   - Standard `__getitem__`: `df[column_name]`
# 5. Is the Index a column in the DataFrame?
#   - No. It isn't included in any operations (`mean`, etc). It can be inserted as a regular
#   column with `df.reset_index()`.