**Get Data** - Our data set will consist of an Excel file containing customer counts per date. We will learn how to read in the Excel file for processing.

**Prepare Data** - The data is an irregular time series with duplicate dates. Our challenge is to compress the data and come up with next year's forecasted customer count.

**Analyze Data** - We use graphs to visualize trends and spot outliers. Some built-in computational tools will be used to calculate next year's forecasted customer count.

**Present Data** - The results will be plotted.

***NOTE:
Make sure you have looked through all previous lessons, as the knowledge learned in previous lessons will be
needed for this exercise.***

In [1]:

```
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np
import sys
%matplotlib inline
```

In [2]:

```
print('Python version ' + sys.version)
print('Pandas version: ' + pd.__version__)
```

We will be creating our own test data for analysis.

In [3]:

```
# set seed
np.seed(111)

# Function to generate test data
def CreateDataSet(Number=1):
    Output = []
    for i in range(Number):
        # Create a weekly (Mondays) date range
        rng = pd.date_range(start='1/1/2009', end='12/31/2012', freq='W-MON')
        # Create random data
        data = np.randint(low=25, high=1000, size=len(rng))
        # Status pool
        status = [1, 2, 3]
        # Make a random list of statuses
        random_status = [status[np.randint(low=0, high=len(status))] for _ in range(len(rng))]
        # State pool
        states = ['GA', 'FL', 'fl', 'NY', 'NJ', 'TX']
        # Make a random list of states
        random_states = [states[np.randint(low=0, high=len(states))] for _ in range(len(rng))]
        Output.extend(zip(random_states, random_status, data, rng))
    return Output
```

In [4]:

```
dataset = CreateDataSet(4)
df = pd.DataFrame(data=dataset, columns=['State','Status','CustomerCount','StatusDate'])
df.info()
```

In [5]:

```
df.head()
```

Out[5]:

We are now going to save this dataframe to an Excel file and then read it back into a dataframe. We do this simply to show how to read from and write to Excel files.

We do not write the index values of the dataframe to the Excel file, since they are not meant to be part of our initial test data set.

In [6]:

```
# Save results to excel
df.to_excel('Lesson3.xlsx', index=False)
print('Done')
```

We will be using the **read_excel** function to read in data from an Excel file. The function allows you to read in specific tabs by name or location.

In [7]:

```
pd.read_excel?
```

In [8]:

```
# Location of file
Location = r'C:\Users\david\notebooks\pandas\Lesson3.xlsx'
# Parse a specific sheet
df = pd.read_excel(Location, 0, index_col='StatusDate')
df.dtypes
```

Out[8]:

In [9]:

```
df.index
```

Out[9]:

In [10]:

```
df.head()
```

Out[10]:

This section attempts to clean up the data for analysis.

- Make sure the state column is all in upper case
- Only select records where the account status is equal to "1"
- Merge (NJ and NY) to NY in the state column
- Remove any outliers (any odd results in the data set)

Let's take a quick look at how some of the *State* values are upper case and some are lower case.

In [11]:

```
df['State'].unique()
```

Out[11]:

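To convert all the *State* values to upper case, we will use the **upper()** function and the dataframe's **apply** method. The **lambda** function simply applies upper() to each value in the *State* column.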

In [12]:

```
# Clean State Column, convert to upper case
df['State'] = df.State.apply(lambda x: x.upper())
```
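As an alternative sketch, pandas also offers vectorized string methods through the `.str` accessor, which give the same result as the apply/lambda approach. The small `demo` frame below is made up for illustration and is not part of the lesson's data.

```python
import pandas as pd

# Hypothetical sample with mixed-case state codes (illustration only)
demo = pd.DataFrame({'State': ['GA', 'fl', 'FL', 'ny']})

# Element-wise apply, as in the lesson
via_apply = demo['State'].apply(lambda x: x.upper())

# Vectorized equivalent using the .str accessor
via_str = demo['State'].str.upper()

print(via_str.tolist())
print(via_apply.tolist() == via_str.tolist())
```

Both approaches upper-case every value; the `.str` version also passes missing values through instead of raising an error.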

In [13]:

```
df['State'].unique()
```

Out[13]:

In [14]:

```
# Only grab where Status == 1
mask = df['Status'] == 1
df = df[mask]
```

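To turn the **NJ** states to **NY** we build a mask and assign through it.

**[df.State == 'NJ']** - Find all records in the *State* column that are equal to *NJ*. Every record matched by this mask then has its *State* value replaced with *NY*.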

In [15]:

```
# Convert NJ to NY
mask = df.State == 'NJ'
# Assign through a single .loc call to avoid chained indexing
df.loc[mask, 'State'] = 'NY'
```
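Here is a minimal sketch of mask-based replacement on a made-up frame (`demo` and its values are hypothetical). It selects the rows and the column in one `.loc` call, so the assignment is applied to the dataframe itself rather than to a temporary copy.

```python
import pandas as pd

# Hypothetical sample data (illustration only)
demo = pd.DataFrame({'State': ['NY', 'NJ', 'FL', 'NJ'],
                     'CustomerCount': [10, 20, 30, 40]})

# Boolean mask: True where State equals 'NJ'
mask = demo['State'] == 'NJ'

# Select rows by mask and the State column by label, then assign
demo.loc[mask, 'State'] = 'NY'

print(demo['State'].unique())
```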

Now we can see we have a much cleaner data set to work with.

In [16]:

```
df['State'].unique()
```

Out[16]:

At this point we may want to graph the data to check for any outliers or inconsistencies. We will be using the **plot()** method of the dataframe.

As you can see, the graph below is not very conclusive, which is probably a sign that we need to perform some more data preparation.

In [17]:

```
df['CustomerCount'].plot(figsize=(15,5));
```

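If we take a look at the data, we see that there are multiple records for the same State, StatusDate, and Status combination. If we add the values in the **CustomerCount** column per State, StatusDate, and Status, we will get the total customer count per day.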

In [18]:

```
# Sort the NY records by the StatusDate index
sortdf = df[df['State']=='NY'].sort_index(axis=0)
sortdf.head(10)
```

Out[18]:

Our task is now to create a new dataframe that compresses the data so we have daily customer counts per State and StatusDate. We can ignore the Status column since all the values in this column are of value *1*. To accomplish this we will use the dataframe's functions **groupby** and **sum()**.

Note that we had to use **reset_index**. If we did not, we would not have been able to group by both the State and the StatusDate, since the groupby function expects only columns as inputs. The **reset_index** function brings the index **StatusDate** back to a column in the dataframe.

In [19]:

```
# Group by State and StatusDate
Daily = df.reset_index().groupby(['State','StatusDate']).sum()
Daily.head()
```

Out[19]:
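To make the compression step concrete, here is a minimal sketch on a made-up frame in which two rows share the same State and StatusDate. The `demo` data and the `daily_demo` name are hypothetical.

```python
import pandas as pd

# Hypothetical records: the two FL rows share the same StatusDate
demo = pd.DataFrame(
    {'State': ['FL', 'FL', 'GA'],
     'CustomerCount': [100, 50, 70]},
    index=pd.to_datetime(['2009-01-05', '2009-01-05', '2009-01-05']),
)
demo.index.name = 'StatusDate'

# Move StatusDate out of the index, then sum per (State, StatusDate)
daily_demo = demo.reset_index().groupby(['State', 'StatusDate']).sum()

print(daily_demo)
```

The two FL rows collapse into a single row with a count of 150, and State and StatusDate become the index of the result.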

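The **State** and **StatusDate** columns are automatically placed in the index of the **Daily** dataframe.

Below we delete the **Status** column, since every record had a status of 1 and the column is no longer necessary.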

In [20]:

```
del Daily['Status']
Daily.head()
```

Out[20]:

In [21]:

```
# What is the index of the dataframe
Daily.index
```

Out[21]:

In [22]:

```
# Select the State index
Daily.index.levels[0]
```

Out[22]:

In [23]:

```
# Select the StatusDate index
Daily.index.levels[1]
```

Out[23]:
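The sketch below, on a made-up two-level index, shows the difference between the unique values stored in `index.levels` and the per-row labels returned by `get_level_values`. All names and numbers here are hypothetical.

```python
import pandas as pd

# Hypothetical two-level index mirroring (State, StatusDate)
idx = pd.MultiIndex.from_tuples(
    [('FL', pd.Timestamp('2009-01-05')),
     ('FL', pd.Timestamp('2009-01-12')),
     ('GA', pd.Timestamp('2009-01-05'))],
    names=['State', 'StatusDate'],
)
demo = pd.DataFrame({'CustomerCount': [150, 80, 70]}, index=idx)

# levels: the unique values available at each level
states = list(demo.index.levels[0])

# get_level_values: one label per row
per_row = list(demo.index.get_level_values('State'))

# .loc with a first-level label returns that state's rows, indexed by StatusDate
fl = demo.loc['FL']

print(states, per_row, len(fl))
```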

Let's now plot the data per State.

As you can see, by breaking the graph up by the **State** column we have a much clearer picture of what the data looks like. Can you spot any outliers?

In [24]:

```
Daily.loc['FL'].plot()
Daily.loc['GA'].plot()
Daily.loc['NY'].plot()
Daily.loc['TX'].plot();
```

We can also plot the data from a specific date onward, such as **2012**. We can now clearly see that the data for these states is all over the place. Since the data consists of weekly customer counts, the variability of the data seems suspect. For this tutorial we will assume bad data and proceed.

In [25]:

```
Daily.loc['FL']['2012':].plot()
Daily.loc['GA']['2012':].plot()
Daily.loc['NY']['2012':].plot()
Daily.loc['TX']['2012':].plot();
```
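The `['2012':]` slices above rely on partial string indexing of a datetime index. Here is a minimal sketch on a made-up weekly series (the values are hypothetical) that keeps everything from the start of 2012 onward.

```python
import pandas as pd

# Hypothetical weekly (Monday) series spanning the 2011/2012 boundary
rng = pd.date_range(start='2011-12-05', periods=8, freq='W-MON')
s = pd.Series(range(8), index=rng)

# A year string selects from the first timestamp in 2012 onward
from_2012 = s.loc['2012':]

print(from_2012)
```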

We will assume that per month the customer count should remain relatively steady. Any data outside a specific range in that month will be removed from the data set. The final result should have smooth graphs with no spikes.

**StateYearMonth** - Here we group by State, Year of StatusDate, and Month of StatusDate.

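We will be using **transform** instead of **apply**, because transform returns a result with the same number of rows as the original dataframe, which lets us assign it back as a new column.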

In [26]:

```
# Calculate Outliers
StateYearMonth = Daily.groupby([
    Daily.index.get_level_values(0),
    Daily.index.get_level_values(1).year,
    Daily.index.get_level_values(1).month,
])
# Bounds are Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR = Q3 - Q1
Daily['Lower'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.25) - 1.5*(x.quantile(q=.75) - x.quantile(q=.25)))
Daily['Upper'] = StateYearMonth['CustomerCount'].transform(lambda x: x.quantile(q=.75) + 1.5*(x.quantile(q=.75) - x.quantile(q=.25)))
Daily['Outlier'] = (Daily['CustomerCount'] < Daily['Lower']) | (Daily['CustomerCount'] > Daily['Upper'])
# Remove Outliers
Daily = Daily[~Daily['Outlier']]
```
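As a rough illustration of percentile-based outlier removal, the sketch below applies the common interquartile-range rule (bounds at Q1 - 1.5*IQR and Q3 + 1.5*IQR) per group. The `demo` frame and its numbers are made up, and grouping by State alone is a simplification of the State, Year, and Month grouping used in the lesson.

```python
import pandas as pd

# Hypothetical counts: FL contains one obvious spike (500)
demo = pd.DataFrame({
    'State': ['FL'] * 5 + ['GA'] * 5,
    'CustomerCount': [100, 102, 98, 101, 500, 40, 42, 41, 39, 43],
})

grouped = demo.groupby('State')['CustomerCount']

# transform keeps one value per original row, so results align with demo
q1 = grouped.transform(lambda x: x.quantile(0.25))
q3 = grouped.transform(lambda x: x.quantile(0.75))
iqr = q3 - q1

demo['Lower'] = q1 - 1.5 * iqr
demo['Upper'] = q3 + 1.5 * iqr
demo['Outlier'] = (demo['CustomerCount'] < demo['Lower']) | (demo['CustomerCount'] > demo['Upper'])

# Keep only the rows inside the bounds
clean = demo[~demo['Outlier']]

print(clean)
```

Only the FL spike falls outside its group's bounds, so nine of the ten rows survive.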

The dataframe named **Daily** will hold customer counts that have been aggregated per day. The original data (df) has multiple records per day. We are left with a data set that is indexed by both the State and the StatusDate. The Outlier column should be equal to **False** for all records.

In [27]:

```
Daily.head()
```

Out[27]:

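We create a separate dataframe named **ALL**, which groups the Daily dataframe by StatusDate. We are essentially getting rid of the **State** level and summing the customer counts across all states for each date. The **Max** column holds the maximum customer count per year and month.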

In [28]:

```
# Combine all markets
# Sum the customer count across all states by Date
ALL = pd.DataFrame(Daily['CustomerCount'].groupby(Daily.index.get_level_values(1)).sum())
ALL.columns = ['CustomerCount'] # rename column
# Group by Year and Month
YearMonth = ALL.groupby([lambda x: x.year, lambda x: x.month])
# What is the max customer count per Year and Month
ALL['Max'] = YearMonth['CustomerCount'].transform(lambda x: x.max())
ALL.head()
```

Out[28]:

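As you can see from the **ALL** dataframe above, in the month of January 2009, the maximum customer count was 901. If we had used **apply** instead of **transform**, we would have got one row per (Year, Month) group rather than one row per date.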
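To see the difference in shape between **transform** and a plain aggregation, here is a minimal sketch on made-up numbers. The `demo` frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical counts: two dates in January 2009, one in February 2009
rng = pd.to_datetime(['2009-01-05', '2009-01-12', '2009-02-02'])
demo = pd.DataFrame({'CustomerCount': [120, 150, 90]}, index=rng)

# Group by the year and month of each index timestamp
by_month = demo.groupby([lambda x: x.year, lambda x: x.month])['CustomerCount']

# transform: one value per original row, aligned with demo's index
per_row_max = by_month.transform(lambda x: x.max())

# max(): one value per (year, month) group
per_group_max = by_month.max()

print(per_row_max)
print(per_group_max)
```

Because the transformed result has the same index as `demo`, it can be assigned straight back as a new column.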

There is also interest in gauging whether the current customer counts are reaching certain goals the company has established. The task here is to show visually whether the current customer counts are meeting the goals listed below. We will call the goals **BHAG** (Big Hairy Annual Goal).

- 12/31/2011 - 1,000 customers
- 12/31/2012 - 2,000 customers
- 12/31/2013 - 3,000 customers

We will be using the **date_range** function to create our dates.

**Definition:** date_range(start=None, end=None, periods=None, freq='D', tz=None, normalize=False, name=None, closed=None)

By choosing the frequency to be **A**, or annual, we will be able to get the three target dates from above (recent pandas releases name this alias **YE**).

In [29]:

```
pd.date_range?
```

In [30]:

```
# Create the BHAG dataframe
data = [1000,2000,3000]
idx = pd.date_range(start='12/31/2011', end='12/31/2013', freq='A')
BHAG = pd.DataFrame(data, index=idx, columns=['BHAG'])
BHAG
```

Out[30]:
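The sketch below builds the same three year-end dates in a way that should work across pandas versions. The helper `year_end_range` is hypothetical and assumes that a pandas release which does not recognize the newer `'YE'` alias raises a `ValueError` and still accepts `'A'`.

```python
import pandas as pd

def year_end_range(start, end):
    """Return the year-end dates between start and end (inclusive)."""
    try:
        # 'YE' is the year-end alias in newer pandas releases
        return pd.date_range(start=start, end=end, freq='YE')
    except ValueError:
        # Assumption: older releases reject 'YE' but accept 'A'
        return pd.date_range(start=start, end=end, freq='A')

idx = year_end_range('2011-12-31', '2013-12-31')

# Goal values taken from the list of targets in the lesson
targets = pd.DataFrame({'BHAG': [1000, 2000, 3000]}, index=idx)

print(targets)
```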

We combine the two dataframes using the **concat** function. Remember that when we choose **axis=0** we are appending row-wise.

In [31]:

```
# Combine the BHAG and the ALL data set
combined = pd.concat([ALL,BHAG], axis=0)
combined = combined.sort_index(axis=0)
combined.tail()
```

Out[31]:
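Here is a minimal sketch of what a row-wise concat does when the two frames have different columns. The `actual` and `goals` frames are made up. Each row keeps the columns it came with and gets NaN in the others.

```python
import pandas as pd

# Hypothetical frames with different columns and dates
actual = pd.DataFrame({'Max': [900, 950]},
                      index=pd.to_datetime(['2011-12-19', '2011-12-26']))
goals = pd.DataFrame({'BHAG': [1000]},
                     index=pd.to_datetime(['2011-12-31']))

# Stack the rows, then order them by date
combined_demo = pd.concat([actual, goals], axis=0).sort_index()

print(combined_demo)
```

The goal row has no Max value and the actual rows have no BHAG value, which is why the BHAG series is forward-filled before plotting.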

In [32]:

```
fig, axes = plt.subplots(figsize=(12, 7))
# Forward-fill the goals so BHAG plots as a step line between target dates
combined['BHAG'].ffill().plot(color='green', label='BHAG')
combined['Max'].plot(color='blue', label='All Markets')
plt.legend(loc='best');
```

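To forecast next year's customer count, we first group the **combined** dataframe by Year and take the maximum value per year. This gives us one row per Year.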

In [33]:

```
# Group by Year and then get the max value per year
Year = combined.groupby(lambda x: x.year).max()
Year
```

Out[33]:

In [34]:

```
# Add a column representing the percent change per year
Year['YR_PCT_Change'] = Year['Max'].pct_change(periods=1)
Year
```

Out[34]:

In [35]:

```
# Forecast next year: apply 2012's growth rate to 2012's max customer count
(1 + Year.loc[2012,'YR_PCT_Change']) * Year.loc[2012,'Max']
```

Out[35]:
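The forecast simply assumes that next year grows at the same rate as the most recent year. As a worked sketch with made-up yearly maxima (not the lesson's results):

```python
import pandas as pd

# Hypothetical yearly maxima (illustration only)
yearly = pd.DataFrame({'Max': [1000.0, 1200.0, 1500.0]},
                      index=[2010, 2011, 2012])

# Percent change versus the previous year: (1500 / 1200) - 1 = 0.25 for 2012
yearly['YR_PCT_Change'] = yearly['Max'].pct_change(periods=1)

growth = yearly.loc[2012, 'YR_PCT_Change']

# Apply the same growth to the latest value: 1.25 * 1500 = 1875
forecast_2013 = (1 + growth) * yearly.loc[2012, 'Max']

print(forecast_2013)
```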

Create individual Graphs per State.

In [36]:

```
# First Graph
ALL['Max'].plot(figsize=(10, 5));plt.title('ALL Markets')
# Last four Graphs
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))
fig.subplots_adjust(hspace=1.0) ## Create space between plots
Daily.loc['FL']['CustomerCount']['2012':].ffill().plot(ax=axes[0,0])
Daily.loc['GA']['CustomerCount']['2012':].ffill().plot(ax=axes[0,1])
Daily.loc['TX']['CustomerCount']['2012':].ffill().plot(ax=axes[1,0])
Daily.loc['NY']['CustomerCount']['2012':].ffill().plot(ax=axes[1,1])
# Add titles
axes[0,0].set_title('Florida')
axes[0,1].set_title('Georgia')
axes[1,0].set_title('Texas')
axes[1,1].set_title('North East');
```

**Author:** David Rojas