from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
from sqlalchemy import *
from sqlalchemy import func
from sqlalchemy.orm import sessionmaker
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from churndata import *
from util import *
db = create_engine('sqlite:///forjar.db')
metadata = MetaData(db)
Session = sessionmaker(bind=db)
session = Session()
#used for later; defines the numerical mappings for each category
campaign_to_num = {
'TW' : 1,
'RE' : 2,
'FB' : 3,
'PI' : 4
}
event_to_num = {
'like' : 1,
'share' : 2,
'nothing' : 3,
'bought' : 4
}
meal_to_num = {
'japanese': 1,
'chinese' : 2,
'french' : 3,
'german' : 4,
'italian' : 5,
'mexican' : 6,
'vietnamese' : 7
}
campaign_to_cost = {
'TW' : .25,
'RE' : .35,
'FB' : .45,
'PI' : .55
}
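As a quick illustration of how these dictionaries get used later, here is a minimal sketch (with made-up values, not data from forjar.db) that applies two of the mappings to a column of campaign codes:

```python
import pandas as pd

# Toy frame (values are made up, not from forjar.db): apply the
# mapping dictionaries above to a column of campaign codes.
campaign_to_num = {'TW': 1, 'RE': 2, 'FB': 3, 'PI': 4}
campaign_to_cost = {'TW': .25, 'RE': .35, 'FB': .45, 'PI': .55}

df = pd.DataFrame({'Campaign_ID': ['TW', 'FB', 'TW', 'PI']})
df['campaign_num'] = df['Campaign_ID'].map(campaign_to_num)
df['campaign_cost'] = df['Campaign_ID'].map(campaign_to_cost)
print(df)
```

`Series.map` with a dict leaves unmapped values as NaN, which is a handy way to spot category values you forgot to include in the mapping.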
q = session.query(Event).join(Meal,Event.Meal_Id == Meal.id).join(Users).add_entity(Meal).add_entity(Users)
The core objective of this exercise is to understand SQL queries, pandas data frames, and general exploratory data analysis.
Towards the end, we will learn how to take our basic tools for exploratory data analysis and transform the inputs into something
appropriate for a machine learning algorithm.
For our baseline classes, we will be using the classes from churndata.py. This contains all of the necessary classes from which
we will derive our analysis.
The associated database forjar.db contains our actual data set. We will use SQLAlchemy to load the data.
First we will load the data from our SQLite database and do a simple join.
Our objective here is to get a list of the users who have bought something.
If we think in terms of the objectives of the site, it is to maximize revenue.
On any given day, we need to understand which of our users actually tend to buy things on the site.
If they aren't buying anything, we should do something about it, given the data we know about them.
Exercise:
1. Goal: Load data from a SQL database such that you get a join of the users and events. We only want
the users who bought something.
Steps:
1. Load data via SQLAlchemy from the SQLite database forjar.db.
2. Create a query that contains a join on events and a filter on bought. Look into SQLAlchemy
sessions.
2. Load the results of a call to query.all() into a data frame.
Steps:
1. Create a dataframe with pandas.
2. Set the columns to the query keys.
The goal with this particular exercise is to understand which users are buying things so we can understand which attributes are successful.
def query_to_df(session, query):
    """
    Convert an SQL query to a pandas data frame.
    """
    result = session.execute(query)
    # Build the frame from the raw rows, then name the columns
    # after the keys of the result set.
    d = DataFrame(result.fetchall())
    d.columns = result.keys()
    return d
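The same fetchall-then-set-columns pattern can be tried without forjar.db at all. The sketch below stands in an in-memory SQLite table (made-up rows, standard-library sqlite3 instead of an SQLAlchemy session) just to show the mechanics:

```python
import sqlite3
from pandas import DataFrame

# Stand-in for forjar.db: an in-memory table with a few made-up rows
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE events (user_id INTEGER, type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 'bought'), (2, 'like'), (1, 'bought')])

# Same fetchall-then-set-columns pattern as query_to_df above
cur = conn.execute("SELECT user_id, type FROM events WHERE type = 'bought'")
d = DataFrame(cur.fetchall())
d.columns = [c[0] for c in cur.description]
print(d)
```

With sqlite3 the column names come from `cursor.description` rather than `result.keys()`, but the shape of the conversion is identical.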
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id)
d = query_to_df(session,q)
print d
#print d.sort('Users_id')
grouped = d.groupby(['Users_Campaign_ID','Event_Type'])
print grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10108176 entries, 0 to 10108175
Columns: 4 entries, Users_Campaign_ID to Event_User_Id
dtypes: object(4)

                              Event_Type
Users_Campaign_ID Event_Type
RE                like            597313
TW                like            604498
RE                nothing         604795
FB                like            605935
PI                like            608330
TW                nothing         612070
FB                nothing         613525
PI                nothing         615950
RE                share           633476
TW                share           641096
FB                share           642620
PI                share           645160
RE                bought          663404
TW                bought          671384
FB                bought          672980
PI                bought          675640
d.head(10)
| | Users_Campaign_ID | Event_Type | Users_id | Event_User_Id |
|---|---|---|---|---|
| 0 | PI | bought | 1 | 1 |
| 1 | RE | bought | 2 | 1 |
| 2 | RE | bought | 3 | 1 |
| 3 | PI | bought | 4 | 1 |
| 4 | RE | bought | 5 | 1 |
| 5 | TW | bought | 6 | 1 |
| 6 | RE | bought | 7 | 1 |
| 7 | PI | bought | 8 | 1 |
| 8 | RE | bought | 9 | 1 |
| 9 | RE | bought | 10 | 1 |
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id).filter(Event.Type == 'bought')
d = query_to_df(session,q)
grouped = d.groupby('Users_id')
result = grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
print result
<class 'pandas.core.frame.DataFrame'>
Index: 5044 entries, 1 to 999
Data columns (total 1 columns):
Event_Type    5044 non-null values
dtypes: int64(1)
d.groupby('Users_id').sum()
<class 'pandas.core.frame.DataFrame'>
Index: 5044 entries, 1 to 999
Data columns (total 3 columns):
Users_Campaign_ID    5044 non-null values
Event_Type           5044 non-null values
Event_User_Id        5044 non-null values
dtypes: object(3)
d.groupby('Users_Campaign_ID').describe()
| Users_Campaign_ID | | Users_Campaign_ID | Event_Type | Users_id | Event_User_Id |
|---|---|---|---|---|---|
| FB | count | 672980 | 672980 | 672980 | 672980 |
| | unique | 1 | 1 | 1265 | 334 |
| | top | FB | bought | 3923 | 1 |
| | freq | 672980 | 672980 | 532 | 24035 |
| PI | count | 675640 | 675640 | 675640 | 675640 |
| | unique | 1 | 1 | 1270 | 334 |
| | top | PI | bought | 3927 | 1 |
| | freq | 675640 | 675640 | 532 | 24130 |
| RE | count | 663404 | 663404 | 663404 | 663404 |
| | unique | 1 | 1 | 1247 | 334 |
| | top | RE | bought | 4300 | 1 |
| | freq | 663404 | 663404 | 532 | 23693 |
| TW | count | 671384 | 671384 | 671384 | 671384 |
| | unique | 1 | 1 | 1262 | 334 |
| | top | TW | bought | 3922 | 1 |
| | freq | 671384 | 671384 | 532 | 23978 |
result.describe()
| | Event_Type |
|---|---|
| count | 5044 |
| mean | 532 |
| std | 0 |
| min | 532 |
| 25% | 532 |
| 50% | 532 |
| 75% | 532 |
| max | 532 |
result.sum()
Event_Type    2683408
dtype: int64
result.head(5)
| Users_id | Event_Type |
|---|---|
| 1 | 532 |
| 4028 | 532 |
| 4027 | 532 |
| 4026 | 532 |
| 4025 | 532 |
Now we will want to perform some sort of ranking; understanding the attributes of the users who bought the most will allow us
to understand who our most profitable users are. Users who buy the most do not necessarily have the highest lifetime value,
but they are great low-hanging fruit for beginning to understand your users.
Exercise: Create a ranked grouping of the users who bought the most.
Steps:
1. Using our previous query, we should be able to also rank the users who bought the most.
2. Again put the data into a dataframe. Run your aggregations and group bys via the data frame.
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id).filter(Event.Type == 'bought')
d = query_to_df(session,q)
grouped = d.groupby('Users_id')
result = grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
print result
type(d)
pandas.core.frame.DataFrame
Now we will use pandas to start doing some exploratory analysis. Let's compute some fairly simple statistics on our data.
Exercise:
Steps:
1. Load the data from SQLAlchemy into a pandas data frame.
2. Using pandas, calculate the mean number of events per user.
3. Do a dual group by on each campaign type (facebook, twitter, pinterest, ...) and each event type.
"""
Pandas has very powerful plotting built in alongside matplotlib. Let's generate scatterplots for all of the various user campaign to event types.
Pandas has a lot of built-in tools for data vis; underneath it uses matplotlib. One key thing of note here is that pandas will not actually render your plots for you.
To render plots after a call to something like dataframe.hist(), do the following:
import matplotlib.pyplot as plt
plt.show()
This will allow us to see correlations in events all at once.
Steps:
1. Load the data and do a join on: (Users.Campaign_ID, Event.Type, Users.id, Event.User_Id).
2. If you have more than these columns in your dataframe, subset them to this list of columns.
3. Run an ordinal transform on each column, using the numerical mappings defined earlier. (Look into the dict.get function alongside df[columnname].apply().)
4. Now iterate over every possible combination of columns and plot a scatter plot of each. Render these on the screen all at once. Look into plt.subplot for this.
The end goal here is to transform the string values in each column of the data frame into a numerical representation.
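One way steps 3 and 4 might look on a toy frame (made-up rows; the Agg backend is forced only so the sketch runs headlessly):

```python
import itertools
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# The mapping dicts defined at the top of the exercise
campaign_to_num = {'TW': 1, 'RE': 2, 'FB': 3, 'PI': 4}
event_to_num = {'like': 1, 'share': 2, 'nothing': 3, 'bought': 4}

# Toy frame standing in for the query result
d = pd.DataFrame({
    'Users_Campaign_ID': ['TW', 'FB', 'PI', 'RE'],
    'Event_Type':        ['bought', 'like', 'share', 'nothing'],
})

# Step 3: ordinal transform via dict.get and apply
d['Users_Campaign_ID'] = d['Users_Campaign_ID'].apply(campaign_to_num.get)
d['Event_Type'] = d['Event_Type'].apply(event_to_num.get)

# Step 4: one scatter plot per pair of columns, all on one figure
pairs = list(itertools.combinations(d.columns, 2))
fig, axes = plt.subplots(1, len(pairs), squeeze=False)
for ax, (x, y) in zip(axes[0], pairs):
    ax.scatter(d[x], d[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
plt.show()  # a no-op under Agg; renders the figure interactively otherwise
```

With only two columns there is a single pair; with the full four-column frame the same loop yields six scatter panels.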
def vectorize(df, label_column):
    """
    Vectorize input features wrt a label column: collect every column
    except the label and return their values as the feature matrix.
    """
    feature_names = []
    for feature_name in df.columns.values:
        if feature_name != label_column:
            feature_names.append(feature_name)
    inputs = df[feature_names].values
    return inputs
Machine learning algorithms can only accept numbers. Our specific task here will be to predict a label.
Exercise:
Let's build out a data frame such that we have an outcome label. Set the outcome label to be the event type.
From here, binarize the event type outcome to be bought or not.
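A minimal sketch of the binarization (toy rows; the real frame comes from the Users/Meal/Event join):

```python
import pandas as pd

# Toy frame; the real one comes from the Users/Meal/Event join
d = pd.DataFrame({'Event_Type': ['bought', 'like', 'share', 'bought']})

# Binarize the outcome: 1 if the event was a purchase, 0 otherwise
d['label'] = (d['Event_Type'] == 'bought').astype(int)
print(d)
```

The comparison yields a boolean Series, and `astype(int)` turns it into the 0/1 outcome column a classifier expects.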
Machine learning algorithms typically work better if you scale the data (e.g. standardize each feature to zero mean and unit variance, or squash it into the [0, 1] range).
Exercise:
Query for Users.Campaign_ID, Meal.Type, Event.Type and load it into a dataframe.
Transform the data into numerical form (ordinal, etc.; think about what we did before).
Split out the feature set columns from the outcome label and normalize the given features.
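The normalization step might be sketched with the StandardScaler imported at the top (toy, already-ordinal-encoded feature matrix; made-up values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy, already-ordinal-encoded feature matrix (made-up values)
X = np.array([[1., 4.],
              [2., 1.],
              [3., 2.],
              [4., 3.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column of X_scaled now has zero mean and unit variance
```

Fit the scaler on the training split only and reuse it (`scaler.transform`) on the test split, otherwise information leaks from test to train.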