from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
from sqlalchemy import *
from sqlalchemy import func
from sqlalchemy.orm import sessionmaker
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import StandardScaler
from churndata import *
from util import *
db = create_engine('sqlite:///forjar.db')
metadata = MetaData(db)
Session = sessionmaker(bind=db)
session = Session()
#used for later; defines the numerical mappings for each category
campaign_to_num = {
'TW' : 1,
'RE' : 2,
'FB' : 3,
'PI' : 4
}
event_to_num = {
'like' : 1,
'share' : 2,
'nothing' : 3,
'bought' : 4
}
meal_to_num = {
'japanese': 1,
'chinese' : 2,
'french' : 3,
'german' : 4,
'italian' : 5,
'mexican' : 6,
'vietnamese' : 7
}
campaign_to_cost = {
'TW' : .25,
'RE' : .35,
'FB' : .45,
'PI' : .55
}
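As a quick illustration of how these dictionaries get used later, here is a minimal sketch (with made-up values, not data from forjar.db) that applies two of the mappings to a column of campaign codes:

```python
import pandas as pd

# Toy frame (values are made up, not from forjar.db): apply the
# mapping dictionaries above to a column of campaign codes.
campaign_to_num = {'TW': 1, 'RE': 2, 'FB': 3, 'PI': 4}
campaign_to_cost = {'TW': .25, 'RE': .35, 'FB': .45, 'PI': .55}

df = pd.DataFrame({'Campaign_ID': ['TW', 'FB', 'TW', 'PI']})
df['campaign_num'] = df['Campaign_ID'].map(campaign_to_num)
df['campaign_cost'] = df['Campaign_ID'].map(campaign_to_cost)
print(df)
```

`Series.map` with a dict leaves unmapped values as NaN, which is a handy way to spot category values you forgot to include in the mapping.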
q = session.query(Event).join(Meal,Event.Meal_Id == Meal.id).join(Users).add_entity(Meal).add_entity(Users)
The core objective of this exercise is to understand SQL queries, pandas data frames, and general exploratory data analysis.
Towards the end, we will learn how to take our basic tools for exploratory data analysis and transform the inputs into something
appropriate for a machine learning algorithm.
For our baseline classes, we will be using the classes from churndata.py. This contains all of the necessary classes from which
we will derive our analysis.
The associated database forjar.db contains our actual data set. We will use SQLAlchemy to load the data.
First we will load the data from our SQLite database and do a simple join.
Our objective here is to get a list of the users who have bought something.
If we think in terms of the objectives of the site, it is to maximize revenue.
On any given day, we need to understand which of our users actually tend to buy things on the site.
If they aren't buying anything, we should do something about it, given the data we know about them.
Exercise:
1. Goal: Load data from a SQL database such that you get a join of the users and events. We only want
the users who bought something.
Steps:
1. Load data via SQLAlchemy from the SQLite database forjar.db.
2. Create a query that contains a join on events and a filter on bought. Look into SQLAlchemy
sessions.
2. Load the results of a call to query.all() into a data frame.
Steps:
1. Create a dataframe with pandas.
2. Set the columns to the query keys.
The goal with this particular exercise is to understand which users are buying things so we can understand which attributes are successful.
def query_to_df(session, query):
    """
    Convert an SQL query to a pandas data frame.
    """
    result = session.execute(query)
    # Build the frame from the raw rows, then name the columns
    # after the keys of the result set.
    d = DataFrame(result.fetchall())
    d.columns = result.keys()
    return d
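The same fetchall-then-set-columns pattern can be tried without forjar.db at all. The sketch below stands in an in-memory SQLite table (made-up rows, standard-library sqlite3 instead of an SQLAlchemy session) just to show the mechanics:

```python
import sqlite3
from pandas import DataFrame

# Stand-in for forjar.db: an in-memory table with a few made-up rows
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE events (user_id INTEGER, type TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 'bought'), (2, 'like'), (1, 'bought')])

# Same fetchall-then-set-columns pattern as query_to_df above
cur = conn.execute("SELECT user_id, type FROM events WHERE type = 'bought'")
d = DataFrame(cur.fetchall())
d.columns = [c[0] for c in cur.description]
print(d)
```

With sqlite3 the column names come from `cursor.description` rather than `result.keys()`, but the shape of the conversion is identical.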
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id)
d = query_to_df(session,q)
print d
#print d.sort('Users_id')
grouped = d.groupby(['Users_Campaign_ID','Event_Type'])
print grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10108176 entries, 0 to 10108175
Columns: 4 entries, Users_Campaign_ID to Event_User_Id
dtypes: object(4)

                              Event_Type
Users_Campaign_ID Event_Type
RE                like            597313
TW                like            604498
RE                nothing         604795
FB                like            605935
PI                like            608330
TW                nothing         612070
FB                nothing         613525
PI                nothing         615950
RE                share           633476
TW                share           641096
FB                share           642620
PI                share           645160
RE                bought          663404
TW                bought          671384
FB                bought          672980
PI                bought          675640
d.head(10)
| | Users_Campaign_ID | Event_Type | Users_id | Event_User_Id |
|---|---|---|---|---|
| 0 | PI | bought | 1 | 1 |
| 1 | RE | bought | 2 | 1 |
| 2 | RE | bought | 3 | 1 |
| 3 | PI | bought | 4 | 1 |
| 4 | RE | bought | 5 | 1 |
| 5 | TW | bought | 6 | 1 |
| 6 | RE | bought | 7 | 1 |
| 7 | PI | bought | 8 | 1 |
| 8 | RE | bought | 9 | 1 |
| 9 | RE | bought | 10 | 1 |
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id).filter(Event.Type == 'bought')
d = query_to_df(session,q)
grouped = d.groupby('Users_id')
result = grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
print result
<class 'pandas.core.frame.DataFrame'>
Index: 5044 entries, 1 to 999
Data columns (total 1 columns):
Event_Type    5044 non-null values
dtypes: int64(1)
d.groupby('Users_id').sum()
<class 'pandas.core.frame.DataFrame'>
Index: 5044 entries, 1 to 999
Data columns (total 3 columns):
Users_Campaign_ID    5044 non-null values
Event_Type           5044 non-null values
Event_User_Id        5044 non-null values
dtypes: object(3)
d.groupby('Users_Campaign_ID').describe()
| Users_Campaign_ID | | Users_Campaign_ID | Event_Type | Users_id | Event_User_Id |
|---|---|---|---|---|---|
| FB | count | 672980 | 672980 | 672980 | 672980 |
| | unique | 1 | 1 | 1265 | 334 |
| | top | FB | bought | 3923 | 1 |
| | freq | 672980 | 672980 | 532 | 24035 |
| PI | count | 675640 | 675640 | 675640 | 675640 |
| | unique | 1 | 1 | 1270 | 334 |
| | top | PI | bought | 3927 | 1 |
| | freq | 675640 | 675640 | 532 | 24130 |
| RE | count | 663404 | 663404 | 663404 | 663404 |
| | unique | 1 | 1 | 1247 | 334 |
| | top | RE | bought | 4300 | 1 |
| | freq | 663404 | 663404 | 532 | 23693 |
| TW | count | 671384 | 671384 | 671384 | 671384 |
| | unique | 1 | 1 | 1262 | 334 |
| | top | TW | bought | 3922 | 1 |
| | freq | 671384 | 671384 | 532 | 23978 |
result.describe()
| | Event_Type |
|---|---|
| count | 5044 |
| mean | 532 |
| std | 0 |
| min | 532 |
| 25% | 532 |
| 50% | 532 |
| 75% | 532 |
| max | 532 |
result.sum()
Event_Type    2683408
dtype: int64
result.head(5)
| Users_id | Event_Type |
|---|---|
| 1 | 532 |
| 4028 | 532 |
| 4027 | 532 |
| 4026 | 532 |
| 4025 | 532 |
Now we will want to perform some sort of ranking; understanding the attributes of the users who bought the most will allow us
to understand who our most profitable users are. Users who buy the most do not necessarily have the highest lifetime value,
but they are great low-hanging fruit for beginning to understand your users.
Exercise: Create a ranked grouping of the users who bought the most.
Steps:
1. Using our previous query, we should be able to also rank the users who bought the most.
2. Again put the data into a dataframe. Run your aggregations and group bys via the data frame.
q = session.query(Users.Campaign_ID,Event.Type,Users.id,Event.User_Id).filter(Event.Type == 'bought')
d = query_to_df(session,q)
grouped = d.groupby('Users_id')
result = grouped.agg({'Event_Type' : np.count_nonzero}).sort('Event_Type')
print result
type(d)
pandas.core.frame.DataFrame
Now we will use pandas to start doing some exploratory analysis. Let's compute some fairly simple statistics on our data.
Exercise:
Steps:
1. Load the data from SQLAlchemy into a pandas data frame.
2. Using pandas, calculate the mean number of events per user.
3. Do a dual group by on each campaign type (facebook, twitter, pinterest, ...) and each event type.
"""
Pandas has very powerful plotting built in alongside matplotlib. Let's generate scatterplots for all of the various user campaign to event types.
Pandas has a lot of built-in tools for data vis; underneath it uses matplotlib. One key thing of note here is that pandas will not actually render your plots for you.
To render plots after a call to something like dataframe.hist(), do the following:
import matplotlib.pyplot as plt
plt.show()
This will allow us to see correlations in events all at once.
Steps:
1. Load the data and do a join on: (Users.Campaign_ID, Event.Type, Users.id, Event.User_Id).
2. If you have more than these columns in your dataframe, subset them to this list of columns.
3. Run an ordinal transform on each column, using the numerical mappings defined earlier. (Look into the dict.get function alongside df[columnname].apply().)
4. Now iterate over every possible combination of columns and plot a scatter plot of each. Render these on the screen all at once. Look into plt.subplot for this.
The end goal here is to transform the string values in each column of the data frame into a numerical representation.
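One way steps 3 and 4 might look on a toy frame (made-up rows; the Agg backend is forced only so the sketch runs headlessly):

```python
import itertools
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# The mapping dicts defined at the top of the exercise
campaign_to_num = {'TW': 1, 'RE': 2, 'FB': 3, 'PI': 4}
event_to_num = {'like': 1, 'share': 2, 'nothing': 3, 'bought': 4}

# Toy frame standing in for the query result
d = pd.DataFrame({
    'Users_Campaign_ID': ['TW', 'FB', 'PI', 'RE'],
    'Event_Type':        ['bought', 'like', 'share', 'nothing'],
})

# Step 3: ordinal transform via dict.get and apply
d['Users_Campaign_ID'] = d['Users_Campaign_ID'].apply(campaign_to_num.get)
d['Event_Type'] = d['Event_Type'].apply(event_to_num.get)

# Step 4: one scatter plot per pair of columns, all on one figure
pairs = list(itertools.combinations(d.columns, 2))
fig, axes = plt.subplots(1, len(pairs), squeeze=False)
for ax, (x, y) in zip(axes[0], pairs):
    ax.scatter(d[x], d[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
plt.show()  # a no-op under Agg; renders the figure interactively otherwise
```

With only two columns there is a single pair; with the full four-column frame the same loop yields six scatter panels.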
def vectorize(df, label_column):
    """
    Vectorize input features wrt a label column: collect every column
    except the label and return their values as the feature matrix.
    """
    feature_names = []
    for feature_name in df.columns.values:
        if feature_name != label_column:
            feature_names.append(feature_name)
    inputs = df[feature_names].values
    return inputs
Machine learning algorithms can only accept numbers. Our specific task here will be to predict a label.
Exercise:
Let's build out a data frame such that we have an outcome label. Set the outcome label to be the event type.
From here, binarize the event type outcome to be bought or not.
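A minimal sketch of the binarization (toy rows; the real frame comes from the Users/Meal/Event join):

```python
import pandas as pd

# Toy frame; the real one comes from the Users/Meal/Event join
d = pd.DataFrame({'Event_Type': ['bought', 'like', 'share', 'bought']})

# Binarize the outcome: 1 if the event was a purchase, 0 otherwise
d['label'] = (d['Event_Type'] == 'bought').astype(int)
print(d)
```

The comparison yields a boolean Series, and `astype(int)` turns it into the 0/1 outcome column a classifier expects.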
Machine learning algorithms typically work better if you scale the data (e.g. standardize each feature to zero mean and unit variance, or squash it into the [0, 1] range).
Exercise:
Query for Users.Campaign_ID, Meal.Type, Event.Type and load it into a dataframe.
Transform the data into numerical form (ordinal, etc.; think about what we did before).
Split out the feature set columns from the outcome label and normalize the given features.
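The normalization step might be sketched with the StandardScaler imported at the top (toy, already-ordinal-encoded feature matrix; made-up values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy, already-ordinal-encoded feature matrix (made-up values)
X = np.array([[1., 4.],
              [2., 1.],
              [3., 2.],
              [4., 3.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Each column of X_scaled now has zero mean and unit variance
```

Fit the scaler on the training split only and reuse it (`scaler.transform`) on the test split, otherwise information leaks from test to train.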