# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass['assorted'] = glass.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})
glass.head()
| id | ri | na | mg | al | si | k | ca | ba | fe | glass_type | assorted |
|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 | 0 |
2 | 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 | 0 |
3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 | 0 |
4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 | 0 |
5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 | 0 |
Pretend that we want to predict ri, and our only feature is al. How would we do it using machine learning? We would frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.
How would we visualize this model? Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.lmplot(x='al', y='ri', data=glass, ci=None)
<seaborn.axisgrid.FacetGrid at 0x12ec7a6d0>
If we had an al value of 2, what would we predict for ri? Roughly 1.517.
# Exercise: Draw the scatter plot using Pandas.
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')
<matplotlib.axes._subplots.AxesSubplot at 0x130c7e250>
# fit a linear regression model to predict ri from al
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# look at the coefficients to get the equation for the line, but then how do you plot the line?
print linreg.intercept_
print linreg.coef_
1.52194533024
[-0.00247761]
# you could make predictions for arbitrary points, and then plot a line connecting them
print linreg.predict(1)
print linreg.predict(2)
print linreg.predict(3)
[ 1.51946772]
[ 1.51699012]
[ 1.51451251]
# or you could make predictions for all values of X, and then plot those predictions connected by a line
ri_pred = linreg.predict(X)
# draw regression line with matplotlib and pandas
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, ri_pred, color='red')
[<matplotlib.lines.Line2D at 0x130d3f510>]
Linear regression equation: $y = \beta_0 + \beta_1x$
# compute prediction for al=2 using the predict method
linreg.predict(2)
array([ 1.51699012])
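We can sanity-check that number against the regression equation itself, using the intercept and coefficient printed above:
# same prediction computed by hand from the equation: y = intercept + coef * al
linreg.intercept_ + linreg.coef_[0] * 2
This works out to 1.52195 + (-0.00248 × 2) ≈ 1.51699, matching the predict output.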
# examine coefficient for al
pd.DataFrame(zip(feature_cols, linreg.coef_), columns=['feature', 'coef'])
| | feature | coef |
|---|---|---|
0 | al | -0.002478 |
# Note that we can't use cross_val_score if we want to investigate variable relationships, since it doesn't hand back a fitted model (and its coefficients)
Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.
# compute prediction for al=3 using the predict method
linreg.predict(3)
array([ 1.51451251])
Let's change our task, so that we're predicting assorted using al. Let's visualize the relationship to figure out how to do this:
plt.scatter(glass.al, glass.assorted)
<matplotlib.collections.PathCollection at 0x1323058d0>
Let's draw a regression line, like we did before:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
linreg.fit(X, y)
assorted_pred = linreg.predict(X)
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred, color='red')
[<matplotlib.lines.Line2D at 0x132156310>]
If al=3, what class do we predict for assorted? 1
If al=1.5, what class do we predict for assorted? 0
So, we predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the linear regression line crosses the midpoint between predicting class 0 and class 1.
So, we'll say that if assorted_pred >= 0.5, we predict a class of 1, else we predict a class of 0.
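We can also solve for that cutoff exactly: set the regression equation equal to 0.5 and solve for al.
# solve intercept + coef * al = 0.5 for al to find the exact cutoff
(0.5 - linreg.intercept_) / linreg.coef_[0]
This should land right around al = 2, consistent with the plot.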
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])
# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')
array(['small', 'big', 'small'], dtype='|S5')
# examine the predictions
assorted_pred[:10]
array([ 0.06545853, 0.19576455, 0.28597641, 0.16068216, 0.13562331, 0.32607057, 0.08550561, 0.04039968, 0.20077632, 0.19576455])
# transform predictions to 1 or 0
assorted_pred_class = np.where(assorted_pred >= 0.5, 1, 0)
assorted_pred_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x132413a10>]
What went wrong? This is a line plot, and it connects points in the order they are found. Let's sort the DataFrame by "al" to fix this:
# add predicted class to DataFrame
glass['assorted_pred_class'] = assorted_pred_class
# sort DataFrame by al
glass.sort_values('al', inplace=True)
# plot the class predictions again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x13270ea50>]
Logistic regression can do what we just did, but better.
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
logreg.fit(X, y)
assorted_pred_class = logreg.predict(X)
# print the class predictions
assorted_pred_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x130ad3f90>]
What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?
# store the predicted probabilites of class 1
assorted_pred_prob = logreg.predict_proba(X)[:, 1]
# plot the predicted probabilities
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')
[<matplotlib.lines.Line2D at 0x1331ca9d0>]
# examine some example predictions
print logreg.predict_proba(1)
print logreg.predict_proba(2)
print logreg.predict_proba(3)
[[ 0.89253652  0.10746348]]
[[ 0.52645662  0.47354338]]
[[ 0.12953623  0.87046377]]
What is this? The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.
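A quick sanity check (the variable name probs is just illustrative): the two columns of each row sum to 1, and for this binary problem the class predictions from predict are effectively the class-1 probabilities thresholded at 0.5.
# each row of predict_proba is a probability distribution over the two classes
probs = logreg.predict_proba(X)
print(probs[:3])
print(probs.sum(axis=1)[:3])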
Examples of converting probability to odds, where odds = probability / (1 - probability):
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table
| | probability | odds |
|---|---|---|
0 | 0.10 | 0.111111 |
1 | 0.20 | 0.250000 |
2 | 0.25 | 0.333333 |
3 | 0.50 | 1.000000 |
4 | 0.60 | 1.500000 |
5 | 0.80 | 4.000000 |
6 | 0.90 | 9.000000 |
What is e? It is the base rate of growth shared by all continually growing processes:
# exponential function: e^1
e = np.exp(1)
e
2.7182818284590451
What is a (natural) log? It gives you the time needed to reach a certain level of growth:
# time needed to grow 1 unit to 2.718 units
np.log(e)
1.0
It is also the inverse of the exponential function:
np.log(np.exp(5))
5.0
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
| | probability | odds | logodds |
|---|---|---|---|
0 | 0.10 | 0.111111 | -2.197225 |
1 | 0.20 | 0.250000 | -1.386294 |
2 | 0.25 | 0.333333 | -1.098612 |
3 | 0.50 | 1.000000 | 0.000000 |
4 | 0.60 | 1.500000 | 0.405465 |
5 | 0.80 | 4.000000 | 1.386294 |
6 | 0.90 | 9.000000 | 2.197225 |
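Going back the other way as a check (the column name prob_recovered is just illustrative): exponentiating the log-odds recovers the odds, and odds/(1 + odds) recovers the original probabilities.
# invert the transformations: log-odds -> odds -> probability
table['prob_recovered'] = np.exp(table.logodds) / (1 + np.exp(table.logodds))
table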
Linear regression: a continuous response is modeled as a linear combination of the features:
$$y = \beta_0 + \beta_1x$$
Logistic regression: the log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:
$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$
This is called the logit function. Probability is sometimes written as $\pi$:
$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$
The equation can be rearranged into the logistic function:
$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
In other words: logistic regression outputs the probability of a specific class, and that probability can be converted into a class prediction.
The logistic function has some nice properties: it takes on an "s" shape, and its output is always bounded between 0 and 1, so it can be interpreted as a probability.
Notes: when there are more than two classes, multinomial logistic regression is used; the coefficients themselves are estimated using maximum likelihood.
# plot the predicted probabilities again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')
[<matplotlib.lines.Line2D at 0x1334d3d10>]
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_ * 2
logodds
array([[-0.10592543]])
# convert log-odds to odds
odds = np.exp(logodds)
odds
array([[ 0.89949172]])
# convert odds to probability
prob = odds/(1 + odds)
prob
array([[ 0.47354338]])
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba(2)[:, 1]
array([ 0.47354338])
# examine the coefficient for al
pd.DataFrame(zip(feature_cols, logreg.coef_), columns=['feature', 'coef'])
| | feature | coef |
|---|---|---|
0 | al | [2.01099096417] |
Interpretation: A 1 unit increase in 'al' is associated with a 2.0109 unit increase in the log-odds of 'assorted'.
# increasing al by 1 (so that al=3) increases the log-odds by 2.0109
# the -0.10592543 is the logodds we calculated a few cells ago for al=2
# I am stepping through the equation by one "unit" of al
logodds = -0.10592543 + 2.0109
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
0.87045351351387434
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba(3)[:, 1]
array([ 0.87046377])
Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
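Because the coefficients live on the log-odds scale, exponentiating a coefficient turns it into an odds multiplier:
# each 1 unit increase in al multiplies the odds of assorted by e^2.011, roughly 7.5
np.exp(logreg.coef_[0, 0])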
# examine the intercept
logreg.intercept_
array([-4.12790736])
Interpretation: For an 'al' value of 0, the log-odds of 'assorted' is -4.127
# convert log-odds to probability
# Probability of assorted is low if al = 0
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
array([ 0.01586095])
That makes sense from the plot above, because the probability of assorted=1 should be very low for such a low 'al' value.
Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.
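A small sketch to visualize that (the b0 and b1 values below are just hand-picked for illustration):
# plot the logistic function pi = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) for a few coefficient choices
x_vals = np.linspace(0, 4, 100)
for b0, b1 in [(-4, 2), (-2, 2), (-4, 4)]:
    pi = np.exp(b0 + b1 * x_vals) / (1 + np.exp(b0 + b1 * x_vals))
    plt.plot(x_vals, pi, label='b0=%s, b1=%s' % (b0, b1))
plt.xlabel('al')
plt.ylabel('predicted probability of assorted')
plt.legend()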
Advantages of logistic regression: it is highly interpretable, model training and prediction are fast, no tuning is required (excluding regularization), features don't need scaling, and it outputs well-calibrated predicted probabilities.
Disadvantages of logistic regression: it presumes a linear relationship between the features and the log-odds of the response, its performance is generally not competitive with the best supervised learning methods, and it can't automatically learn feature interactions.
from sklearn import metrics
preds = logreg.predict(X)
print metrics.confusion_matrix(y, preds)
# Note that we can't make this matrix using cross_val_score, so a train_test_split (or, as here, the training set itself) has to do!
[[160   3]
 [ 31  20]]
print metrics.classification_report(y, preds)
             precision    recall  f1-score   support

          0       0.84      0.98      0.90       163
          1       0.87      0.39      0.54        51

avg / total       0.85      0.84      0.82       214
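Reading those numbers off the confusion matrix by hand (rows are the true classes, columns are the predicted classes; tn/fp/fn/tp below are just the four cells copied out):
# recompute the headline metrics by hand from the confusion matrix
tn, fp, fn, tp = 160, 3, 31, 20
print('accuracy: %.3f' % (float(tn + tp) / (tn + fp + fn + tp)))   # (160 + 20) / 214
print('recall for class 1: %.3f' % (float(tp) / (tp + fn)))        # 20 / 51
print('precision for class 1: %.3f' % (float(tp) / (tp + fp)))     # 20 / 23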
# MORE DATA
# Logistic Regression is a high bias, low variance model; it is also parametric (it learns a fixed set of coefficients)
from sklearn.datasets import make_circles
from sklearn.cross_validation import cross_val_score
circles_X, circles_y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)
plt.scatter(circles_X[:,0], circles_X[:,1])
<matplotlib.collections.PathCollection at 0x133a5fd50>
# It has a linear decision boundary, i.e. the shapes it draws between classes are lines!
from matplotlib.colors import ListedColormap
import numpy as np
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# we create an instance of LogisticRegression and fit the data.
logreg = LogisticRegression()
logreg.fit(circles_X, circles_y)
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = circles_X[:, 0].min() - 1, circles_X[:, 0].max() + 1
y_min, y_max = circles_X[:, 1].min() - 1, circles_X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
plt.scatter(circles_X[:, 0], circles_X[:, 1], c=circles_y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Circle classification Logistic Regression")
plt.show()
logreg = LogisticRegression()
cross_val_score(logreg, circles_X, circles_y, cv=5, scoring='accuracy').mean()
# lame
0.48899999999999999
from sklearn.neighbors import KNeighborsClassifier # compare to knn
knn = KNeighborsClassifier(n_neighbors=7)
cross_val_score(knn, circles_X, circles_y, cv=5, scoring='accuracy').mean()
# not as lame, remember?
1.0
from sklearn import datasets
# new dataset, handwritten digits!
digits = datasets.load_digits()
digits.data
array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ...,
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]])
plt.imshow(digits.images[-5], cmap=plt.cm.gray_r, interpolation='nearest')
# the number 9
digits.target[-5]
9
digits.data.shape
# 1,797 observations, 64 features (8 x 8 image)
(1797, 64)
digits_X, digits_y = digits.data, digits.target
logreg = LogisticRegression()
cross_val_score(logreg, digits_X, digits_y, cv=5, scoring='accuracy').mean()
0.92101881133607011
# compare to KNN
knn = KNeighborsClassifier(n_neighbors=5)
cross_val_score(knn, digits_X, digits_y, cv=5, scoring='accuracy').mean()
0.9627899114966898
# Thought Exercise: why would KNN potentially be a better model than logistic regression
# for handwriting?
# OK so wait, when should we use Logistic Regression?
# Using a dataset from a 1978 survey conducted to measure the likelihood of women having extramarital affairs
# http://statsmodels.sourceforge.net/stable/datasets/generated/fair.html
import statsmodels.api as sm
affairs_df = sm.datasets.fair.load_pandas().data
affairs_df.head()
| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs |
|---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 |
3 | 4.0 | 37.0 | 16.5 | 4.0 | 3.0 | 16.0 | 5.0 | 5.0 | 0.727273 |
4 | 5.0 | 27.0 | 9.0 | 1.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.666666 |
affairs_df['affair_binary'] = (affairs_df['affairs'] > 0)
sns.heatmap(affairs_df.corr())
<matplotlib.axes._subplots.AxesSubplot at 0x134050a90>
affairs_df.corr()
# Obviously affairs will correlate with affair_binary, but what else?
# It seems children, yrs_married, rate_marriage, and age all correlate with affair_binary
# Remember, correlations are NOT the only way to identify which features to use
# Correlations only give us a number describing how linearly related two variables are (see the tiny example after the table below)
# We may find another variable that affects affairs by evaluating the coefficients of our LR
| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | affair_binary |
|---|---|---|---|---|---|---|---|---|---|---|
rate_marriage | 1.000000 | -0.111127 | -0.128978 | -0.129161 | 0.078794 | 0.079869 | 0.039528 | 0.027745 | -0.178068 | -0.331776 |
age | -0.111127 | 1.000000 | 0.894082 | 0.673902 | 0.136598 | 0.027960 | 0.106127 | 0.162567 | -0.089964 | 0.146519 |
yrs_married | -0.128978 | 0.894082 | 1.000000 | 0.772806 | 0.132683 | -0.109058 | 0.041782 | 0.128135 | -0.087737 | 0.203109 |
children | -0.129161 | 0.673902 | 0.772806 | 1.000000 | 0.141845 | -0.141918 | -0.015068 | 0.086660 | -0.070278 | 0.159833 |
religious | 0.078794 | 0.136598 | 0.132683 | 0.141845 | 1.000000 | 0.032245 | 0.035746 | 0.004061 | -0.125933 | -0.129299 |
educ | 0.079869 | 0.027960 | -0.109058 | -0.141918 | 0.032245 | 1.000000 | 0.382286 | 0.183932 | -0.017740 | -0.075280 |
occupation | 0.039528 | 0.106127 | 0.041782 | -0.015068 | 0.035746 | 0.382286 | 1.000000 | 0.201156 | 0.004469 | 0.028981 |
occupation_husb | 0.027745 | 0.162567 | 0.128135 | 0.086660 | 0.004061 | 0.183932 | 0.201156 | 1.000000 | -0.015614 | 0.017637 |
affairs | -0.178068 | -0.089964 | -0.087737 | -0.070278 | -0.125933 | -0.017740 | 0.004469 | -0.015614 | 1.000000 | 0.464046 |
affair_binary | -0.331776 | 0.146519 | 0.203109 | 0.159833 | -0.129299 | -0.075280 | 0.028981 | 0.017637 | 0.464046 | 1.000000 |
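To see why correlation only captures linear relationships, here is a tiny illustration (demo is just a throwaway DataFrame): a relationship can be perfectly predictable yet have essentially zero correlation.
# a perfectly predictable but non-linear relationship can still have ~zero correlation
demo = pd.DataFrame({'x': np.arange(-5, 6)})
demo['y'] = demo.x ** 2
demo.corr()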
affairs_X = affairs_df.drop(['affairs', 'affair_binary'], axis=1)
affairs_y = affairs_df['affair_binary']
model = LogisticRegression()
from sklearn.cross_validation import cross_val_score
# check the cross-validated accuracy
scores = cross_val_score(model, affairs_X, affairs_y, cv=10)
print scores
print scores.mean()
# Looks pretty good
[ 0.71630094  0.69749216  0.74137931  0.71226415  0.70125786  0.73113208
  0.71855346  0.70125786  0.74685535  0.75314465]
0.72196378226
# Explore individual features that make the biggest impact
# religious, yrs_married, and occupation stand out. But one of these variables doesn't quite make sense, right?
pd.DataFrame(zip(affairs_X.columns, np.transpose(model.coef_)))
| | 0 | 1 |
|---|---|---|
0 | rate_marriage | [-0.702300201706] |
1 | age | [-0.0546769400998] |
2 | yrs_married | [0.105079088955] |
3 | children | [-0.00117231032] |
4 | religious | [-0.367121091053] |
5 | educ | [-0.0328106363897] |
6 | occupation | [0.161411859069] |
7 | occupation_husb | [0.0145734984752] |
# Dummy Variables:
# Encoding qualitative (nominal) data using separate columns (see the linear regression slides for more)
occupation_dummies = pd.get_dummies(affairs_df['occupation'], prefix='occ_').iloc[:, 1:]
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_dummies], axis=1)
affairs_df.head()
| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | affair_binary | occ__2.0 | occ__3.0 | occ__4.0 | occ__5.0 | occ__6.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 | True | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 4.0 | 37.0 | 16.5 | 4.0 | 3.0 | 16.0 | 5.0 | 5.0 | 0.727273 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 5.0 | 27.0 | 9.0 | 1.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.666666 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
occupation_dummies = pd.get_dummies(affairs_df['occupation_husb'], prefix='occ_husb_').iloc[:, 1:]
# concatenate the dummy variable columns onto the original DataFrame (axis=0 means rows, axis=1 means columns)
affairs_df = pd.concat([affairs_df, occupation_dummies], axis=1)
affairs_df.head()
| | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | affair_binary | occ__2.0 | occ__3.0 | occ__4.0 | occ__5.0 | occ__6.0 | occ_husb__2.0 | occ_husb__3.0 | occ_husb__4.0 | occ_husb__5.0 | occ_husb__6.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.0 | 32.0 | 9.0 | 3.0 | 3.0 | 17.0 | 2.0 | 5.0 | 0.111111 | True | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 3.0 | 27.0 | 13.0 | 3.0 | 1.0 | 14.0 | 3.0 | 4.0 | 3.230769 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 4.0 | 22.0 | 2.5 | 0.0 | 1.0 | 16.0 | 3.0 | 5.0 | 1.400000 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 4.0 | 37.0 | 16.5 | 4.0 | 3.0 | 16.0 | 5.0 | 5.0 | 0.727273 | True | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 5.0 | 27.0 | 9.0 | 1.0 | 1.0 | 14.0 | 3.0 | 4.0 | 4.666666 | True | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
# remove the appropriate columns for the feature set
affairs_X = affairs_df.drop(['affairs', 'affair_binary', 'occupation', 'occupation_husb'], axis=1)
affairs_y = affairs_df['affair_binary']
model = LogisticRegression()
model = model.fit(affairs_X, affairs_y)
# check the accuracy on the training set
model.score(affairs_X, affairs_y)
0.72588752748978946
pd.DataFrame(zip(affairs_X.columns, np.transpose(model.coef_)), columns = ['features', 'coef'])
| | features | coef |
|---|---|---|
0 | rate_marriage | [-0.697845453825] |
1 | age | [-0.0563368031972] |
2 | yrs_married | [0.103893444136] |
3 | children | [0.0181853982481] |
4 | religious | [-0.368506616998] |
5 | educ | [0.00864804494766] |
6 | occ__2.0 | [0.298118794658] |
7 | occ__3.0 | [0.608150180777] |
8 | occ__4.0 | [0.346511273036] |
9 | occ__5.0 | [0.942259498161] |
10 | occ__6.0 | [0.918150144304] |
11 | occ_husb__2.0 | [0.219957140288] |
12 | occ_husb__3.0 | [0.32476602929] |
13 | occ_husb__4.0 | [0.189354154353] |
14 | occ_husb__5.0 | [0.21309298898] |
15 | occ_husb__6.0 | [0.214179979671] |
# compare KNN to LR
knn = KNeighborsClassifier(n_neighbors=7)
cross_val_score(knn, affairs_X, affairs_y, cv=5, scoring='accuracy').mean()
0.68630248906529234
logreg = LogisticRegression()
cross_val_score(logreg, affairs_X, affairs_y, cv=5, scoring='accuracy').mean()
0.72558005785768587
# When we want to investigate the relationship between individual features and a categorical response,
# logistic regression has a good shot :)
# KNN relies on the entire n-dimensional feature space to make predictions, while LR uses its model parameters
# to focus on one or more particular features
# LR therefore has a concept of the "importance" of features (see the sketch below)
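A minimal sketch that makes that "importance" concrete (coef_table is just an illustrative name, and since the features aren't scaled the magnitudes are only a rough guide): rank the features of the dummy-variable model by the absolute value of their coefficients.
# rank features by the magnitude of their logistic regression coefficients
coef_table = pd.DataFrame({'feature': affairs_X.columns, 'coef': model.coef_[0]})
coef_table['abs_coef'] = coef_table['coef'].abs()
coef_table.sort_values('abs_coef', ascending=False)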
# Final Thought Experiment
# Why might KNN (a kind of look-alike model) not perform well here?