# glass identification dataset
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
col_names = ['id','ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass = pd.read_csv(url, names=col_names, index_col='id')
glass['assorted'] = glass.glass_type.map({1:0, 2:0, 3:0, 4:0, 5:1, 6:1, 7:1})
glass.head()
id | ri | na | mg | al | si | k | ca | ba | fe | glass_type | assorted
---|---|---|---|---|---|---|---|---|---|---|---
1 | 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0 | 1 | 0
2 | 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0 | 1 | 0
3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0 | 1 | 0
4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0 | 1 | 0
5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0 | 1 | 0
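As a quick sanity check on the new column (all variables here are already defined above), we can confirm how the seven glass types collapse into the binary 'assorted' target:
# types 1-4 map to assorted=0, types 5-7 map to assorted=1
print(glass.glass_type.value_counts().sort_index())
print(glass.assorted.value_counts())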
Pretend that we want to predict ri, and our only feature is al. How would we do it using machine learning? We would frame it as a regression problem, and use a linear regression model with al as the only feature and ri as the response.
How would we visualize this model? Create a scatter plot with al on the x-axis and ri on the y-axis, and draw the line of best fit.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.lmplot(x='al', y='ri', data=glass, ci=None)
<seaborn.axisgrid.FacetGrid at 0x18bebfd0>
If we had an al value of 2, what would we predict for ri? Roughly 1.517.
Exercise: Draw this plot without using Seaborn.
# scatter plot using Pandas
glass.plot(kind='scatter', x='al', y='ri')
<matplotlib.axes._subplots.AxesSubplot at 0x18c94b70>
# scatter plot using Matplotlib
plt.scatter(glass.al, glass.ri)
<matplotlib.collections.PathCollection at 0x19168b70>
# fit a linear regression model
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
feature_cols = ['al']
X = glass[feature_cols]
y = glass.ri
linreg.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
# look at the coefficients to get the equation for the line, but then how do you plot the line?
print(linreg.intercept_)
print(linreg.coef_)
1.52194533024
[-0.00247761]
# you could make predictions for arbitrary points, and then plot a line connecting them
print(linreg.predict([[1]]))
print(linreg.predict([[2]]))
print(linreg.predict([[3]]))
[ 1.51946772]
[ 1.51699012]
[ 1.51451251]
# or you could make predictions for all values of X, and then plot those predictions connected by a line
ri_pred = linreg.predict(X)
plt.plot(glass.al, ri_pred, color='red')
[<matplotlib.lines.Line2D at 0x1a2d89b0>]
# put the plots together
plt.scatter(glass.al, glass.ri)
plt.plot(glass.al, ri_pred, color='red')
[<matplotlib.lines.Line2D at 0x1a2c6080>]
Linear regression equation: $y = \beta_0 + \beta_1x$
# compute prediction for al=2 using the equation
linreg.intercept_ + linreg.coef_ * 2
array([ 1.51699012])
# compute prediction for al=2 using the predict method
linreg.predict([[2]])
array([ 1.51699012])
# examine coefficient for al
list(zip(feature_cols, linreg.coef_))
[('al', -0.0024776063874696243)]
Interpretation: A 1 unit increase in 'al' is associated with a 0.0025 unit decrease in 'ri'.
# increasing al by 1 (so that al=3) decreases ri by 0.0025
1.51699012 - 0.0024776063874696243
1.5145125136125304
# compute prediction for al=3 using the predict method
linreg.predict([[3]])
array([ 1.51451251])
Let's change our task, so that we're predicting assorted using al. Let's visualize the relationship to figure out how to do this:
plt.scatter(glass.al, glass.assorted)
<matplotlib.collections.PathCollection at 0x1a655710>
Let's draw a regression line, like we did before:
# fit a linear regression model and store the predictions
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
linreg.fit(X, y)
assorted_pred = linreg.predict(X)
# scatter plot that includes the regression line
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred, color='red')
[<matplotlib.lines.Line2D at 0x1a8d2080>]
If al=3, what class do we predict for assorted? 1
If al=1.5, what class do we predict for assorted? 0
So, we predict the 0 class for lower values of al, and the 1 class for higher values of al. What's our cutoff value? Around al=2, because that's where the regression line crosses 0.5, the midpoint between class 0 and class 1.
So, we'll say that if assorted_pred >= 0.5, we predict a class of 1, else we predict a class of 0.
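We can also solve for the cutoff exactly: the fitted line predicts 0.5 when al = (0.5 - intercept) / coefficient, which should come out near al=2 (a quick sketch using the model we just fit):
# solve intercept + coefficient * al = 0.5 for the cutoff value of al
cutoff = (0.5 - linreg.intercept_) / linreg.coef_[0]
print(cutoff)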
# understanding np.where
import numpy as np
nums = np.array([5, 15, 8])
# np.where returns the first value if the condition is True, and the second value if the condition is False
np.where(nums > 10, 'big', 'small')
array(['small', 'big', 'small'], dtype='|S5')
# examine the predictions
assorted_pred[:10]
array([ 0.06545853, 0.19576455, 0.28597641, 0.16068216, 0.13562331, 0.32607057, 0.08550561, 0.04039968, 0.20077632, 0.19576455])
# transform predictions to 1 or 0
assorted_pred_class = np.where(assorted_pred >= 0.5, 1, 0)
assorted_pred_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x1a8d20f0>]
What went wrong? This is a line plot, and it connects the points in the order they appear in the DataFrame. Let's sort the DataFrame by "al" to fix this:
# add predicted class to DataFrame
glass['assorted_pred_class'] = assorted_pred_class
# sort DataFrame by al
glass.sort_values('al', inplace=True)
# plot the class predictions again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, glass.assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x1adc1b70>]
Logistic regression can do what we just did:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
feature_cols = ['al']
X = glass[feature_cols]
y = glass.assorted
logreg.fit(X, y)
assorted_pred_class = logreg.predict(X)
# print the class predictions
assorted_pred_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
# plot the class predictions
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_class, color='red')
[<matplotlib.lines.Line2D at 0x1ab0dac8>]
What if we wanted the predicted probabilities instead of just the class predictions, to understand how confident we are in a given prediction?
# store the predicted probabilities of class 1
assorted_pred_prob = logreg.predict_proba(X)[:, 1]
# plot the predicted probabilities
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')
[<matplotlib.lines.Line2D at 0x1b42cd68>]
# examine some example predictions
print(logreg.predict_proba([[1]]))
print(logreg.predict_proba([[2]]))
print(logreg.predict_proba([[3]]))
[[ 0.97161726  0.02838274]]
[[ 0.34361555  0.65638445]]
[[ 0.00794192  0.99205808]]
What is this? The first column indicates the predicted probability of class 0, and the second column indicates the predicted probability of class 1.
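Because these are the probabilities of the only two classes, each row sums to 1. A quick check with the fitted model:
# class 0 and class 1 probabilities in each row add up to 1
logreg.predict_proba(X)[:5].sum(axis=1)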
To interpret the model's coefficients, we first need the concept of odds: the odds of an event are defined as p/(1 - p), the probability that it occurs divided by the probability that it doesn't. Some examples:
# create a table of probability versus odds
table = pd.DataFrame({'probability':[0.1, 0.2, 0.25, 0.5, 0.6, 0.8, 0.9]})
table['odds'] = table.probability/(1 - table.probability)
table
  | probability | odds
---|---|---
0 | 0.10 | 0.111111
1 | 0.20 | 0.250000
2 | 0.25 | 0.333333
3 | 0.50 | 1.000000
4 | 0.60 | 1.500000
5 | 0.80 | 4.000000
6 | 0.90 | 9.000000
What is e? It is the base rate of growth shared by all continually growing processes:
# exponential function: e^1
np.exp(1)
2.7182818284590451
What is a (natural) log? It gives you the time needed to reach a certain level of growth:
# time needed to grow 1 unit to 2.718 units
np.log(2.718)
0.99989631572895199
It is also the inverse of the exponential function:
np.log(np.exp(5))
5.0
# add log-odds to the table
table['logodds'] = np.log(table.odds)
table
  | probability | odds | logodds
---|---|---|---
0 | 0.10 | 0.111111 | -2.197225
1 | 0.20 | 0.250000 | -1.386294
2 | 0.25 | 0.333333 | -1.098612
3 | 0.50 | 1.000000 | 0.000000
4 | 0.60 | 1.500000 | 0.405465
5 | 0.80 | 4.000000 | 1.386294
6 | 0.90 | 9.000000 | 2.197225
Linear regression: continuous response is modeled as a linear combination of the features:
$$y = \beta_0 + \beta_1x$$
Logistic regression: log-odds of a categorical response being "true" (1) is modeled as a linear combination of the features:
$$\log \left({p\over 1-p}\right) = \beta_0 + \beta_1x$$
This is called the logit function.
Probability is sometimes written as $\pi$:
$$\log \left({\pi\over 1-\pi}\right) = \beta_0 + \beta_1x$$
The equation can be rearranged into the logistic function:
$$\pi = \frac{e^{\beta_0 + \beta_1x}} {1 + e^{\beta_0 + \beta_1x}}$$
In other words: the model estimates the log-odds as a linear function of the features, and the logistic function converts that log-odds value into a probability between 0 and 1, which can then be turned into a class prediction.
The logistic function has some nice properties: its output always lies between 0 and 1 (so it can be interpreted as a probability), and it has a characteristic "S" shape.
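To see these properties, here is a small sketch (not tied to our glass model) that plots the logistic function over a range of inputs:
# plot the logistic function to show the "S" shape bounded by 0 and 1
t = np.linspace(-10, 10, 200)
plt.plot(t, np.exp(t) / (1 + np.exp(t)))
plt.xlabel('log-odds')
plt.ylabel('probability')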
Note: scikit-learn's LogisticRegression applies regularization by default, which is why we set C to a very large value (1e9) above; that makes the regularization negligible so the coefficients are close to an unregularized fit.
# plot the predicted probabilities again
plt.scatter(glass.al, glass.assorted)
plt.plot(glass.al, assorted_pred_prob, color='red')
[<matplotlib.lines.Line2D at 0x1b45b978>]
# compute predicted log-odds for al=2 using the equation
logodds = logreg.intercept_ + logreg.coef_[0] * 2
logodds
array([ 0.64722323])
# convert log-odds to odds
odds = np.exp(logodds)
odds
array([ 1.91022919])
# convert odds to probability
prob = odds/(1 + odds)
prob
array([ 0.65638445])
# compute predicted probability for al=2 using the predict_proba method
logreg.predict_proba([[2]])[:, 1]
array([ 0.65638445])
# examine the coefficient for al
list(zip(feature_cols, logreg.coef_[0]))
[('al', 4.1804038614510901)]
Interpretation: A 1 unit increase in 'al' is associated with a 4.18 unit increase in the log-odds of 'assorted'.
# increasing al by 1 (so that al=3) increases the log-odds by 4.18
logodds = 0.64722323 + 4.1804038614510901
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
0.99205808391674566
# compute predicted probability for al=3 using the predict_proba method
logreg.predict_proba([[3]])[:, 1]
array([ 0.99205808])
Bottom line: Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
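We can see this with our fitted model: since the coefficient for 'al' is positive, the predicted probability of 'assorted' rises as 'al' rises (the al_range values below are arbitrary, chosen just for illustration):
# predicted probability of class 1 for a grid of al values; it increases monotonically
al_range = pd.DataFrame({'al': np.linspace(0, 4, 9)})
logreg.predict_proba(al_range)[:, 1]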
# examine the intercept
logreg.intercept_
array([-7.71358449])
Interpretation: For an 'al' value of 0, the log-odds of 'assorted' is -7.71.
# convert log-odds to probability
logodds = logreg.intercept_
odds = np.exp(logodds)
prob = odds/(1 + odds)
prob
array([ 0.00044652])
That makes sense from the plot above, because the probability of assorted=1 should be very low for such a low 'al' value.
Changing the $\beta_0$ value shifts the curve horizontally, whereas changing the $\beta_1$ value changes the slope of the curve.
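Here is a small sketch with made-up coefficient values (not our fitted model) that illustrates both effects:
# logistic curves for a few (beta_0, beta_1) pairs: beta_0 shifts the curve, beta_1 changes its steepness
x = np.linspace(-10, 10, 200)
for b0, b1 in [(0, 1), (2, 1), (0, 3)]:
    plt.plot(x, 1 / (1 + np.exp(-(b0 + b1 * x))), label='beta_0=%s, beta_1=%s' % (b0, b1))
plt.legend()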
Advantages of logistic regression: it is highly interpretable (the coefficients have a clear meaning in terms of log-odds, as shown above), it is fast to train and to use for prediction, and it outputs predicted probabilities rather than only class labels.
Disadvantages of logistic regression: it assumes a linear relationship between the features and the log-odds of the response, so it can't capture highly non-linear decision boundaries on its own, and its performance is often not competitive with more flexible models on complex problems.