Let's see if Bayes' theorem might be able to help us solve a classification task, namely predicting the species of an iris!
We'll load the iris data into a DataFrame, and round up all of the measurements to the next integer:
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
# load the iris data
iris = load_iris()
# round up the measurements
X = np.ceil(iris.data)
# clean up column names
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]
# read into pandas
df = pd.DataFrame(X, columns=col_names)
# create a list of species using iris.target and iris.target_names
species = [iris.target_names[num] for num in iris.target]
# add the species list as a new DataFrame column
df['species'] = species
# print the head
df.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 6 | 4 | 2 | 1 | setosa |
| 1 | 5 | 3 | 2 | 1 | setosa |
| 2 | 5 | 4 | 2 | 1 | setosa |
| 3 | 5 | 4 | 2 | 1 | setosa |
| 4 | 5 | 4 | 2 | 1 | setosa |
Let's say that I had an out-of-sample observation with the following measurements: 7, 3, 5, 2. I want to predict the species of this iris. How might I do that?
We'll first examine all observations in the training data with those measurements:
# show all observations with features: 7, 3, 5, 2
df[(df.sepal_length==7) & (df.sepal_width==3) & (df.petal_length==5) & (df.petal_width==2)]
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 54 | 7 | 3 | 5 | 2 | versicolor |
| 58 | 7 | 3 | 5 | 2 | versicolor |
| 63 | 7 | 3 | 5 | 2 | versicolor |
| 68 | 7 | 3 | 5 | 2 | versicolor |
| 72 | 7 | 3 | 5 | 2 | versicolor |
| 73 | 7 | 3 | 5 | 2 | versicolor |
| 74 | 7 | 3 | 5 | 2 | versicolor |
| 75 | 7 | 3 | 5 | 2 | versicolor |
| 76 | 7 | 3 | 5 | 2 | versicolor |
| 77 | 7 | 3 | 5 | 2 | versicolor |
| 87 | 7 | 3 | 5 | 2 | versicolor |
| 91 | 7 | 3 | 5 | 2 | versicolor |
| 97 | 7 | 3 | 5 | 2 | versicolor |
| 123 | 7 | 3 | 5 | 2 | virginica |
| 126 | 7 | 3 | 5 | 2 | virginica |
| 127 | 7 | 3 | 5 | 2 | virginica |
| 146 | 7 | 3 | 5 | 2 | virginica |
# count the species for these observations
df[(df.sepal_length==7) & (df.sepal_width==3) & (df.petal_length==5) & (df.petal_width==2)].species.value_counts()
versicolor    13
virginica      4
dtype: int64
# count the species for all observations
df.species.value_counts()
setosa        50
versicolor    50
virginica     50
dtype: int64
Okay, so how might Bayes' theorem help us here?
Let's frame this as a conditional probability: What is the probability of some particular class, given the measurements 7, 3, 5, 2 (abbreviated as 7352 below)?
$$P(class | 7352)$$

We could calculate this conditional probability for each of the three classes, and then predict the class with the highest probability:

$$P(setosa | 7352)$$
$$P(versicolor | 7352)$$
$$P(virginica | 7352)$$

Let's start with versicolor:

$$P(versicolor | 7352) = \frac {P(7352 | versicolor) \times P(versicolor)} {P(7352)}$$

We'll calculate each of the terms on the right side of the equation:

$$P(7352 | versicolor) = \frac {13} {50} = 0.26$$
$$P(versicolor) = \frac {50} {150} = 0.33$$
$$P(7352) = \frac {17} {150} = 0.11$$

Therefore, Bayes' theorem says the probability of versicolor given these measurements is:

$$P(versicolor | 7352) = \frac {0.26 \times 0.33} {0.11} = 0.76$$

Let's repeat this process for the other two classes, though we already know that versicolor will have the highest probability:

$$P(virginica | 7352) = \frac {0.08 \times 0.33} {0.11} = 0.24$$
$$P(setosa | 7352) = \frac {0 \times 0.33} {0.11} = 0$$

In summary, we framed a classification problem as three conditional probability equations, we used Bayes' theorem to solve those equations, and then we made a prediction by choosing the class with the highest conditional probability.
Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem actually makes intuitive sense:
Pretend that more of the existing versicolors were 7352: the likelihood $P(7352 | versicolor)$ would increase, and so would the conditional probability of versicolor.
Pretend that most of the existing irises were versicolor: the prior $P(versicolor)$ would increase, and so would the conditional probability of versicolor.
Pretend that 17 of the setosas were 7352: the evidence $P(7352)$ would double to 34/150, which would cut the conditional probability of versicolor in half.
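These intuitions can be checked numerically with a small helper function. The adjusted counts below are hypothetical, not from the real data:

```python
def posterior(n_match_class, n_class, n_match_total, n_total):
    """P(class | 7352) via Bayes' theorem, computed from raw counts."""
    likelihood = n_match_class / n_class   # P(7352 | class)
    prior = n_class / n_total              # P(class)
    evidence = n_match_total / n_total     # P(7352)
    return likelihood * prior / evidence

# baseline from the real counts: 13 matching versicolors out of 17 matches
print(posterior(13, 50, 17, 150))   # versicolor: 13/17, about 0.76

# hypothetical: 17 of the setosas were also 7352, so 34 matches in total
print(posterior(13, 50, 34, 150))   # versicolor falls to 13/34, about 0.38
print(posterior(17, 50, 34, 150))   # setosa rises to 17/34, exactly 0.5
```

Notice that the formula algebraically reduces to `n_match_class / n_match_total`: Bayes' theorem gives exactly the intuitive "fraction of matching observations that belong to this class" answer.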