In [41]:
!date

Sat Apr 13 13:33:11 PDT 2013

From: [email protected] [mailto:[email protected]] On Behalf Of Tommy Engström
Sent: Wednesday, April 10, 2013 2:40 PM
To: [email protected]
Subject: [pymc] Naive bayes in PyMC

I'm completely new to PyMC so this is most likely a trivial question but I haven't managed to figure it out.

I thought I'd start my pymc journey by building a naive bayes classifier using pymc but I'm having trouble understanding how to do that.

The problem:

1000 observations of 10 attributes + 1 target attribute, all booleans.
How can I fit the network once I have it? I need to be able to set all 11 attributes at once. Or am I just thinking about this completely wrong?

Thankful for any guiding words
In [42]:
import pymc as mc


Naive Bayes in its simplest form is a prediction model for categorical data.

I happen to have code from a recent project which simulates data of this form from an interesting distribution:

In [43]:
import random
def simulate(n, p, seed):
    random.seed(123456+seed)
    mc.np.random.seed(123456+seed)

    # make A clusters, beta distributed
    A = 5
    X_true = mc.rbeta(.5, .5, size=(A,p))
    y_true = mc.rbeta(1, 1, size=A)

    X = zeros((n,p))
    p_true = zeros(n)
    for i in range(n):
        a_i = random.randrange(A)
        X[i] = mc.rbernoulli(X_true[a_i])
        p_true[i] = y_true[a_i]

    y = mc.rbinomial(1, p_true)

    # hold out a quarter of the rows as a test set
    test = random.sample(range(n), n//4)
    train = list(set(range(n)) - set(test))

    X_train = X[train]
    y_train = y[train]

    X_test = X[test]
    y_test = y[test]

    return locals()

In [44]:
n = 1000
p = 10
data = simulate(n=n, p=p, seed=0)


As mentioned on the PyMC list, it is not necessary to use PyMC for naive Bayes prediction, and scikit-learn has a very simple way to handle this:

In [45]:
import sklearn.naive_bayes

In [46]:
nb = sklearn.naive_bayes.BernoulliNB(alpha=0, fit_prior=False)
nb.fit(data['X_train'], data['y_train'])

Out[46]:
BernoulliNB(alpha=0, binarize=0.0, class_prior=None, fit_prior=False)
In [47]:
y_pred = nb.predict_proba(data['X_test'])

In [48]:
# accuracy
mean((y_pred[:,1] >= .5) == data['y_test'])

Out[48]:
0.66000000000000003

But there is nothing wrong with implementing the naive Bayes model in PyMC; it could be a good project for getting started.

Formulating the naive Bayes predictor in PyMC requires writing this machine-learning standard in the language of Bayesian statistics, specifically, the "data likelihood" of $X_j \mid y$: $$X_j \sim \text{Be}(\alpha_j y + \beta_j (1-y)),$$ where $\alpha_j$ is the probability of $X_j$ being 1 when $y$ is 1, and $\beta_j$ is the probability of $X_j$ being 1 when $y$ is 0.

The tradition in ML is not to think much about the prior distribution of $y$, and so I won't: $$y_i \sim Be(.5).$$
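With this likelihood and prior, the posterior for a single test point is available in closed form by Bayes' rule: $\Pr[y=1 \mid x] \propto \tfrac12 \prod_j \alpha_j^{x_j}(1-\alpha_j)^{1-x_j}$, and similarly with $\beta$ for $y=0$. A minimal sketch of that computation (the function name `nb_posterior` is my own, not from the notebook):

```python
import numpy as np

def nb_posterior(x, alpha, beta, prior=0.5):
    """Closed-form naive Bayes posterior Pr[y = 1 | x] for one binary feature vector."""
    # likelihood of x under y=1 and y=0, using the conditional-independence assumption
    like1 = np.prod(alpha**x * (1 - alpha)**(1 - x)) * prior
    like0 = np.prod(beta**x * (1 - beta)**(1 - x)) * (1 - prior)
    return like1 / (like1 + like0)

# tiny hand-checkable case: one feature, observed 1, alpha=.9, beta=.1
print(nb_posterior(np.array([1]), np.array([.9]), np.array([.1])))  # prints 0.9
```

The MCMC formulation below should recover these same posteriors, up to sampling noise.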

In [49]:
n_test = len(data['X_test'])
y = [mc.Bernoulli('y_%d'%i, .5) for i in range(n_test)]

alpha = empty(p)
beta = empty(p)
for j in range(p):
    # alpha[j] is Pr[X_j = 1 | y = 1] in training data
    alpha[j] = (data['X_train'][:,j] * data['y_train']).sum() / data['y_train'].sum()

    # beta[j] is Pr[X_j = 1 | y = 0] in training data
    beta[j] = (data['X_train'][:,j] * (1-data['y_train'])).sum() / (1-data['y_train']).sum()

X = [mc.Bernoulli('X_%d_%d'%(i,j), alpha[j]*y[i] + beta[j]*(1-y[i]),
                  value=data['X_test'][i,j], observed=True)
     for i in range(n_test) for j in range(p)]

In [50]:
mcmc = mc.MCMC([y, X])
%time mcmc.sample(10000)

[****************100%******************]  10000 of 10000 complete CPU times: user 19min 19s, sys: 168 ms, total: 19min 19s
Wall time: 19min 23s
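Nineteen minutes is a lot of computing for this model; the slowness comes from looping over thousands of scalar stochastic nodes in Python. Since each $y_i$ depends only on its own row of `X_test`, the posterior factorizes over test points and every posterior mean is available exactly, all at once. A vectorized numpy sketch, working in log space for numerical stability (`nb_posterior_all` is my own name, and it assumes no $\alpha_j$ or $\beta_j$ is exactly 0 or 1):

```python
import numpy as np

def nb_posterior_all(X, alpha, beta, prior=0.5):
    """Exact Pr[y_i = 1 | X_i] for every row of a binary feature matrix X."""
    # per-row sums of per-feature log-likelihoods, by conditional independence
    log1 = X @ np.log(alpha) + (1 - X) @ np.log(1 - alpha) + np.log(prior)
    log0 = X @ np.log(beta) + (1 - X) @ np.log(1 - beta) + np.log(1 - prior)
    return 1.0 / (1.0 + np.exp(log0 - log1))
```

The posterior means estimated by the chain above should converge to these values, which is what the comparison plot below is checking against the scikit-learn predictions.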

In [51]:
plot([y_i.stats()['mean'] for y_i in y], y_pred[:,1], 's', mec='k', color='none')
plot([0,1], [0,1], 'k--')

Out[51]:
[<matplotlib.lines.Line2D at 0x1b076a10>]