Classifying pediatric IBD stool samples (work in progress)

This notebook is a recoding of the analysis used in the PLoSONE paper: Non-Invasive Mapping of the Gastrointestinal Microbiota Identifies Children with Inflammatory Bowel Disease using python, sklearn and pandas.

We thought that the SLiME package, as it was packaged for the publication of the paper became outdated and should probably not be available anymore. This notebook replaces it, replicating the analysis executed on the paper with more up-to-date tools and with a few extra figures. I hope this can be the starting point for others trying to follow the same approach and improve upon it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

from scipy import interp
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.cross_validation import cross_val_score, StratifiedKFold, train_test_split, KFold
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.preprocessing import LabelEncoder, label_binarize
In [5]:
print('pandas ' + pd.__version__)
print('mpl ' +mpl.__version__)
print('numpy' + np.__version__)
print('seaborn ' + sns.__version__)
pandas 0.17.1
mpl 1.5.0
numpy1.10.2
seaborn 0.7.0

Loading data

The data came from two rounds of 16S sequencing of previously collected stool samples. Here we will use the OTU tables directly, which were created by using the RDP classifier and were subsequently normalized (details in the paper's methods).

Sequencing was performed at the Broad Institute. The first round of sequencing was dubbed CHIMP (Children Hospital IBD Pediatric), while the second round of sequencing -- performed following the request of an anonymous peer reviewer -- was termed blind validation. Its purpose was to further validate the algorithm trained on the CHIMP dataset, as one of the reviewers did not think it was sufficient to use a "leave 20% out" approach on the CHIMP dataset to demonstrate robust prediction.

While these rounds were used as training and test set in the last figure of the paper respectively, it is more useful at this stage to join the two data sets and split them in different random combinations later.

In [6]:
#get the CHIMP training data

X_chimp = pd.read_csv('data/chimp/chimp.Qsorted.rdpout.xtab.norm', delimiter="\t", index_col=0)
y_chimp = pd.read_csv('data/chimp/sampledata.training.chimp.csv', index_col=0)

#just make sure the labels are the same
X_chimp.sort_index(inplace=True)
y_chimp.sort_index(inplace=True)
assert (X_chimp.index == y_chimp.index).all()
In [7]:
## do the same for the blind validation test data
X_blind = pd.read_csv('data/chimp/blind.sorted.rdpout.xtab.norm',
                        delimiter="\t", index_col=0)
y_blind = pd.read_csv('data/chimp/sampledata.validation.blind.csv',
                        index_col=0)

X_blind.sort_index(inplace=True)
y_blind.sort_index(inplace=True)
assert (X_blind.index == y_blind.index).all()
In [8]:
#concatenate using pandas
X = pd.concat([X_chimp, X_blind], keys=['chimp','blind'])
X.head()
Out[8]:
cls_Actinobacteria cls_Alphaproteobacteria cls_Bacilli cls_Bacteroidia cls_Betaproteobacteria cls_Clostridia cls_Cyanobacteria cls_Deltaproteobacteria cls_Epsilonproteobacteria cls_Erysipelotrichi ... phylum_Euryarchaeota phylum_Firmicutes phylum_Fusobacteria phylum_Lentisphaerae phylum_NA phylum_Proteobacteria phylum_Spirochaetes phylum_Synergistetes phylum_TM7 phylum_Verrucomicrobia
chimp 003A 0.000000 0.000549 0.014827 0.002197 0.000275 0.230917 0 0.000000 0 0.000000 ... 0 0.245744 0.023064 0 0.000000 0.728995 0 NaN NaN 0
004A 0.000000 0.000000 0.002486 0.754195 0.000000 0.230889 0 0.000932 0 0.001865 ... 0 0.238036 0.000000 0 0.000000 0.007147 0 NaN NaN 0
005A 0.006521 0.000000 0.026084 0.000000 0.000000 0.908706 0 0.000000 0 0.054451 ... 0 0.990545 0.000000 0 0.000326 0.002608 0 NaN NaN 0
008A 0.000315 0.000000 0.000210 0.837024 0.035995 0.112499 0 0.000000 0 0.000840 ... 0 0.113758 0.000000 0 0.000000 0.048903 0 NaN NaN 0
009A 0.001291 0.000000 0.001550 0.823864 0.024277 0.118027 0 0.000000 0 0.000000 ... 0 0.119576 0.000000 0 0.000000 0.055269 0 NaN NaN 0

5 rows × 284 columns

In [9]:
X.fillna(value=0,inplace=True) #replace NAs with zeroes
In [10]:
y_dx = pd.concat([y_chimp.dx, y_blind.dx], keys=['chimp','blind'])
print(y_dx.head())
print(y_dx.value_counts())
Out[10]:
       sample
chimp  003A      UC
       004A      UC
       005A      NM
       008A      CD
       009A      UC
Name: dx, dtype: object
In [11]:
#convert the training and testing labels to numerical values
le = LabelEncoder()
le.fit(y_dx)
y = le.transform(y_dx)

# just for reference, the columns of the binarized label read respectively:
le.inverse_transform([0,1,2])
Out[11]:
array(['CD', 'NM', 'UC'], dtype=object)

Exploratory data analysis

In [48]:
# map CD,NM and UC to healthy and IBD classes
y_ibd = y_dx.map({'CD':'ibd','UC':'ibd','NM':'healthy'})
y_ibd.value_counts()
Out[48]:
ibd        121
healthy     37
Name: dx, dtype: int64
In [49]:
# split OTU by phylogenetic level
expl = X.copy()
expl.columns = expl.columns.str.partition('_')
expl.head()
Out[49]:
cls ... phylum
_ ... _
Actinobacteria Alphaproteobacteria Bacilli Bacteroidia Betaproteobacteria Clostridia Cyanobacteria Deltaproteobacteria Epsilonproteobacteria Erysipelotrichi ... Euryarchaeota Firmicutes Fusobacteria Lentisphaerae NA Proteobacteria Spirochaetes Synergistetes TM7 Verrucomicrobia
chimp 003A 0.000000 0.000549 0.014827 0.002197 0.000275 0.230917 0 0.000000 0 0.000000 ... 0 0.245744 0.023064 0 0.000000 0.728995 0 0 0 0
004A 0.000000 0.000000 0.002486 0.754195 0.000000 0.230889 0 0.000932 0 0.001865 ... 0 0.238036 0.000000 0 0.000000 0.007147 0 0 0 0
005A 0.006521 0.000000 0.026084 0.000000 0.000000 0.908706 0 0.000000 0 0.054451 ... 0 0.990545 0.000000 0 0.000326 0.002608 0 0 0 0
008A 0.000315 0.000000 0.000210 0.837024 0.035995 0.112499 0 0.000000 0 0.000840 ... 0 0.113758 0.000000 0 0.000000 0.048903 0 0 0 0
009A 0.001291 0.000000 0.001550 0.823864 0.024277 0.118027 0 0.000000 0 0.000000 ... 0 0.119576 0.000000 0 0.000000 0.055269 0 0 0 0

5 rows × 284 columns

In [155]:
# how many healthy people have a given OTU vs how many IBD patients have the same OTU
# we use the mean, dividing by the total number of patients in each group to get percentages
otupresence = expl.astype(bool).groupby(y_ibd).mean()
otup = otupresence.stack(level=[0,2]).reset_index()
plt.figure(figsize=(20,4))
f = sns.stripplot(data=otup,x='level_2',y='_',hue='dx')
f.axes.xaxis.grid(True)
f.axes.yaxis.grid(False)
In [224]:
ibdotu = otupresence.stack(level=[0,2]).unstack('dx').reset_index()
ibdotu.columns.values
ibdotu.columns=['level','name','healthy','ibd']
ibdotu_abd = ibdotu[(ibdotu.healthy < 0.1) & (ibdotu.ibd > 0.05) & (ibdotu.level != 'domain')]
print('We found %s OTUs (out of %s ~= %1.1f %%) that are \n \
present in IBD (> 5%% of patients) and underrepresented (< 10 %%) in healthy controls\n' % \
      (ibdotu_abd.shape[0],ibdotu.shape[0],(ibdotu_abd.shape[0]/ibdotu.shape[0])*100))
print(ibdotu_abd)
We found 30 OTUs (out of 284 ~= 10.6 %) that are 
 present in IBD (> 5% of patients) and underrepresented (< 10 %) in healthy controls

      level                   name   healthy       ibd
8       cls  Epsilonproteobacteria  0.000000  0.090909
10      cls          Flavobacteria  0.081081  0.090909
24   family          Aerococcaceae  0.027027  0.157025
31   family     Campylobacteraceae  0.000000  0.090909
38   family     Corynebacteriaceae  0.027027  0.115702
46   family      Flavobacteriaceae  0.081081  0.090909
54   family       Leptotrichiaceae  0.027027  0.082645
83    genus            Abiotrophia  0.027027  0.157025
88    genus         Actinobacillus  0.000000  0.099174
90    genus        Aggregatibacter  0.000000  0.066116
103   genus        Asaccharobacter  0.081081  0.074380
104   genus              Atopobium  0.000000  0.082645
115   genus          Campylobacter  0.000000  0.090909
116   genus         Capnocytophaga  0.054054  0.082645
124   genus            Citrobacter  0.000000  0.140496
130   genus        Corynebacterium  0.027027  0.115702
160   genus               Kingella  0.027027  0.090909
167   genus           Leptotrichia  0.027027  0.082645
175   genus          Mogibacterium  0.027027  0.074380
177   genus             Morganella  0.000000  0.074380
190   genus             Parvimonas  0.027027  0.165289
192   genus            Pediococcus  0.054054  0.115702
194   genus          Peptoniphilus  0.000000  0.148760
195   genus     Peptostreptococcus  0.054054  0.181818
197   genus          Porphyromonas  0.081081  0.090909
199   genus                Proteus  0.027027  0.123967
219   genus         Staphylococcus  0.027027  0.066116
237   genus              Weissella  0.081081  0.082645
246    ordr      Campylobacterales  0.000000  0.090909
254    ordr       Flavobacteriales  0.081081  0.090909
In [ ]:
 
In [ ]:
# let's make it easier to see.
#multiply ibd percentages by -1
otup.loc[otup['dx'] == 'ibd',"_"] *= -1
In [130]:
plt.figure(figsize=(20,4))
f = sns.stripplot(data=otup,x='level_2',y='_',hue='dx')
f.axes.xaxis.grid(True)
f.axes.yaxis.grid(False)
In [131]:
diff = otup.groupby(['level_1','level_2']).sum().reset_index()
diff.head()
g = sns.FacetGrid(diff[diff.level_1 != 'domain'], col="level_1",sharey=False,size=20, aspect=.3)
g.map(sns.barplot,"_","level_2")
g.set_xlabels('ibd <== enriched in ==> healthy')
Out[131]:
<seaborn.axisgrid.FacetGrid at 0x12f2c5b70>
In [135]:
sns.set_style("whitegrid")
g = sns.FacetGrid(diff[diff.level_1 != 'domain'], col="level_1",sharey=False,size=20, aspect=.3)
g.map(sns.stripplot,"_","level_2")
g.set_xlabels('ibd <== enriched in ==> healthy')

for ax in g.axes.flat:
    # Make the grid horizontal instead of vertical
#     ax.xaxis.grid(False)
    ax.yaxis.grid(True)
Out[135]:
<seaborn.axisgrid.FacetGrid at 0x130481d68>
In [229]:
f = sns.factorplot(x='_',y='level_2',col='dx',data=otup[otup.level_1 == 'genus'],kind='bar',size=20,aspect=.25)
f.set_axis_labels('percentage of patients which has OTU',"")
plt.savefig('OTUs.png')