In this notebook we illustrate how slalom can be used to identify biological drivers on the mESC cell cycle staged dataset.
First, we load some modules and set some directories
import os
import slalom
import pdb
from slalom import plotFactors, plotRelevance, plotLoadings, saveFA, dumpFA
%pylab inline
#specify where the hdf5 file is
data_dir = '../data/'
out_dir = './results'
Populating the interactive namespace from numpy and matplotlib
slalom expects an expression file, typically with log transformed gene expression values as well as a gene set annotation. These data can be provided as single hdf5 file, which can be generated using separate R scripts (in the R folder). Alternatively, the expression matrix and the annotation can be loaded as text files in python.
By default we here load the MsigDB annotation, which needs to be downloaded separately from the Broad web page.
####
#Option 1: load a pre-defined hdf5 file
#We provide an (optional) hdf file with the required data - this was generated using
#the R script write_slalom.R in the R folder
if 0:
annoDB = 'MSigDB'
dFile = os.path.join(data_dir,'Buettneretal2015.hdf5')
data = slalom.load_hdf5(dFile, anno=annoDB)
####
####
#Option 2: load the annotation and the data directly
if 1:
#Annotation file
annoFile = os.path.join(data_dir,'h.all.v5.0.symbols.gmt.txt') #MSigDB
annoDB = 'MSigDB'
if not os.path.exists(annoFile):
raise Exception("MSigDB annotation file needs to be downloaded manually")
#Note: the license of MSigDB does not permit redistribution of the raw annotation files.
#Please register at http://software.broadinstitute.org/gsea/msigdb and download the
#hallmark gene sets and place the file in data folder.
#dataFile: csv file with log expresison values
dataFile = os.path.join(data_dir,'Buettneretal.csv.gz') # note that the first column (row names) contains the cell cycle stage in numeric form
data = slalom.utils.load_txt(dataFile=dataFile,annoFiles=annoFile,annoDBs=annoDB)
####
###alternatively the data can be loaded from the provided hdf5 file
#dFile = 'Buettneretal2015.hdf5'
#data = slalom.load_hdf5(dFile, data_dir=data_dir)
#print statistics for the loaded dataset
print ("Loaded {:d} cells, {:d} genes".format(data['Y'].shape[0],data['Y'].shape[1]))
print ("Annotation: {:d} terms".format(len(data['terms'])))
Loaded 182 cells, 6635 genes Annotation: 50 terms
#I: indicator matrix that assigns genes to pathways
I = data['I'] #if loaded from the hdf file change to I = data['IMSigDB']
#Y: log expresison values
Y = data['Y']
#terms: ther names of the terms
terms = data['terms']
#gene_ids: the ids of the genes in Y
gene_ids = data['genes']
#initialize FA instance, here using a Gaussian noise model and fitting 3 dense hidden factors
FA = slalom.initFA(Y, terms,I, gene_ids=gene_ids, noise='gauss', nHidden=3, minGenes=15)
/Users/flo/projects/Auto_Bionf/scLVM2/slalom/core.py:392: RuntimeWarning: divide by zero encountered in true_divide logPi = SP.log(self.Pi[:,m]/(1-self.Pi[:,m])) /Users/flo/projects/Auto_Bionf/scLVM2/slalom/core.py:394: RuntimeWarning: divide by zero encountered in true_divide logPi = SP.log(self.Pi[:,m]/(1-self.Pi[:,m]))
#model training
FA.train()
#print diagnostics
FA.printDiagnostics()
iteration 0 iteration 100 iteration 200 iteration 300 iteration 400 iteration 500 iteration 600 iteration 700 Converged after 701 iterations Maximally 4.6875 % Genes per factor changed.
Next, we plot the results, including factor relevance, gene set augmentation and a scatter plot of the two most relevant factors, in this case G2M Checkpoint and P53 Pathway.
#plot results
fig = plotRelevance(FA)
#get factors; analogous getters are implemented for relevance and weights (see docs)
X = FA.getX(terms=['G2m checkpoint','P53 pathway'])
#scatter plot of the top two factors
fig = plotFactors(X=X, lab=data['lab'], terms=['G2m checkpoint','P53 pathway'], isCont=False)
#the same plot can be generated by passing the FA object to the plot function
#plotFactors(FA=FA,idx1=0,idx2=1, lab = data['lab'], isCont=False )
#visualize changes for the G2m checkpoint
fig = plotLoadings(FA, 'G2m checkpoint', n_genes=20)
Finally, we can safe the model results as CSV files or using a .hdf5 file.
#if not os.path.exists(out_dir):
# os.makedirs(out_dir)
#out_file_name = os.path.join(out_dir,FA.getName()+'.hdf5')
#saveFA(FA, out_name=out_file_name