In this notebook we illustrate how slalom can be used to identify biological drivers on the mESC cell cycle staged dataset.
First, we load some modules and set some directories
import os
import slalom
import pdb
from slalom import plotFactors, plotRelevance, plotLoadings, saveFA, dumpFA
%pylab inline
#specify where the hdf5 file is
data_dir = '../data/'
out_dir = './results'
Populating the interactive namespace from numpy and matplotlib
slalom expects an expression file, typically with log transformed gene expression values as well as a gene set annotation. These data can be provided as single hdf5 file, which can be generated using separate R scripts (in the R folder). Alternatively, the expression matrix and the annotation can be loaded as text files in python.
By default we here load the MsigDB annotation, which needs to be downloaded separately from the Broad web page.
####
#Option 1: load a pre-defined hdf5 file
#We provide an (optional) hdf file with the required data - this was generated using
#the R script write_slalom.R in the R folder
if 0:
annoDB = 'MSigDB'
dFile = os.path.join(data_dir,'Buettneretal2015.hdf5')
data = slalom.load_hdf5(dFile, anno=annoDB)
####
####
#Option 2: load the annotation and the data directly
if 1:
#Annotation file
#annoFile = os.path.join(data_dir,'h.all.v5.0.symbols.gmt.txt') #MSigDB
#annoDB = 'MSigDB'
#if not os.path.exists(annoFile):
# raise Exception("MSigDB annotation file needs to be downloaded manually")
#Note: the license of MSigDB does not permit redistribution of the raw annotation files.
#Please register at http://software.broadinstitute.org/gsea/msigdb and download the
#hallmark gene sets and place the file in data folder.
annoFile = os.path.join(data_dir,'c2.cp.reactome.v4.0.symbols.gmt.txt') #REACTOME
annoDB = 'REACTOME'
#dataFile: csv file with log expresison values
dataFile = os.path.join(data_dir,'Buettneretal.csv.gz')
data = slalom.utils.load_txt(dataFile=dataFile,annoFiles=annoFile,annoDBs=annoDB)
####
###alternatively the data can be loaded from the provided hdf5 file
#dFile = 'Buettneretal2015.hdf5'
#data = slalom.load_hdf5(dFile, data_dir=data_dir)
#print statistics for the loaded dataset
print ("Loaded {:d} cells, {:d} genes".format(data['Y'].shape[0],data['Y'].shape[1]))
print ("Annotation: {:d} terms".format(len(data['terms'])))
Data file loaded 0 terms out of 674 terms loaded for current annotation file 10 terms out of 674 terms loaded for current annotation file 20 terms out of 674 terms loaded for current annotation file 30 terms out of 674 terms loaded for current annotation file 40 terms out of 674 terms loaded for current annotation file 50 terms out of 674 terms loaded for current annotation file 60 terms out of 674 terms loaded for current annotation file 70 terms out of 674 terms loaded for current annotation file 80 terms out of 674 terms loaded for current annotation file 90 terms out of 674 terms loaded for current annotation file 100 terms out of 674 terms loaded for current annotation file 110 terms out of 674 terms loaded for current annotation file 120 terms out of 674 terms loaded for current annotation file 130 terms out of 674 terms loaded for current annotation file 140 terms out of 674 terms loaded for current annotation file 150 terms out of 674 terms loaded for current annotation file 160 terms out of 674 terms loaded for current annotation file 170 terms out of 674 terms loaded for current annotation file 180 terms out of 674 terms loaded for current annotation file 190 terms out of 674 terms loaded for current annotation file 200 terms out of 674 terms loaded for current annotation file 210 terms out of 674 terms loaded for current annotation file 220 terms out of 674 terms loaded for current annotation file 230 terms out of 674 terms loaded for current annotation file 240 terms out of 674 terms loaded for current annotation file 250 terms out of 674 terms loaded for current annotation file 260 terms out of 674 terms loaded for current annotation file 270 terms out of 674 terms loaded for current annotation file 280 terms out of 674 terms loaded for current annotation file 290 terms out of 674 terms loaded for current annotation file 300 terms out of 674 terms loaded for current annotation file 310 terms out of 674 terms loaded for current annotation file 320 terms out of 674 terms loaded for current annotation file 330 terms out of 674 terms loaded for current annotation file 340 terms out of 674 terms loaded for current annotation file 350 terms out of 674 terms loaded for current annotation file 360 terms out of 674 terms loaded for current annotation file 370 terms out of 674 terms loaded for current annotation file 380 terms out of 674 terms loaded for current annotation file 390 terms out of 674 terms loaded for current annotation file 400 terms out of 674 terms loaded for current annotation file 410 terms out of 674 terms loaded for current annotation file 420 terms out of 674 terms loaded for current annotation file 430 terms out of 674 terms loaded for current annotation file 440 terms out of 674 terms loaded for current annotation file 450 terms out of 674 terms loaded for current annotation file 460 terms out of 674 terms loaded for current annotation file 470 terms out of 674 terms loaded for current annotation file 480 terms out of 674 terms loaded for current annotation file 490 terms out of 674 terms loaded for current annotation file 500 terms out of 674 terms loaded for current annotation file 510 terms out of 674 terms loaded for current annotation file 520 terms out of 674 terms loaded for current annotation file 530 terms out of 674 terms loaded for current annotation file 540 terms out of 674 terms loaded for current annotation file 550 terms out of 674 terms loaded for current annotation file 560 terms out of 674 terms loaded for current annotation file 570 terms out of 674 terms loaded for current annotation file 580 terms out of 674 terms loaded for current annotation file 590 terms out of 674 terms loaded for current annotation file 600 terms out of 674 terms loaded for current annotation file 610 terms out of 674 terms loaded for current annotation file 620 terms out of 674 terms loaded for current annotation file 630 terms out of 674 terms loaded for current annotation file 640 terms out of 674 terms loaded for current annotation file 650 terms out of 674 terms loaded for current annotation file 660 terms out of 674 terms loaded for current annotation file 670 terms out of 674 terms loaded for current annotation file Processed annotation file ../data/c2.cp.reactome.v4.0.symbols.gmt.txt Loaded 182 cells, 6635 genes Annotation: 674 terms
#I: indicator matrix that assigns genes to pathways
I = data['I'] #if loaded from the hdf file change to I = data['IMSigDB']
#Y: log expresison values
Y = data['Y']
#terms: ther names of the terms
terms = data['terms']
#gene_ids: the ids of the genes in Y
gene_ids = data['genes']
#initialize FA instance, here using a Gaussian noise model and fitting 3 dense hidden factors
FA = slalom.initFA(Y, terms,I, gene_ids=gene_ids, noise='gauss',
nHidden=3, nHiddenSparse=0)
#model training
seed(123)
#FA.learnPi=True
FA.train()
#print diagnostics
FA.printDiagnostics()
iteration 0 Switched off factor Trans golgi network v Switched off factor Traf6 mediated induct Switched off factor Costimulation by the Switched off factor Loss of nlp from mito Switched off factor Hiv infection Switched off factor Trif mediated tlr3 si Switched off factor Downstream signaling Switched off factor Unfolded protein resp Switched off factor Golgi associated vesi Switched off factor S phase Switched off factor Hiv life cycle Switched off factor Mrna processing Switched off factor Signaling by erbb4 Switched off factor Signaling by fgfr Switched off factor Il1 signaling Switched off factor Integration of energy Switched off factor Pi3k events in erbb2 Switched off factor Nucleotide binding do Switched off factor Downstream signaling Switched off factor Nfkb and map kinases Switched off factor Nrage signals death t Switched off factor P75 ntr receptor medi Switched off factor Regulation of insulin Switched off factor Circadian clock Switched off factor Myd88 mal cascade ini Switched off factor Tcr signaling Switched off factor Map kinase activation Switched off factor Signaling by erbb2 Switched off factor Activated tlr4 signal Switched off factor Metabolism of rna Switched off factor Recruitment of mitoti Switched off factor Phospholipase c media Switched off factor Cell death signalling Switched off factor Cell cycle checkpoint Switched off factor Insulin receptor sign Switched off factor Mitotic g2 g2 m phase Switched off factor Dna replication Switched off factor Transport of mature t Switched off factor Transport of glucose Switched off factor Mitotic m m g1 phases Switched off factor Mitotic prometaphase Switched off factor Interferon signaling Switched off factor Metabolism of mrna Switched off factor Ngf signalling via tr Switched off factor G alpha1213 signallin Switched off factor Downstream signal tra Switched off factor L1cam interactions Switched off factor Neurotransmitter rece Switched off factor Late phase of hiv lif Switched off factor Mitotic g1 g1 s phase Switched off factor Processing of capped Switched off factor Antiviral mechanism b Switched off factor Glucose metabolism Switched off factor Antigen processing ub Switched off factor Ppara activates gene Switched off factor Signaling by the b ce Switched off factor Apoptotic execution p Switched off factor Signaling by egfr in Switched off factor Signaling by notch1 Switched off factor Signaling by ils Switched off factor Activation of chapero Switched off factor G alpha q signalling Switched off factor G1 s transition Switched off factor Il 3 5 and gm csf sig Switched off factor Pi 3k cascade Switched off factor Gastrin creb signalli Switched off factor Amino acid and oligop Switched off factor Platelet homeostasis Switched off factor Signaling by tgf beta Switched off factor Signaling by pdgf Switched off factor Factors involved in m Switched off factor Heparan sulfate hepar Switched off factor Nod1 2 signaling path Switched off factor Response to elevated Switched off factor Transcription Switched off factor Notch1 intracellular Switched off factor Cell cycle mitotic Switched off factor Opioid signalling Switched off factor Signaling by insulin Switched off factor Rig i mda5 mediated i Switched off factor Transcriptional regul Switched off factor Class i mhc mediated Switched off factor Glycerophospholipid b Switched off factor Glycosaminoglycan met Switched off factor Synthesis of pips at Switched off factor Chromosome maintenanc Switched off factor Mrna splicing Switched off factor Fatty acid triacylgly Switched off factor Semaphorin interactio Switched off factor Signaling by fgfr in Switched off factor Toll receptor cascade Switched off factor Rna pol ii transcript Switched off factor Diabetes pathways Switched off factor Pi metabolism Switched off factor Transmission across c Switched off factor Signaling by scf kit Switched off factor Gpcr ligand binding Switched off factor Cell surface interact Switched off factor G alpha i signalling Switched off factor Class a1 rhodopsin li Switched off factor Asparagine n linked g Switched off factor Signalling by ngf Switched off factor Innate immune system Switched off factor Sphingolipid metaboli Switched off factor Signaling by notch Switched off factor Mhc class ii antigen Switched off factor Cell cell communicati Switched off factor Generic transcription Switched off factor Platelet activation s Switched off factor Meiosis Switched off factor Phospholipid metaboli Switched off factor Post translational pr Switched off factor Slc mediated transmem Switched off factor Pi3k cascade Switched off factor Apoptosis Switched off factor Biological oxidations Switched off factor Signaling by rho gtpa Switched off factor Neuronal system Switched off factor Dna repair Switched off factor Gpcr downstream signa Switched off factor Hemostasis Switched off factor Metabolism of nucleot Switched off factor Rna pol i rna pol iii Switched off factor Cytokine signaling in Switched off factor Axon guidance Switched off factor Transport of inorgani Switched off factor Metabolism of vitamin Switched off factor Adaptive immune syste Switched off factor Developmental biology Switched off factor Metabolism of protein iteration 100 iteration 200 iteration 300 iteration 400 iteration 500 iteration 600 iteration 700 Converged after 701 iterations Maximally 2% Genes per factor changed.
Next, we plot the results, including factor relevance, gene set augmentation and a scatter plot of the two most relevant factors, in this case Cell Cycle and Integrin cell surface interactions. As for the MSigDB annotations, cell cycle is identified as major driver of variability; in addition the process Integrin cell surface interactions is identified as second major difcer of varibility, reflecting differences in self-renewal capabilities (Hayashi, Yohei, et al. "Integrins regulate mouse embryonic stem cell self‐renewal." Stem cells 25.12 (2007): 3005-3015.)
#plot results
fig = plotRelevance(FA)
#scatter plot of the top two factors
fig = plotFactors(FA, lab=data['lab'], terms=['Cell cycle','Integrin cell surface'], isCont=False)
/Users/flo/projects/Auto_Bionf/scLVM2/slalom/utils.py:403: RuntimeWarning: invalid value encountered in true_divide y_max = SP.ceil(max(n_gain)/gap) /Users/flo/projects/Auto_Bionf/scLVM2/slalom/utils.py:404: RuntimeWarning: invalid value encountered in double_scalars y_min = SP.floor(min(n_loss)/gap)
#visualize weights for the Cell cycle factor
fig = plotLoadings(FA, 'Cell cycle', n_genes=20)
Finally, we can save the model results as CSV files or using a .hdf5 file.
#if not os.path.exists(out_dir):
# os.makedirs(out_dir)
#out_file_name = os.path.join(out_dir,FA.getName()+'.hdf5')
#saveFA(FA, out_name=out_file_name