Before we begin, we will import all the python packages that we need.

In [3]:

from collections import defaultdict
from glob import glob, iglob
import os
import re
import sys

import gffutils
import numpy as np
import pandas as pd

How prepare the inputs from our motor neuron differentiation data to make a `flotilla` package¶

This IPython notebook documents how to create a flotilla package using our motor neuron differentiation data. I have attempted to gradually increase in complexity of file and data management.

Gene expression quantification¶

We quantified gene expression using sailfish v0.6.3, using protein-coding and lncRNA transcripts from Gencode v19, and spikeins from both Fluidigm (RNA_Spike_1, RNA_Spike_4, and RNA_Spike_7 from Ambion, sequences below) and ERCC (fastq.gz files). We used a $k$ -mer size of k=31, and the library type "T=PE:O=><:S=SA" for samples with paired-end reads and "T=SE:S=U" for single-end reads.

Performing sailfish quantification is out of the scope of this notebook, and we will only deal with reading in the files.

Ambion RNA-spike in sequences¶

>RNA_Spike_1
gtggagaaagaaatggctcgtctggcagcatttgatatggatggcactttattgatgcccgaccatcatttaggtgagaaaaccctctctactttggcgcGactgcgtgaacgcgacattaccctcacttttgccacggggcgtcatgcgctggagatgcagcatattctcggggcgctatcgctggatgcgtatttgatTaccggcaacggaacgcgcgtgcattctctggaaggtgaacttttacatcgtgatgatttacctgcggatgtcgcggagctggtgctgtatcagcaatggGatacccgagccagcatgcatatcttcaatgacgacggttggtttaccgggaaagagatccctgcgttgttgcaggcatttgtctatagcggttttcgttAtcagataatcgatgtcaaaaaaatgccactcggcagcgtcaccaagatctgcttctgtggcgatcacgacgatcttacacgcttgcagatccagctataCgaagcattaggcgagcgtgcacatttgtgtttttccgccacggattgcctcgaagtgctgccggtgggctgcaataaaGGCGCTGCATTGACGGTGCTGACCCAACATTTAGGTTTATCGTTGCGCGATTGCATGGCCTTTGGTGATGcgatgaacgatcgcgaaatgttagtcagcgtcggtagcggatttattatgggcaatgcgatgccgcaactgcgcgcggagctcccgcatttaccggtgattaaaaaaaaaaaaaaaaaaaaaaaaaa
>RNA_Spike_4
attcatctgcgtggcgaagaggtggcagccgtctcgttgcaatgcgtcgggccggggcattacgtcgcacaaccacaatcagaatacgcattcatgcgtagataacattcaggcggagaataaaatggcaagagctgtacaccgtagtgggttagtggcgctgggcattgcgacagcgttgatggcatcttgtgcattcgctgccaaagatgtggtggtggcggtaggatcgaatttcaccacgctcgatccgtatgacgcaaatgacacgttatctcaggccgtagcgaaatcgttttaccaggggctgttcggtctggataaagagatgaaactgaaaaacgtgctggcggagagttataccgtttccgatgacggcattacttacaccgtgaaattgcgggaaggcattaaattccaggatggcaccgatttcaacgccgcggcggtgaaagcgaatctggACCGGGCCAGCGATCCGGCGAATCATCTTAAACGCTATAACCTGTATAAGAATATTGCTAAAACGGAAGCgatcgatccgacaacggtaaagattaccctcaaacagccgttctcagcgtttattaatattcttgcCcatccggcgaccgcgatgatttcaccggcagcgctggaaaaatatggcaaggagattggtttttatccggtgggaaccggaccgtatgaactggataccTggaatcagaccgattttgtgaaggtgaaaaaattcgcgggttactggcagccaggattgcccaaactggacagcataacctggcgtccggtggcggataAcaacacccgcgcggcaatgctgcaaaccggtgaagcgcagtttgctttccccattccttacgagcaggccacactgctggagaaaaacaaaaatatcgaGttgatggccagtccgtcaattatgcagcgttatatcagtatgaacgtgacgcaaaagccgttcgataacccgaaggtccgtgaggcgctgaattacgccaaaaaaaaaaaaaaaaaaaaaaaaaaa
>RNA_Spike_7
ggcccggaaaccctgcgtcaggtcacccaacatgccgagcacgtcgttaatgcgctgaatacggaagcgaaactgccctgcaaactggtgttgaaaccgctgggcaccacgccggatgaaatcaccgctatttgccgcgacgcgaattacgacgatcgttgcgctggtctggtggtgtggctgcacaccttctccccggccaaaatgtggatcaacggcctgaccatgctcaacaaaccgttgctgcaattccacacccagttcaacgcggcgctgccgtgggacagtatcgatatggactttatgaacctgaaccagactgcacatggcggtcgcgagttcggcttcattggcgcgcgtatgcgtcagcaacatgccgtggttaccggtcactggcaggataaacaagcccatgagcgtatcggctcctggatgcgtcaggcggtctctaaacaggatacccgtcatctgaaagtctgccgatttggcgataacatgcgtgaagtggcggtcaccgatggcgataaagttgccgcacagatcaagttcggtttctccgtcaatacctgggcggttggcgatctggtgcaggtggtgaactccatcagcgacggcgatgttaacgcgctggtcgatgagtacgaaagctgctacaccatgacgcctgccacacaaatccacggcaaaaaacgacagaacgtgctggaagcggcgcgtattgagctggggatgaagcgtttcctggaacaaggtggcttccacgcgttcaccaccacctttgaagatttgcacggtctgaaacagcttcctggtctggccgtacagcgtctgatgcagcagggttacggctttgcgggcgaaggcgactggaaaactgccgccctgcttcgcatcatgaaggtgatgtcaaccggtctgcagggcggcacctcctttatggaggactacacctatcacttcgagaaaggtaatgacctggtgctcggctcccatatgctgg
aagtctgcccgtcgatcgccgcagaagagaaaccgatcctcgacGTTCAGCATCTCGGTATTGGTGGTAAGGACGATCCTGCCCGCCTGATCTTCAATACCCAAACCGGCCCAGcgattgtcgccagcttgattgatctcggcgatcgttaccgtctactggttaactgcatcgacacggtgaaaacaccgcactccctgCcgaaactgccggtggcgaatgcgctgtggaaagcgcaaccggatctgccaactgcttccgaagcgtggatcctcgctggtggcgcgcaccataccgtctTcagccatgcactgaacctcaacgatatgcgccaattcgccgagatgcacgacattgaaatcacggtgattgataacgacacacgcctgccagcgtttaaagacgcgctgcgctggaacgaagtgtattacgggtttcgtcgctaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Example `sailfish quant` command¶

sailfish quant --index /projects/ps-yeolab/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.fa_sailfish_index_k31/ -l "T=PE:O=><:S=SA" -1 M2nd_33_R1_100bp.fastq -2 M2nd_33_R2_100bp.fastq --out sailfish_k31 --threads 8

Reading in the gene expression quantification files¶

We output the gene expression quantification into the folder, /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31. We read in the data using pandas, a Python package for data analysis. We use iglob (iterative glob, more memory efficient) from the Python standard library package glob, to "glob," aka collect all files that follow some pattern. In this case, we want all files that look like this:

/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/*.sailfish/quant_bias_corrected.sf

Where the asterisk, "*" means "any character". We can check this on our filesystem, to see what files actually come up:

In [2]:

ls /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/*.sailfish/quant_bias_corrected.sf

/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_15.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_16.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_17.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_18.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_19.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_20.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_21.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_22.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_23.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_24.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_25.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_26.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_27.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_28.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_29.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_30.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_31.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_32.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_33.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_34.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_35.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_15.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_16.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_17.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_18.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_19.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_20.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_21.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_22.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_23.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_24.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_25.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_26.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_27.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_28.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_29.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_30.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_31.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_32.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_34.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_8.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_8.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_15.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_16.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_17.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_18.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_19.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_20.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_21.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_22.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_23.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_24.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_25.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_26.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_27.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_28.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_29.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_30.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_31.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_32.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_33.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_34.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_35.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_36.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_37.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_8.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_8.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_04.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_05.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_06.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_07.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_08.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_09.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_01.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_02.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_03.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_13.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_14.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_8.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_10.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_11.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_12.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_4.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_5.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_6.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_7.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_9.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_1.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_2.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_3.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_4.sailfish/quant_bias_corrected.sf

Let's look at the top of one of these files using "head":

In [3]:

! head /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_01.sailfish/quant_bias_corrected.sf

# [sailfish version]	0.6.3
# [kmer length]	31
# [using canonical kmers]	false
# [command]	sailfish --index /home/obotvinnik/scratch/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.sailfish_index/transcriptome --counts CVN_01.sailfish/reads.sfc --threads 8 --lutfile /home/obotvinnik/scratch/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.sailfish_index/transcriptome --iterations 1000 --min_abundance 0 --delta 0.005 --out CVN_01.sailfish/quant.sf 
# Transcript	Length	TPM	RPKM	KPKM	EstimatedNumKmers	EstimatedNumReads
ERCC-00013	808	0	0	0	0	0
ERCC-00012	994	0	0	0	0	0
ERCC-00004	523	280.536	164.787	164.787	42793.2	611.331
ERCC-00003	1023	86.7392	50.9509	50.9509	26650.5	380.721
ERCC-00002	1061	1166.62	685.278	685.278	372160	5316.57

It looks like we don't need the first 5 columns of the file, since its mostly informational. Also, it looks like we'll need to supply our own column names because the regular column names don't follow the exact tab-delimited format, as in, "Transcript" is preceeded by "#" on the fifth line. Additionally, we'll want to make sure the first column is assigned as the row names.

Create the gene expression matrix¶

With these things in mind, we are ready create the gene expression matrix. To create a matrix of transcripts per million (TPM) for all samples, we do:

In [4]:

import pandas as pd
from glob import iglob

directory = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31'

tpm_dfs = []
columns = ['transcript', 'length', 'tpm', 'rpkm', 'kpkm', 'EstimatedNumKmers', 'EstimatedNumReads']

for filename in iglob('{}/*.sailfish/quant_bias_corrected.sf'.format(directory)):
    
    # Read "tabluar" data, separated by tabs. 
    # Arguments: 
    # skiprows=5      Skip the first 5 rows
    # names=columns   Use the column names in the list "columns"
    # index_col=0     The first column is the row names (the row names are called the "index" in pandas terms)
    df = pd.read_table(filename, skiprows=5, names=columns, index_col=0)
    
    # Get the "series" (aka single column) of TPM 
    tpm = df.tpm

    # To get the sample ID, split by the folder identifier, "/", 
    # and take the second-to-last item (via "[-2]"), which has the sample
    # id, then split on the period, and take the first item via "[0]"
    sample_id = filename.split('/')[-2].split('.')[0]
    
    # Change the name of the 
    tpm.name = sample_id
    tpm_dfs.append(tpm)
tpm = pd.concat(tpm_dfs, axis=1)

In [5]:

tpm.head()

Out[5]:

	M2nd_06	N5_11	CVN_15	M1_08	M1_11	M6_6	P7_9	M3_6	P4_11	N4_13	...	M2nd_32	P4_6	P7_7	MSA_30	MSA_10	MSA_09	P1_04	P7_12	M4_11	P7_4
transcript
ERCC-00013	0	0	0.00000	0	0.000000	0	0	0	0	0	...	0.000000	0	0	0	0	0.000000	0	0	0	0
ERCC-00012	0	0	0.00000	0	0.040588	0	0	0	0	0	...	0.000000	0	0	0	0	0.000000	0	0	0	0
ERCC-00004	0	0	528.86700	0	0.000000	0	0	0	0	0	...	0.000000	0	0	0	0	0.000000	0	0	0	0
ERCC-00003	0	0	2.20783	0	0.000000	0	0	0	0	0	...	0.000000	0	0	0	0	0.000000	0	0	0	0
ERCC-00002	0	0	1075.16000	0	0.000000	0	0	0	0	0	...	0.171494	0	0	0	0	0.077069	0	0	0	0

5 rows × 257 columns

Great! Now we have a matrix of gene expression. For flotilla, we use the scikit-learn standard of " $\text{samples}\; \times\;\text{genes}$ " matrices, so we want that instead of the instead of the " $\text{genes}\; \times\; \text{samples}$ " we have now. To do so, we need to rotate the matrix, aka "transpose" it with ".T":

In [6]:

tpm = tpm.T

In [7]:

tpm.head()

Out[7]:

transcript	ERCC-00012	ERCC-00004	ERCC-00003	ERCC-00002	ERCC-00019	ERCC-00023	...	ENST00000361739.1\|ENSG00000198712.1\|-\|-\|MT-CO2-201\|MT-CO2\|684\|CDS:1-684\|	ENST00000361899.2\|ENSG00000198899.2\|-\|-\|MT-ATP6-201\|MT-ATP6\|681\|CDS:1-681\|	ENST00000361227.2\|ENSG00000198840.2\|-\|-\|MT-ND3-201\|MT-ND3\|346\|CDS:1-346\|	ENST00000361335.1\|ENSG00000212907.2\|-\|-\|MT-ND4L-201\|MT-ND4L\|297\|CDS:1-297\|	ENST00000361624.2\|ENSG00000198804.2\|-\|-\|MT-CO1-201\|MT-CO1\|1542\|CDS:1-1542\|	ENST00000362079.2\|ENSG00000198938.2\|-\|-\|MT-CO3-201\|MT-CO3\|784\|CDS:1-784\|	ENST00000361381.2\|ENSG00000198886.2\|-\|-\|MT-ND4-201\|MT-ND4\|1378\|CDS:1-1378\|	ENST00000361681.2\|ENSG00000198695.2\|-\|-\|MT-ND6-201\|MT-ND6\|525\|CDS:1-525\|	ENST00000361567.2\|ENSG00000198786.2\|-\|-\|MT-ND5-201\|MT-ND5\|1812\|CDS:1-1812\|	ENST00000361789.2\|ENSG00000198727.2\|-\|-\|MT-CYB-201\|MT-CYB\|1141\|CDS:1-1141\|
M2nd_06	0.000000	0.000	0.00000	0.00	0.046510	0.000000	...	11237.90	4614.73	101.54200	487.545000	15661.800	3977.20	2897.480	1092.210	470.663	7083.11
N5_11	0.000000	0.000	0.00000	0.00	0.312974	0.185788	...	10880.90	3231.51	217.86800	0.250081	521.925	1584.60	285.009	287.382	388.020	2275.80
CVN_15	0.000000	528.867	2.20783	1075.16	138.093000	0.226950	...	9918.57	2950.36	82.84140	852.580000	5966.710	2789.92	3042.450	901.814	1700.220	1519.19
M1_08	0.000000	0.000	0.00000	0.00	0.124148	0.000000	...	4822.03	1825.03	9.81727	898.290000	890.859	1542.99	1597.970	414.629	666.589	2492.60
M1_11	0.040588	0.000	0.00000	0.00	0.061035	0.119161	...	7653.41	1064.44	32.97060	639.129000	4353.380	1919.75	1653.810	241.076	285.977	1500.75

5 rows × 119307 columns

Excellent, now it is rotated.

Separate spike-ins and biological genes¶

Notice how the ERCC spike-ins are around? We don't want them in the gene expression data. From looking at the data, I know that the spike-ins are named "ERCC-###" or "RNA_Spike_#". We will remove the spike-ins using a boolean array, which will tell us whether or not the column name starts with "ENST", which all biological genes here do.

In [8]:

spikein_columns = tpm.columns.map(lambda x: not x.startswith('ENST'))
tpm_spikein = tpm.ix[:, spikein_columns]
tpm_transcripts = tpm.ix[:, ~spikein_columns]

In [9]:

tpm_spikein.head()

Out[9]:

transcript	ERCC-00012	ERCC-00004	ERCC-00003	ERCC-00002	ERCC-00019	ERCC-00023	...	ERCC-00168	ERCC-00171	RNA_Spike_1	RNA_Spike_4	ERCC_vector	RNA_Spike_7
M2nd_06	0.000000	0.000	0.00000	0.00	0.046510	0.000000	...	0.00000	0.000	53249.30	1566.410	0.000000	2236.38000
N5_11	0.000000	0.000	0.00000	0.00	0.312974	0.185788	...	0.08166	0.000	567464.00	21656.700	0.552183	18505.70000
CVN_15	0.000000	528.867	2.20783	1075.16	138.093000	0.226950	...	0.00000	580.993	6097.91	213.196	0.344638	7.44064
M1_08	0.000000	0.000	0.00000	0.00	0.124148	0.000000	...	0.00000	0.000	56437.40	2365.120	0.000000	2431.49000
M1_11	0.040588	0.000	0.00000	0.00	0.061035	0.119161	...	0.00000	0.000	85006.90	3941.410	0.239996	2773.22000

5 rows × 100 columns

In [10]:

tpm_transcripts.head()

Out[10]:

transcript	ENST00000461467.1\|ENSG00000237613.2\|OTTHUMG00000000960.1\|OTTHUMT00000002843.1\|FAM138A-002\|FAM138A\|590\|	ENST00000466430.1\|ENSG00000238009.2\|OTTHUMG00000001096.2\|OTTHUMT00000003225.1\|RP11-34P13.7-001\|RP11-34P13.7\|2748\|	...	ENST00000361739.1\|ENSG00000198712.1\|-\|-\|MT-CO2-201\|MT-CO2\|684\|CDS:1-684\|	ENST00000361899.2\|ENSG00000198899.2\|-\|-\|MT-ATP6-201\|MT-ATP6\|681\|CDS:1-681\|	ENST00000361227.2\|ENSG00000198840.2\|-\|-\|MT-ND3-201\|MT-ND3\|346\|CDS:1-346\|	ENST00000361335.1\|ENSG00000212907.2\|-\|-\|MT-ND4L-201\|MT-ND4L\|297\|CDS:1-297\|	ENST00000361624.2\|ENSG00000198804.2\|-\|-\|MT-CO1-201\|MT-CO1\|1542\|CDS:1-1542\|	ENST00000362079.2\|ENSG00000198938.2\|-\|-\|MT-CO3-201\|MT-CO3\|784\|CDS:1-784\|	ENST00000361381.2\|ENSG00000198886.2\|-\|-\|MT-ND4-201\|MT-ND4\|1378\|CDS:1-1378\|	ENST00000361681.2\|ENSG00000198695.2\|-\|-\|MT-ND6-201\|MT-ND6\|525\|CDS:1-525\|	ENST00000361567.2\|ENSG00000198786.2\|-\|-\|MT-ND5-201\|MT-ND5\|1812\|CDS:1-1812\|	ENST00000361789.2\|ENSG00000198727.2\|-\|-\|MT-CYB-201\|MT-CYB\|1141\|CDS:1-1141\|
M2nd_06	1.510590	0.000000	...	11237.90	4614.73	101.54200	487.545000	15661.800	3977.20	2897.480	1092.210	470.663	7083.11
N5_11	3.535900	0.000000	...	10880.90	3231.51	217.86800	0.250081	521.925	1584.60	285.009	287.382	388.020	2275.80
CVN_15	0.528442	0.000000	...	9918.57	2950.36	82.84140	852.580000	5966.710	2789.92	3042.450	901.814	1700.220	1519.19
M1_08	0.503825	0.000000	...	4822.03	1825.03	9.81727	898.290000	890.859	1542.99	1597.970	414.629	666.589	2492.60
M1_11	0.387460	0.284236	...	7653.41	1064.44	32.97060	639.129000	4353.380	1919.75	1653.810	241.076	285.977	1500.75

5 rows × 119207 columns

Now we have all the transcripts, but for our data, we want the sum of all transcripts for a gene.

Sum all transcripts for a gene¶

In our data, we prefer to deal with ENSEMBL id rather then GENCODE ids, so we will split up the transcript identifier to be only the ENSEMBL id.

In [11]:

ensembl_ids = tpm_transcripts.columns.map(lambda x: x.split('|')[1].split('.')[0])

Now we will set the column names of the tpm_transcripts dataframe to be these ENSEMBL ids.

In [12]:

tpm_transcripts.columns = ensembl_ids

This is where it gets a little magical. We will use pandas' groupby command to take all columns with the same name (i.e. transcripts with the same gene ID), and sum them together in this one amazing command.

In [13]:

tpm_genes = tpm_transcripts.groupby(level=0, axis=1).sum()

Let's compare these dataframes

In [14]:

tpm_transcripts.head()

Out[14]:

	ENSG00000237613	ENSG00000238009	...	ENSG00000198712	ENSG00000198899	ENSG00000198840	ENSG00000212907	ENSG00000198804	ENSG00000198938	ENSG00000198886	ENSG00000198695	ENSG00000198786	ENSG00000198727
M2nd_06	1.510590	0.000000	...	11237.90	4614.73	101.54200	487.545000	15661.800	3977.20	2897.480	1092.210	470.663	7083.11
N5_11	3.535900	0.000000	...	10880.90	3231.51	217.86800	0.250081	521.925	1584.60	285.009	287.382	388.020	2275.80
CVN_15	0.528442	0.000000	...	9918.57	2950.36	82.84140	852.580000	5966.710	2789.92	3042.450	901.814	1700.220	1519.19
M1_08	0.503825	0.000000	...	4822.03	1825.03	9.81727	898.290000	890.859	1542.99	1597.970	414.629	666.589	2492.60
M1_11	0.387460	0.284236	...	7653.41	1064.44	32.97060	639.129000	4353.380	1919.75	1653.810	241.076	285.977	1500.75

5 rows × 119207 columns

In [15]:

tpm_genes.head()

Out[15]:

	ENSG00000000003	ENSG00000000419	ENSG00000000457	ENSG00000000460	ENSG00000000938	ENSG00000001036	ENSG00000001084	ENSG00000001167	...	ENSGR0000214717	ENSGR0000223511	ENSGR0000223571	ENSGR0000226179	ENSGR0000230542	ENSGR0000236017	ENSGR0000236871	ENSGR0000270726
M2nd_06	774.52100	2.707418	41.34380	18.84230	0.00000	2.077373	0.00000	0.000000	...	0.000000	1.466740	3.186130	7.85690	0.589833	4.445910	0.196016	0.00000
N5_11	7.66172	0.000000	3.70781	0.00000	0.00000	0.000000	0.00000	0.000000	...	1.940390	0.440097	0.705532	3.43100	2.220190	0.068773	1.743350	0.00000
CVN_15	91.43570	153.603786	3.49776	4.26198	0.00000	1.058168	0.00000	0.000000	...	2.637697	0.280143	0.294435	4.65545	0.782298	0.000000	0.029957	0.00000
M1_08	125.98200	0.000000	43.97036	29.26900	0.00000	0.380472	5.31102	0.260485	...	84.960700	2.539930	0.390424	6.25367	0.409578	1.356720	0.000000	5.25417
M1_11	2.51551	0.182660	4.81471	0.00000	0.85309	1.951130	3.12804	0.000000	...	1.450610	2.317610	0.259031	5.35454	0.370591	2.623580	4.293600	29.78240

5 rows × 34608 columns

Well, tpm_genes has fewer columns, only 34,608 instead of tpm_transcripts' 119,207. Let's double-check to make sure it worked, using the ENSEMBL id "ENSG00000000419".

In [16]:

ensembl_id = 'ENSG00000000419'
tpm_transcripts[ensembl_id].head()

Out[16]:

	ENSG00000000419	ENSG00000000419	ENSG00000000419
M2nd_06	0.0000	0.586718	2.12070
N5_11	0.0000	0.000000	0.00000
CVN_15	79.6027	0.035286	73.96580
M1_08	0.0000	0.000000	0.00000
M1_11	0.0000	0.000000	0.18266

In [17]:

tpm_genes[ensembl_id].head()

Out[17]:

M2nd_06      2.707418
N5_11        0.000000
CVN_15     153.603786
M1_08        0.000000
M1_11        0.182660
Name: ENSG00000000419, dtype: float64

Yes, for every transcript from the gene ENSG00000000419, we get one number for each sample.

Fantastic! Now we've made our gene expression and spike-in matrices, let's make our mapping stats matrix.

Mapping stats¶

We used RNA-STAR to map our reads. Running RNA-STAR on all our samples is out of the scope of this notebook. We used Gencode v19 to create a genome with a splice junction database. We created the RNA-STAR splice junction database with the command,

STAR --runMode genomeGenerate --genomeDir star_sjdb --genomeFastaFiles /projects/ps-yeolab/genomes/hg19/chromosomes/all.fa  --runThreadN 16 --sjdbGTFfile /projects/ps-yeolab/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf --sjdbOverhang 200

Our genomeParameters.txt file is,

In [18]:

cat /projects/ps-yeolab/genomes/hg19/star_sjdb/genomeParameters.txt

versionGenome	20201
genomeFastaFiles	/projects/ps-yeolab/genomes/hg19/chromosomes/all.fa 
genomeSAindexNbases	14
genomeChrBinNbits	18
genomeSAsparseD	1
sjdbOverhang	200
sjdbFileChrStartEnd	-

An example RNA-STAR command is this,

STAR --runMode alignReads --runThreadN 8 --genomeDir /projects/ps-yeolab/genomes/hg19/star_sjdb --genomeLoad LoadAndRemove --readFilesCommand zcat --readFilesIn M1_01_TAAGGCGA-TATCCTCT_L007_R1.fastq.gz M1_01_TAAGGCGA-TATCCTCT_L007_R2.fastq.gz --outFileNamePrefix ./M1_01. --outReadsUnmapped Fastx --outFilterMismatchNmax 5 --outFilterMismatchNoverLmax 0.3 --outFilterMultimapNmax 5 --outFilterScoreMin 10 --outFilterType BySJout --outSAMattributes All --outSAMstrandField intronMotif --clip5pNbases 0 --clip3pNbases 0

RNA-STAR outputs a final log file which indicates how many reads were input to the mapper, how many were finally mapped, and more. These files look like this:

In [19]:

cat /home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/M1_01.Log.final.out

                                 Started job on |	Apr 29 19:58:18
                             Started mapping on |	Apr 29 19:58:40
                                    Finished on |	Apr 29 20:01:22
       Mapping speed, Million of reads per hour |	274.58

                          Number of input reads |	12356210
                      Average input read length |	184
                                    UNIQUE READS:
                   Uniquely mapped reads number |	10684874
                        Uniquely mapped reads % |	86.47%
                          Average mapped length |	181.36
                       Number of splices: Total |	4629200
            Number of splices: Annotated (sjdb) |	4559773
                       Number of splices: GT/AG |	4594566
                       Number of splices: GC/AG |	23888
                       Number of splices: AT/AC |	5051
               Number of splices: Non-canonical |	5695
                      Mismatch rate per base, % |	0.37%
                         Deletion rate per base |	0.01%
                        Deletion average length |	1.57
                        Insertion rate per base |	0.01%
                       Insertion average length |	1.43
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |	282305
             % of reads mapped to multiple loci |	2.28%
        Number of reads mapped to too many loci |	19107
             % of reads mapped to too many loci |	0.15%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |	0.00%
                 % of reads unmapped: too short |	11.02%
                     % of reads unmapped: other |	0.07%

This is more complicated to parse than the expression data, so I've written a few commands to make this easier.

In [20]:

from glob import glob  # We use "glob" instead of "iglob" here because we want the whole list around
import numpy as np

def make_unique(seq, idfun=None):
    '''
    if an object appears more than once in a list, append a letter to it

    Modified from: http://www.peterbe.com/plog/uniqifiers-benchmark
    '''
    if idfun is None:
        def idfun(x): return x
    seen = {}
    result = []
    for item in seq:
        marker = idfun(item)
        # in old Python versions:
        # if seen.has_key(marker)
        # but in new ones:
        if marker in seen:
            seen[marker] += 1
            result.append(item + string.lowercase[seen[marker] - 2])
            continue
        seen[marker] = 1
        result.append(item)
    return result

def try_converting_to_float(x):
    try:
        return float(x)
    except ValueError:
        return x
    
def fix_duplicate_columns(mapping_stats):
    '''
    Given a dataframe of mapping stats with "duplicate" column names
    (actually duplicate column names aren't allowed in pandas,
    but this assumes you have columns like M1_01 and M1_01a, and the M1_01a
    column was created from the second *.Log.final.out file from RNA-STAR),
    this detects the duplicate columns and sends them to merge_mapping_stats,
    which combines the duplicate colmns into single column in a reasonable way
    '''
    duplicate_columns = mapping_stats.filter(regex='[a-z]$')
    mapping_stats.drop((x[:-1] for x in duplicate_columns), axis=1)
    merged_columns = {}
    for col in duplicate_columns:
        original_col = col[:-1]
        merged_columns[original_col] = merge_mapping_stats(
            mapping_stats.ix[:, col], mapping_stats.ix[:, original_col])

    # Remove duplicately-named columns, e.g. M1_01 and M1_01a
    mapping_stats = mapping_stats.drop((x for x in duplicate_columns), axis=1)
    mapping_stats = mapping_stats.drop((x[:-1] for x in duplicate_columns),
                                       axis=1)

    merged_df = pd.DataFrame(data=merged_columns, index=mapping_stats.index)
    mapping_stats = pd.concat((mapping_stats, merged_df), axis=1)
    return mapping_stats

def log_final_out(glob_command, ids_function):
    """
    Given a glob command describing where all the Log.final.out files are from
    STAR, return a pd.DataFrame with each sample (id) as its own column.

    glob_command : str
        Where to look for the files, will be passed to "glob"
    ids_function : function
        A function (could be an anonymous function like a
        lambda) that specifies how to get the sample ID from the filename. Could
        also be a list of IDs, but they must be in the exact order as in the
        directories, which is why a function can be easier.

    Example:
    >>> glob_command = '/Users/olga/workspace-git/single_cell/analysis/mapping_stats/*.Log.final.out'
    >>> mapping_stats = log_final_out(glob_command,
    ...             lambda x: '_'.join(x.split('/')[-1].split('_')[:2]))
    """
    series = []
    filenames = glob(glob_command)

    if isinstance(ids_function, list):
        ids = ids_function
    else:
        ids = [ids_function(filename) for filename in filenames]

    for filename in filenames:
        s = pd.read_table(filename, header=None, index_col=0, squeeze=True)
        converted = [try_converting_to_float(x.strip('%'))
                     if type(x) != float else x for x in s]
        series.append(pd.Series(converted, index=s.index))

    mapping_stats = pd.concat(series, keys=make_unique(ids), axis=1)
    new_index = [str(x).strip().rstrip(' |') for x in mapping_stats.index]
    mapping_stats.index = new_index
    mapping_stats = mapping_stats.dropna(how='all')
    mapping_stats = fix_duplicate_columns(mapping_stats)

    # Turn all the number of splicing events into percentages for statistical
    # testing
    number_splicing_event_names = ['Number of splices: Annotated (sjdb)',
                                   'Number of splices: GT/AG',
                                   'Number of splices: GC/AG',
                                   'Number of splices: AT/AC',
                                   'Number of splices: Non-canonical']
    percent_splicing_event_names = [x.replace('Number of', '%')
                                    for x in number_splicing_event_names]
    total_splicing_events = mapping_stats.ix['Number of splices: Total',
                            :].replace(0, np.nan).values.astype(float)

    pieces = []
    for num_events in zip(number_splicing_event_names):
        pieces.append(100.0 * mapping_stats.ix[num_events,
                              :].values \
                      / total_splicing_events)
    pieces = [np.reshape(piece, len(mapping_stats.columns)) for piece in pieces]
    percent_splicing = pd.DataFrame(pieces, index=percent_splicing_event_names,
                                    columns=mapping_stats.columns)
    return pd.concat((mapping_stats, percent_splicing))

The files are all in /home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/, so we set our glob command here:

In [21]:

glob_command = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/*.Log.final.out'

To get the ID from the filename, we will use a cute feature of Python which is called "lambda" functions, or "anonymous" functions that do something small.

In [22]:

import os

ids_function = lambda x: os.path.basename(x).split('.')[0]

What we've done here is take the "basename" of the file, meaning the name of the file without all the folders before it:

In [23]:

x = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/M1_01.Log.final.out'
os.path.basename(x)

Out[23]:

'M1_01.Log.final.out'

And then we split on the period and get the first item

In [24]:

os.path.basename(x).split('.')[0]

Out[24]:

'M1_01'

In [25]:

mapping_stats = log_final_out(glob_command, ids_function)

In [26]:

mapping_stats.head()

Out[26]:

	M2_02	M2nd_06	M3_4	M2_03	N5_3	M2_09	P1_11	M2nd_27	P7_12	M3_6	...	MSA_09	M4_10	P4_5	P7_10	M2_10	MSA_05	P7_1	N4_10	P1_07	MSA_37
Started job on	Apr 29 18:26:01	Apr 29 19:34:06	Apr 29 19:59:10	Apr 29 18:43:16	Apr 29 19:00:06	Apr 29 19:28:44	Apr 29 19:33:53	Apr 29 18:40:07	Apr 29 18:59:50	Apr 29 19:38:15	...	Apr 29 18:56:06	Apr 30 00:32:04	Apr 29 18:41:49	Apr 29 19:24:10	Apr 29 18:36:03	Apr 29 19:04:08	Apr 29 18:08:00	Apr 29 19:22:36	Apr 29 19:48:11	Apr 29 18:46:46
Started mapping on	Apr 29 18:26:38	Apr 29 19:35:23	Apr 29 20:00:54	Apr 29 18:43:33	Apr 29 19:00:24	Apr 29 19:29:05	Apr 29 19:34:12	Apr 29 18:41:01	Apr 29 19:00:09	Apr 29 19:38:33	...	Apr 29 18:56:25	Apr 30 00:32:42	Apr 29 18:42:15	Apr 29 19:24:30	Apr 29 18:36:25	Apr 29 19:04:33	Apr 29 18:09:47	Apr 29 19:22:59	Apr 29 19:48:54	Apr 29 18:47:18
Finished on	Apr 29 18:29:59	Apr 29 19:37:07	Apr 29 20:03:33	Apr 29 18:46:39	Apr 29 19:07:51	Apr 29 19:32:06	Apr 29 19:38:07	Apr 29 18:41:41	Apr 29 19:04:10	Apr 29 19:40:48	...	Apr 29 18:58:02	Apr 30 00:41:16	Apr 29 18:56:23	Apr 29 19:28:38	Apr 29 18:40:01	Apr 29 19:06:09	Apr 29 18:27:44	Apr 29 19:38:59	Apr 29 19:52:13	Apr 29 18:48:58
Mapping speed, Million of reads per hour	233.39	214.27	135.81	284.74	149.2	287.58	296.02	126.21	229.51	211.58	...	211.62	207.09	198.79	213.37	307.66	196.75	200.54	158.97	270.91	126.15
Number of input reads	1.303068e+07	6190057	5998270	1.47114e+07	1.852628e+07	1.445902e+07	1.93232e+07	1402360	1.536468e+07	7934410	...	5701984	2.956736e+07	4.68265e+07	1.46987e+07	1.845972e+07	5246777	5.999607e+07	4.239144e+07	1.497556e+07	3504200

5 rows × 256 columns

Again, we want to have a $\text{samples}\;\times\;\text{features}$ matrix, so we will transpose.

In [27]:

mapping_stats = mapping_stats.T

Splicing¶

Now let's read in the splicing data. This gets complicated because of the filtering. We use MISO (v0.5.1) to quantify gene expression on our data, for all available splice types:

Skipped exon (SE)
Retained intron (RI)
Alternative 5' splice site (A5SS)
Alternative 3' splice site (A3SS)
Mutually exclusive exons (MXE)
Alternative first exon (AFE)
Alternative last exon (ALE)
Tandem UTRs (TANDEMUTR)

This is a figure showing what the different splice types look like, from Wang ET, et al. Nature (2008) paper:

Splice types

Again, running MISO is out of the scope of this notebook, and we will focus on reading in the final files.

Here is an example of a script to calculate the "percent spliced-in" (PSI or " $\Psi$ ") scores. I didn't say it was pretty. One note is that we treat all samples as if they are single-end reads, because I believe MISO's paired-end read filters to be too strict (with the mean and standard deviation of the insert size), and then a lot of our data is thrown out.

In [28]:

cat /home/obotvinnik/scratch/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam.miso.sh

#!/bin/bash
# Finding all MISO splicing scores for sample: M1_01. Yay!

READ_LEN=$(samtools view /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam | head -n 1 | cut -f 10 | awk '{ print length }')


# calculate Psi scores for all SE events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/SE_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE
    echo "MISO psi failed on event type: SE"
    exit 1
fi

# Summarize psi scores for all SE events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE

    echo 'MISO psi failed on event type: SE'
    exit 1
fi



# calculate Psi scores for all MXE events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/MXE_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE
    echo "MISO psi failed on event type: MXE"
    exit 1
fi

# Summarize psi scores for all MXE events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE

    echo 'MISO psi failed on event type: MXE'
    exit 1
fi



# calculate Psi scores for all AFE events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/AFE_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE
    echo "MISO psi failed on event type: AFE"
    exit 1
fi

# Summarize psi scores for all AFE events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE

    echo 'MISO psi failed on event type: AFE'
    exit 1
fi



# calculate Psi scores for all ALE events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/ALE_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE
    echo "MISO psi failed on event type: ALE"
    exit 1
fi

# Summarize psi scores for all ALE events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE

    echo 'MISO psi failed on event type: ALE'
    exit 1
fi



# calculate Psi scores for all A3SS events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/A3SS_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS
    echo "MISO psi failed on event type: A3SS"
    exit 1
fi

# Summarize psi scores for all A3SS events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS

    echo 'MISO psi failed on event type: A3SS'
    exit 1
fi



# calculate Psi scores for all A5SS events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/A5SS_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS
    echo "MISO psi failed on event type: A5SS"
    exit 1
fi

# Summarize psi scores for all A5SS events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS

    echo 'MISO psi failed on event type: A5SS'
    exit 1
fi



# calculate Psi scores for all RI events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/RI_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI
    echo "MISO psi failed on event type: RI"
    exit 1
fi

# Summarize psi scores for all RI events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI

    echo 'MISO psi failed on event type: RI'
    exit 1
fi



# calculate Psi scores for all TANDEMUTR events
mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR
python /projects/ps-yeolab/software/bin/miso          --run /projects/ps-yeolab/genomes/hg19/miso/TANDEMUTR_index          /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR          --read-len $READ_LEN          --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt  -p 16  > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.out  2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.err

# Check that the psi calculation jobs didn't fail.
#'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything.
iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.out)
ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.err)
if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR
    echo "MISO psi failed on event type: TANDEMUTR"
    exit 1
fi

# Summarize psi scores for all TANDEMUTR events
python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.err

# Check that the summary jobs didn't fail
# '-s' returns true if file size is nonzero, and the error file should be empty.
if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.err ] ; then
    #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR

    echo 'MISO psi failed on event type: TANDEMUTR'
    exit 1
fi

To read in the summary files that are created, I wrote this function:

In [1]:

def read_miso_summary(filename):
    '''Read a miso summary, with some additional columns

    Reads a MISO summary file as a pandas dataframe, and adds these columns:
    1. a copy-paste-able genome location at the end, based on the minimum
       mRNA_starts and maximum mRNA_ends. (df.genome_location)
    2. The difference between df.ci_high and df.ci_low (df.ci_diff)
    3. The left and right halves of the confidence interval, e.g. the right
       half is df.ci_high - df.miso_posterior_mean. (df.ci_left_half and
       df.ci_right_half)
    4. The max of the two left and right confidence interval halves
       (df.ci_halves_max)
    
    Parameters
    ----------
    filename : str or file buffer
        Name of the miso 
    
    Returns
    -------
    miso_summary : pandas.DataFrame
        A DataFrame of the miso summary for this sample and splice type
    '''
    df = pd.read_table(filename)
    genome_location = pd.DataFrame(
        ['%s:%d-%d' % (chrom, min_csv(starts), max_csv(stops))
         for chrom, starts, stops in zip(df.chrom,
                                         df.mRNA_starts,
                                         df.mRNA_ends)],
        columns=['genome_location'], index=df.index)
    ci_diff = pd.DataFrame(df.ci_high - df.ci_low, columns=['ci_diff'],
                           index=df.index)
    ci_halves = pd.DataFrame(
        {'ci_left_half': (df.ci_high - df.miso_posterior_mean),
         'ci_right_half': (df.miso_posterior_mean - df.ci_low)},
        index=df.index)
    ci_halves_max = pd.DataFrame(ci_halves.max(axis=1),
                                        columns=['ci_halves_max'])
    return pd.concat([df, genome_location, ci_diff, ci_halves,
                      ci_halves_max], axis=1)

def max_csv(x):
    '''
    Of integers separated by commas, take the max
    e.g. 75112684,75112684 would return 75112684
    or 75112684,75112689 would return 75112689
    '''
    return max(map(int, x.split(',')))

def min_csv(x):
    '''
    Of integers separated by commas, take the minimum
    e.g. 75112684,75112684 would return 75112684
    or 75112684,75112689 would return 75112684
    '''
    return min(map(int, x.split(',')))

The miso summaries are in /home/obotvinnik/scratch/mn_diff_singlecell/fastq/miso_v1/SAMPLE_ID/SPLICE_TYPE/summary/SPLICE_TYPE.miso_summary

Some of the AFE and ALE splicing events have the same name, so I will add a prefix of "AFE|"/"ALE|" to the splicing event id ("event_name")

In [4]:

dfs = []

glob_command = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/miso/*/*/summary/*.miso_summary'

afe_ale = ('AFE', 'ALE')

for filename in iglob(glob_command):
    
    # Try to read in the data, unless it doesn't work, in which case,
    # just skip this file and continue on to the next
    try:
        df = read_miso_summary(filename)
    except:
        continue
    
    # Add columns for the sample id and splice type
    sample_id = filename.split('/')[7]
    splice_type = os.path.basename(filename).split('.')[0]
    df['sample_id'] = sample_id
    df['splice_type'] = splice_type
    
    if splice_type in afe_ale:
        df.event_name  = splice_type + '|' + df.event_name 
    
    dfs.append(df)

Let's see what one of these dataframes looks like.

In [5]:

dfs[0].head()

Out[5]:

	event_name	miso_posterior_mean	ci_low	ci_high	isoforms	counts	assigned_counts	chrom	strand	mRNA_starts	mRNA_ends	genome_location	ci_diff	ci_left_half	ci_right_half	ci_halves_max	sample_id	splice_type
0	chr20:17870855:17870999:-@chr20:17870242:17870...	0.98	0.94	1.00	'chr20:17870855:17870999:-@chr20:17870242:1787...	(0,0):81,(1,0):174,(1,1):13	0:187	chr20	-	17922242,17922855	17922999,17922999	chr20:17922242-17922999	0.06	0.02	0.04	0.04	M3_11	TANDEMUTR
1	chr20:32559260:32559399:+@chr20:32559400:32562...	1.00	0.99	1.00	'chr20:32559260:32559399:+@chr20:32559400:3256...	(1,0):704	0:704	chr20	+	33095599,33095599	33099198,33095738	chr20:33095599-33099198	0.01	0.00	0.01	0.01	M3_11	TANDEMUTR
2	chr20:30247080:30247914:-@chr20:30243970:30247...	1.00	0.99	1.00	'chr20:30247080:30247914:-@chr20:30243970:3024...	(0,0):1,(1,0):287	0:287	chr20	-	30780309,30783419	30784253,30784253	chr20:30780309-30784253	0.01	0.00	0.01	0.01	M3_11	TANDEMUTR
3	chr20:48800432:48803519:+@chr20:48803520:48803...	0.12	0.05	0.23	'chr20:48800432:48803519:+@chr20:48803520:4880...	(0,0):22,(1,0):4,(1,1):784	0:27,1:761	chr20	+	49367025,49367025	49370278,49370112	chr20:49367025-49370278	0.18	0.11	0.07	0.11	M3_11	TANDEMUTR
4	chr20:18116103:18116679:+@chr20:18116680:18117...	0.08	0.00	0.27	'chr20:18116103:18116679:+@chr20:18116680:1811...	(0,0):10,(1,1):27	0:0,1:27	chr20	+	18168103,18168103	18169016,18168679	chr20:18168103-18169016	0.27	0.19	0.08	0.19	M3_11	TANDEMUTR

Now let's stack these dataframes on top of each other to create a really "tall" dataframe using concat (concatenate) from pandas.

In [6]:

miso_summary = pd.concat(dfs)

In [7]:

miso_summary.head()

Out[7]:

	event_name	miso_posterior_mean	ci_low	ci_high	isoforms	counts	assigned_counts	chrom	strand	mRNA_starts	mRNA_ends	genome_location	ci_diff	ci_left_half	ci_right_half	ci_halves_max	sample_id	splice_type
0	chr20:17870855:17870999:-@chr20:17870242:17870...	0.98	0.94	1.00	'chr20:17870855:17870999:-@chr20:17870242:1787...	(0,0):81,(1,0):174,(1,1):13	0:187	chr20	-	17922242,17922855	17922999,17922999	chr20:17922242-17922999	0.06	0.02	0.04	0.04	M3_11	TANDEMUTR
1	chr20:32559260:32559399:+@chr20:32559400:32562...	1.00	0.99	1.00	'chr20:32559260:32559399:+@chr20:32559400:3256...	(1,0):704	0:704	chr20	+	33095599,33095599	33099198,33095738	chr20:33095599-33099198	0.01	0.00	0.01	0.01	M3_11	TANDEMUTR
2	chr20:30247080:30247914:-@chr20:30243970:30247...	1.00	0.99	1.00	'chr20:30247080:30247914:-@chr20:30243970:3024...	(0,0):1,(1,0):287	0:287	chr20	-	30780309,30783419	30784253,30784253	chr20:30780309-30784253	0.01	0.00	0.01	0.01	M3_11	TANDEMUTR
3	chr20:48800432:48803519:+@chr20:48803520:48803...	0.12	0.05	0.23	'chr20:48800432:48803519:+@chr20:48803520:4880...	(0,0):22,(1,0):4,(1,1):784	0:27,1:761	chr20	+	49367025,49367025	49370278,49370112	chr20:49367025-49370278	0.18	0.11	0.07	0.11	M3_11	TANDEMUTR
4	chr20:18116103:18116679:+@chr20:18116680:18117...	0.08	0.00	0.27	'chr20:18116103:18116679:+@chr20:18116680:1811...	(0,0):10,(1,1):27	0:0,1:27	chr20	+	18168103,18168103	18169016,18168679	chr20:18168103-18169016	0.27	0.19	0.08	0.19	M3_11	TANDEMUTR

In [8]:

miso_summary.shape

Out[8]:

(4895279, 18)

Whoa! This is a HUGE DataFrame! Almost 5 million rows :)

Filter splicing events¶

Next, we will filter the miso events on the confidence interval, and the number of "junction reads", or the counts of reads that MISO deemed unique to only one isoform (but not both).

In [9]:

import re 
import sys

def counts_pair_to_ints(x):
    """Convert an isoform:counts pair to integers
    
    >>> x = '(1,0):5084,'
    >>> counts_pair_to_ints(x)
    ((1, 0), 5084)
    """
    x = x.split(':')
    isoforms = tuple(map(int, x[0].strip('()').split(',')))
    counts = int(x[1].rstrip(','))
    return isoforms, counts

def counts_col_to_dict(counts):
    """Convert a miso counts column value into a dictionary
    
    >>> counts = '(0,0):12,(1,0):5084,(1,1):2036'
    >>> counts_col_to_dict(counts)
    {(0, 0): 12, (1, 0): 5084, (1, 1): 2036}
    """
    return dict(map(counts_pair_to_ints, re.findall('\(\d,\d\):\d+,?', counts)))

def filter_miso_summary(summary, ci_diff_thresh=0.5,
                        specific_isoform_counts_thresh=5):
    """Filter a MISO summary dataframe created by read_miso_summary on 
    the maximum confidence interval size, and number of 
    "junction reads" (reads that are specific to one isoform)
    
    From the MISO documentation:
        (1,0):28,(0,1):50,(1,1):12, where 0 and 1 correspond to the first and 
    second isoforms. This indicates that 28 reads support isoform 1 but not 
    2, while 50 reads support the second but not the first, and 12 reads 
    support both isoforms. The field (0,0):N, if outputted, indicates the 
    number of reads N that mapped to the region but were thrown out (e.g. 
    because they are not consistent with the isoforms in the annotation, 
    violate overhang constraints, or other related reasons.)
    
    Summarized:
    - (1,0) reads support isoform1 and not isoform2
    - (0,1) reads support isoform2 and not isoform1
    - (1,1) reads support both isoforms
    - (0,0) reads support neither isoform
    
    For "junction reads", we filter on the number of "(1,0)" and "(0,1)" reads.

    Parameters
    ----------
    summary : pandas.DataFrame
        A "tall" dataframe of all samples and splice types

    Returns
    -------
    filtered_summary : pandas.DataFrame
        Only the splicing events passing the filters
    """
    original_events = summary.shape[0]
    summary = summary.ix[summary.ci_diff <= ci_diff_thresh]
    after_ci_events = summary.shape[0]
    isoform_counts = pd.DataFrame.from_dict(dict(zip(summary.index,
                                                     summary.counts.map(
                                                         counts_col_to_dict).values)),
                                            orient='index')

    # Get counts that support only one specific isoform "junction reads"
    specific_isoform_counts = isoform_counts.ix[:, [(0, 1), (1, 0)]].sum(axis=1)

    # Filter on at least 10 "junction reads"
    summary = summary.ix[
        specific_isoform_counts >= specific_isoform_counts_thresh]
    after_counts_events = summary.shape[0]

    # Set the index as just the range now that we've filtered everything out
    summary.index = np.arange(after_counts_events)

    sys.stdout.write(' {} events removed with poor confidence (ci >{:.2f})\n'
                     .format(after_ci_events - original_events, ci_diff_thresh))
    sys.stdout.write(' {} events removed with low read counts which are unique'
                     ' to individual isoforms (n < {})\n'.format(
        after_counts_events - after_ci_events, specific_isoform_counts_thresh))
    return summary

In [10]:

filtered_miso_summary = filter_miso_summary(miso_summary)

 -713830 events removed with poor confidence (ci >0.50)
 -538692 events removed with low read counts which are unique to individual isoforms (n < 5)

Filter splicing on MISO bug¶

MISO, published in 2010, was created for use with reads of length 36 bp. One of the assumptions was that none of the splicing event isoforms will be shorter than the read length. But, there are some cases where, for example in a skipped exon event, the "short" isoform of exon1 and exon3 is shorter than our read length of 100bp. And MISO starts estimating all splicing scores at 0.5 and then has to get convinced one way or another with the reads to push the percent spliced-in ( $\Psi$ ) value towards 0 or 1. So, we get a bunch of splicing events whose $\Psi$ score is about 0.5, with very high confidence (the confidence interval is tiny [<0.1] because MISO was never convinced to 0 or 1, and just bounced around 0.5)

To filter for these events, we need the lengths of each isoforms, which we will calculate using gffutils