Before we begin, we will import all the python packages that we need.
from collections import defaultdict
from glob import glob, iglob
import os
import re
import sys
import gffutils
import numpy as np
import pandas as pd
flotilla
package¶This IPython notebook documents how to create a flotilla
package using our motor neuron differentiation data. I have attempted to gradually increase in complexity of file and data management.
We quantified gene expression using sailfish v0.6.3
, using protein-coding and lncRNA transcripts from Gencode v19, and spikeins from both Fluidigm (RNA_Spike_1
, RNA_Spike_4
, and RNA_Spike_7
from Ambion, sequences below) and ERCC (fastq.gz
files). We used a k-mer size of k=31
, and the library type "T=PE:O=><:S=SA"
for samples with paired-end reads and "T=SE:S=U"
for single-end reads.
Performing sailfish
quantification is out of the scope of this notebook, and we will only deal with reading in the files.
>RNA_Spike_1
gtggagaaagaaatggctcgtctggcagcatttgatatggatggcactttattgatgcccgaccatcatttaggtgagaaaaccctctctactttggcgcGactgcgtgaacgcgacattaccctcacttttgccacggggcgtcatgcgctggagatgcagcatattctcggggcgctatcgctggatgcgtatttgatTaccggcaacggaacgcgcgtgcattctctggaaggtgaacttttacatcgtgatgatttacctgcggatgtcgcggagctggtgctgtatcagcaatggGatacccgagccagcatgcatatcttcaatgacgacggttggtttaccgggaaagagatccctgcgttgttgcaggcatttgtctatagcggttttcgttAtcagataatcgatgtcaaaaaaatgccactcggcagcgtcaccaagatctgcttctgtggcgatcacgacgatcttacacgcttgcagatccagctataCgaagcattaggcgagcgtgcacatttgtgtttttccgccacggattgcctcgaagtgctgccggtgggctgcaataaaGGCGCTGCATTGACGGTGCTGACCCAACATTTAGGTTTATCGTTGCGCGATTGCATGGCCTTTGGTGATGcgatgaacgatcgcgaaatgttagtcagcgtcggtagcggatttattatgggcaatgcgatgccgcaactgcgcgcggagctcccgcatttaccggtgattaaaaaaaaaaaaaaaaaaaaaaaaaa
>RNA_Spike_4
attcatctgcgtggcgaagaggtggcagccgtctcgttgcaatgcgtcgggccggggcattacgtcgcacaaccacaatcagaatacgcattcatgcgtagataacattcaggcggagaataaaatggcaagagctgtacaccgtagtgggttagtggcgctgggcattgcgacagcgttgatggcatcttgtgcattcgctgccaaagatgtggtggtggcggtaggatcgaatttcaccacgctcgatccgtatgacgcaaatgacacgttatctcaggccgtagcgaaatcgttttaccaggggctgttcggtctggataaagagatgaaactgaaaaacgtgctggcggagagttataccgtttccgatgacggcattacttacaccgtgaaattgcgggaaggcattaaattccaggatggcaccgatttcaacgccgcggcggtgaaagcgaatctggACCGGGCCAGCGATCCGGCGAATCATCTTAAACGCTATAACCTGTATAAGAATATTGCTAAAACGGAAGCgatcgatccgacaacggtaaagattaccctcaaacagccgttctcagcgtttattaatattcttgcCcatccggcgaccgcgatgatttcaccggcagcgctggaaaaatatggcaaggagattggtttttatccggtgggaaccggaccgtatgaactggataccTggaatcagaccgattttgtgaaggtgaaaaaattcgcgggttactggcagccaggattgcccaaactggacagcataacctggcgtccggtggcggataAcaacacccgcgcggcaatgctgcaaaccggtgaagcgcagtttgctttccccattccttacgagcaggccacactgctggagaaaaacaaaaatatcgaGttgatggccagtccgtcaattatgcagcgttatatcagtatgaacgtgacgcaaaagccgttcgataacccgaaggtccgtgaggcgctgaattacgccaaaaaaaaaaaaaaaaaaaaaaaaaaa
>RNA_Spike_7
ggcccggaaaccctgcgtcaggtcacccaacatgccgagcacgtcgttaatgcgctgaatacggaagcgaaactgccctgcaaactggtgttgaaaccgctgggcaccacgccggatgaaatcaccgctatttgccgcgacgcgaattacgacgatcgttgcgctggtctggtggtgtggctgcacaccttctccccggccaaaatgtggatcaacggcctgaccatgctcaacaaaccgttgctgcaattccacacccagttcaacgcggcgctgccgtgggacagtatcgatatggactttatgaacctgaaccagactgcacatggcggtcgcgagttcggcttcattggcgcgcgtatgcgtcagcaacatgccgtggttaccggtcactggcaggataaacaagcccatgagcgtatcggctcctggatgcgtcaggcggtctctaaacaggatacccgtcatctgaaagtctgccgatttggcgataacatgcgtgaagtggcggtcaccgatggcgataaagttgccgcacagatcaagttcggtttctccgtcaatacctgggcggttggcgatctggtgcaggtggtgaactccatcagcgacggcgatgttaacgcgctggtcgatgagtacgaaagctgctacaccatgacgcctgccacacaaatccacggcaaaaaacgacagaacgtgctggaagcggcgcgtattgagctggggatgaagcgtttcctggaacaaggtggcttccacgcgttcaccaccacctttgaagatttgcacggtctgaaacagcttcctggtctggccgtacagcgtctgatgcagcagggttacggctttgcgggcgaaggcgactggaaaactgccgccctgcttcgcatcatgaaggtgatgtcaaccggtctgcagggcggcacctcctttatggaggactacacctatcacttcgagaaaggtaatgacctggtgctcggctcccatatgctgg
aagtctgcccgtcgatcgccgcagaagagaaaccgatcctcgacGTTCAGCATCTCGGTATTGGTGGTAAGGACGATCCTGCCCGCCTGATCTTCAATACCCAAACCGGCCCAGcgattgtcgccagcttgattgatctcggcgatcgttaccgtctactggttaactgcatcgacacggtgaaaacaccgcactccctgCcgaaactgccggtggcgaatgcgctgtggaaagcgcaaccggatctgccaactgcttccgaagcgtggatcctcgctggtggcgcgcaccataccgtctTcagccatgcactgaacctcaacgatatgcgccaattcgccgagatgcacgacattgaaatcacggtgattgataacgacacacgcctgccagcgtttaaagacgcgctgcgctggaacgaagtgtattacgggtttcgtcgctaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
sailfish quant
command¶sailfish quant --index /projects/ps-yeolab/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.fa_sailfish_index_k31/ -l "T=PE:O=><:S=SA" -1 M2nd_33_R1_100bp.fastq -2 M2nd_33_R2_100bp.fastq --out sailfish_k31 --threads 8
We output the gene expression quantification into the folder, /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31
. We read in the data using pandas
, a Python package for data analysis. We use iglob
(iterative glob, more memory efficient) from the Python standard library package glob
, to "glob," aka collect all files that follow some pattern. In this case, we want all files that look like this:
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/*.sailfish/quant_bias_corrected.sf
Where the asterisk, "*
" means "any character". We can check this on our filesystem, to see what files actually come up:
ls /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/*.sailfish/quant_bias_corrected.sf
/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_15.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_16.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_17.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_18.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_19.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_20.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_21.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_22.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_23.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_24.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_25.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_26.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_27.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_28.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_29.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_30.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_31.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_32.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_33.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_34.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_35.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M1_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_15.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_16.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_17.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_18.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_19.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_20.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_21.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_22.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_23.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_24.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_25.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_26.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_27.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_28.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_29.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_30.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_31.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_32.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M2nd_34.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_8.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M3_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_8.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M4_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/M6_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_15.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_16.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_17.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_18.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_19.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_20.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_21.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_22.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_23.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_24.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_25.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_26.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_27.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_28.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_29.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_30.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_31.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_32.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_33.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_34.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_35.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_36.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/MSA_37.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_8.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N4_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_8.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/N5_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P1_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_04.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_05.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_06.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_07.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_08.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_09.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P2_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_01.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_02.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P3_03.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_13.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_14.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_8.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P4_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P7_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_10.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_11.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_12.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_4.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_5.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_6.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_7.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P8_9.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_1.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_2.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_3.sailfish/quant_bias_corrected.sf /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/P9_4.sailfish/quant_bias_corrected.sf
Let's look at the top of one of these files using "head
":
! head /home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31/CVN_01.sailfish/quant_bias_corrected.sf
# [sailfish version] 0.6.3 # [kmer length] 31 # [using canonical kmers] false # [command] sailfish --index /home/obotvinnik/scratch/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.sailfish_index/transcriptome --counts CVN_01.sailfish/reads.sfc --threads 8 --lutfile /home/obotvinnik/scratch/genomes/hg19/sailfish/gencode.v19.pc_lncRNA_transcripts.ercc_fluidigm_spikein.sailfish_index/transcriptome --iterations 1000 --min_abundance 0 --delta 0.005 --out CVN_01.sailfish/quant.sf # Transcript Length TPM RPKM KPKM EstimatedNumKmers EstimatedNumReads ERCC-00013 808 0 0 0 0 0 ERCC-00012 994 0 0 0 0 0 ERCC-00004 523 280.536 164.787 164.787 42793.2 611.331 ERCC-00003 1023 86.7392 50.9509 50.9509 26650.5 380.721 ERCC-00002 1061 1166.62 685.278 685.278 372160 5316.57
It looks like we don't need the first 5 columns of the file, since its mostly informational. Also, it looks like we'll need to supply our own column names because the regular column names don't follow the exact tab-delimited format, as in, "Transcript
" is preceeded by "#
" on the fifth line. Additionally, we'll want to make sure the first column is assigned as the row names.
With these things in mind, we are ready create the gene expression matrix. To create a matrix of transcripts per million (TPM) for all samples, we do:
import pandas as pd
from glob import iglob
directory = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/sailfish_k31'
tpm_dfs = []
columns = ['transcript', 'length', 'tpm', 'rpkm', 'kpkm', 'EstimatedNumKmers', 'EstimatedNumReads']
for filename in iglob('{}/*.sailfish/quant_bias_corrected.sf'.format(directory)):
# Read "tabluar" data, separated by tabs.
# Arguments:
# skiprows=5 Skip the first 5 rows
# names=columns Use the column names in the list "columns"
# index_col=0 The first column is the row names (the row names are called the "index" in pandas terms)
df = pd.read_table(filename, skiprows=5, names=columns, index_col=0)
# Get the "series" (aka single column) of TPM
tpm = df.tpm
# To get the sample ID, split by the folder identifier, "/",
# and take the second-to-last item (via "[-2]"), which has the sample
# id, then split on the period, and take the first item via "[0]"
sample_id = filename.split('/')[-2].split('.')[0]
# Change the name of the
tpm.name = sample_id
tpm_dfs.append(tpm)
tpm = pd.concat(tpm_dfs, axis=1)
tpm.head()
M2nd_06 | N5_11 | CVN_15 | M1_08 | M1_11 | M6_6 | P7_9 | M3_6 | P4_11 | N4_13 | ... | M2nd_32 | P4_6 | P7_7 | MSA_30 | MSA_10 | MSA_09 | P1_04 | P7_12 | M4_11 | P7_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
transcript | |||||||||||||||||||||
ERCC-00013 | 0 | 0 | 0.00000 | 0 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0.000000 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0 | 0 |
ERCC-00012 | 0 | 0 | 0.00000 | 0 | 0.040588 | 0 | 0 | 0 | 0 | 0 | ... | 0.000000 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0 | 0 |
ERCC-00004 | 0 | 0 | 528.86700 | 0 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0.000000 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0 | 0 |
ERCC-00003 | 0 | 0 | 2.20783 | 0 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0.000000 | 0 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0 | 0 |
ERCC-00002 | 0 | 0 | 1075.16000 | 0 | 0.000000 | 0 | 0 | 0 | 0 | 0 | ... | 0.171494 | 0 | 0 | 0 | 0 | 0.077069 | 0 | 0 | 0 | 0 |
5 rows × 257 columns
Great! Now we have a matrix of gene expression. For flotilla
, we use the scikit-learn
standard of "samples×genes" matrices, so we want that instead of the instead of the "genes×samples" we have now. To do so, we need to rotate the matrix, aka "transpose" it with ".T
":
tpm = tpm.T
tpm.head()
transcript | ERCC-00013 | ERCC-00012 | ERCC-00004 | ERCC-00003 | ERCC-00002 | ERCC-00016 | ERCC-00017 | ERCC-00019 | ERCC-00018 | ERCC-00023 | ... | ENST00000361739.1|ENSG00000198712.1|-|-|MT-CO2-201|MT-CO2|684|CDS:1-684| | ENST00000361899.2|ENSG00000198899.2|-|-|MT-ATP6-201|MT-ATP6|681|CDS:1-681| | ENST00000361227.2|ENSG00000198840.2|-|-|MT-ND3-201|MT-ND3|346|CDS:1-346| | ENST00000361335.1|ENSG00000212907.2|-|-|MT-ND4L-201|MT-ND4L|297|CDS:1-297| | ENST00000361624.2|ENSG00000198804.2|-|-|MT-CO1-201|MT-CO1|1542|CDS:1-1542| | ENST00000362079.2|ENSG00000198938.2|-|-|MT-CO3-201|MT-CO3|784|CDS:1-784| | ENST00000361381.2|ENSG00000198886.2|-|-|MT-ND4-201|MT-ND4|1378|CDS:1-1378| | ENST00000361681.2|ENSG00000198695.2|-|-|MT-ND6-201|MT-ND6|525|CDS:1-525| | ENST00000361567.2|ENSG00000198786.2|-|-|MT-ND5-201|MT-ND5|1812|CDS:1-1812| | ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141| |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M2nd_06 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.046510 | 0 | 0.000000 | ... | 11237.90 | 4614.73 | 101.54200 | 487.545000 | 15661.800 | 3977.20 | 2897.480 | 1092.210 | 470.663 | 7083.11 |
N5_11 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.312974 | 0 | 0.185788 | ... | 10880.90 | 3231.51 | 217.86800 | 0.250081 | 521.925 | 1584.60 | 285.009 | 287.382 | 388.020 | 2275.80 |
CVN_15 | 0 | 0.000000 | 528.867 | 2.20783 | 1075.16 | 0 | 0 | 138.093000 | 0 | 0.226950 | ... | 9918.57 | 2950.36 | 82.84140 | 852.580000 | 5966.710 | 2789.92 | 3042.450 | 901.814 | 1700.220 | 1519.19 |
M1_08 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.124148 | 0 | 0.000000 | ... | 4822.03 | 1825.03 | 9.81727 | 898.290000 | 890.859 | 1542.99 | 1597.970 | 414.629 | 666.589 | 2492.60 |
M1_11 | 0 | 0.040588 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.061035 | 0 | 0.119161 | ... | 7653.41 | 1064.44 | 32.97060 | 639.129000 | 4353.380 | 1919.75 | 1653.810 | 241.076 | 285.977 | 1500.75 |
5 rows × 119307 columns
Excellent, now it is rotated.
Notice how the ERCC spike-ins are around? We don't want them in the gene expression data. From looking at the data, I know that the spike-ins are named "ERCC-###
" or "RNA_Spike_#
". We will remove the spike-ins using a boolean array, which will tell us whether or not the column name starts with "ENST", which all biological genes here do.
spikein_columns = tpm.columns.map(lambda x: not x.startswith('ENST'))
tpm_spikein = tpm.ix[:, spikein_columns]
tpm_transcripts = tpm.ix[:, ~spikein_columns]
tpm_spikein.head()
transcript | ERCC-00013 | ERCC-00012 | ERCC-00004 | ERCC-00003 | ERCC-00002 | ERCC-00016 | ERCC-00017 | ERCC-00019 | ERCC-00018 | ERCC-00023 | ... | ERCC-00163 | ERCC-00164 | ERCC-00165 | ERCC-00168 | ERCC-00170 | ERCC-00171 | RNA_Spike_1 | RNA_Spike_4 | ERCC_vector | RNA_Spike_7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M2nd_06 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.046510 | 0 | 0.000000 | ... | 0 | 0 | 0 | 0.00000 | 0 | 0.000 | 53249.30 | 1566.410 | 0.000000 | 2236.38000 |
N5_11 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.312974 | 0 | 0.185788 | ... | 0 | 0 | 0 | 0.08166 | 0 | 0.000 | 567464.00 | 21656.700 | 0.552183 | 18505.70000 |
CVN_15 | 0 | 0.000000 | 528.867 | 2.20783 | 1075.16 | 0 | 0 | 138.093000 | 0 | 0.226950 | ... | 0 | 0 | 0 | 0.00000 | 0 | 580.993 | 6097.91 | 213.196 | 0.344638 | 7.44064 |
M1_08 | 0 | 0.000000 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.124148 | 0 | 0.000000 | ... | 0 | 0 | 0 | 0.00000 | 0 | 0.000 | 56437.40 | 2365.120 | 0.000000 | 2431.49000 |
M1_11 | 0 | 0.040588 | 0.000 | 0.00000 | 0.00 | 0 | 0 | 0.061035 | 0 | 0.119161 | ... | 0 | 0 | 0 | 0.00000 | 0 | 0.000 | 85006.90 | 3941.410 | 0.239996 | 2773.22000 |
5 rows × 100 columns
tpm_transcripts.head()
transcript | ENST00000607096.1|ENSG00000243485.2|OTTHUMG00000000959.2|-|MIR1302-11-201|MIR1302-11|138| | ENST00000473358.1|ENSG00000243485.2|OTTHUMG00000000959.2|OTTHUMT00000002840.1|MIR1302-11-001|MIR1302-11|712| | ENST00000469289.1|ENSG00000243485.2|OTTHUMG00000000959.2|OTTHUMT00000002841.2|MIR1302-11-002|MIR1302-11|535| | ENST00000461467.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002843.1|FAM138A-002|FAM138A|590| | ENST00000417324.1|ENSG00000237613.2|OTTHUMG00000000960.1|OTTHUMT00000002842.1|FAM138A-001|FAM138A|1187| | ENST00000477740.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003688.1|RP11-34P13.7-003|RP11-34P13.7|491| | ENST00000471248.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003687.1|RP11-34P13.7-002|RP11-34P13.7|629| | ENST00000453576.2|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003689.1|RP11-34P13.7-004|RP11-34P13.7|336| | ENST00000466430.1|ENSG00000238009.2|OTTHUMG00000001096.2|OTTHUMT00000003225.1|RP11-34P13.7-001|RP11-34P13.7|2748| | ENST00000493797.1|ENSG00000239906.1|OTTHUMG00000002481.1|OTTHUMT00000007038.1|RP11-34P13.14-001|RP11-34P13.14|323| | ... | ENST00000361739.1|ENSG00000198712.1|-|-|MT-CO2-201|MT-CO2|684|CDS:1-684| | ENST00000361899.2|ENSG00000198899.2|-|-|MT-ATP6-201|MT-ATP6|681|CDS:1-681| | ENST00000361227.2|ENSG00000198840.2|-|-|MT-ND3-201|MT-ND3|346|CDS:1-346| | ENST00000361335.1|ENSG00000212907.2|-|-|MT-ND4L-201|MT-ND4L|297|CDS:1-297| | ENST00000361624.2|ENSG00000198804.2|-|-|MT-CO1-201|MT-CO1|1542|CDS:1-1542| | ENST00000362079.2|ENSG00000198938.2|-|-|MT-CO3-201|MT-CO3|784|CDS:1-784| | ENST00000361381.2|ENSG00000198886.2|-|-|MT-ND4-201|MT-ND4|1378|CDS:1-1378| | ENST00000361681.2|ENSG00000198695.2|-|-|MT-ND6-201|MT-ND6|525|CDS:1-525| | ENST00000361567.2|ENSG00000198786.2|-|-|MT-ND5-201|MT-ND5|1812|CDS:1-1812| | ENST00000361789.2|ENSG00000198727.2|-|-|MT-CYB-201|MT-CYB|1141|CDS:1-1141| |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M2nd_06 | 0 | 0 | 0 | 1.510590 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 11237.90 | 4614.73 | 101.54200 | 487.545000 | 15661.800 | 3977.20 | 2897.480 | 1092.210 | 470.663 | 7083.11 |
N5_11 | 0 | 0 | 0 | 3.535900 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 10880.90 | 3231.51 | 217.86800 | 0.250081 | 521.925 | 1584.60 | 285.009 | 287.382 | 388.020 | 2275.80 |
CVN_15 | 0 | 0 | 0 | 0.528442 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 9918.57 | 2950.36 | 82.84140 | 852.580000 | 5966.710 | 2789.92 | 3042.450 | 901.814 | 1700.220 | 1519.19 |
M1_08 | 0 | 0 | 0 | 0.503825 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 4822.03 | 1825.03 | 9.81727 | 898.290000 | 890.859 | 1542.99 | 1597.970 | 414.629 | 666.589 | 2492.60 |
M1_11 | 0 | 0 | 0 | 0.387460 | 0 | 0 | 0 | 0 | 0.284236 | 0 | ... | 7653.41 | 1064.44 | 32.97060 | 639.129000 | 4353.380 | 1919.75 | 1653.810 | 241.076 | 285.977 | 1500.75 |
5 rows × 119207 columns
ensembl_ids = tpm_transcripts.columns.map(lambda x: x.split('|')[1].split('.')[0])
Now we will set the column names of the tpm_transcripts
dataframe to be these ENSEMBL ids.
tpm_transcripts.columns = ensembl_ids
This is where it gets a little magical. We will use pandas
' groupby
command to take all columns with the same name (i.e. transcripts with the same gene ID), and sum them together in this one amazing command.
tpm_genes = tpm_transcripts.groupby(level=0, axis=1).sum()
Let's compare these dataframes
tpm_transcripts.head()
ENSG00000243485 | ENSG00000243485 | ENSG00000243485 | ENSG00000237613 | ENSG00000237613 | ENSG00000238009 | ENSG00000238009 | ENSG00000238009 | ENSG00000238009 | ENSG00000239906 | ... | ENSG00000198712 | ENSG00000198899 | ENSG00000198840 | ENSG00000212907 | ENSG00000198804 | ENSG00000198938 | ENSG00000198886 | ENSG00000198695 | ENSG00000198786 | ENSG00000198727 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M2nd_06 | 0 | 0 | 0 | 1.510590 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 11237.90 | 4614.73 | 101.54200 | 487.545000 | 15661.800 | 3977.20 | 2897.480 | 1092.210 | 470.663 | 7083.11 |
N5_11 | 0 | 0 | 0 | 3.535900 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 10880.90 | 3231.51 | 217.86800 | 0.250081 | 521.925 | 1584.60 | 285.009 | 287.382 | 388.020 | 2275.80 |
CVN_15 | 0 | 0 | 0 | 0.528442 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 9918.57 | 2950.36 | 82.84140 | 852.580000 | 5966.710 | 2789.92 | 3042.450 | 901.814 | 1700.220 | 1519.19 |
M1_08 | 0 | 0 | 0 | 0.503825 | 0 | 0 | 0 | 0 | 0.000000 | 0 | ... | 4822.03 | 1825.03 | 9.81727 | 898.290000 | 890.859 | 1542.99 | 1597.970 | 414.629 | 666.589 | 2492.60 |
M1_11 | 0 | 0 | 0 | 0.387460 | 0 | 0 | 0 | 0 | 0.284236 | 0 | ... | 7653.41 | 1064.44 | 32.97060 | 639.129000 | 4353.380 | 1919.75 | 1653.810 | 241.076 | 285.977 | 1500.75 |
5 rows × 119207 columns
tpm_genes.head()
ENSG00000000003 | ENSG00000000005 | ENSG00000000419 | ENSG00000000457 | ENSG00000000460 | ENSG00000000938 | ENSG00000000971 | ENSG00000001036 | ENSG00000001084 | ENSG00000001167 | ... | ENSGR0000214717 | ENSGR0000223511 | ENSGR0000223571 | ENSGR0000226179 | ENSGR0000230542 | ENSGR0000234622 | ENSGR0000236017 | ENSGR0000236871 | ENSGR0000237531 | ENSGR0000270726 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
M2nd_06 | 774.52100 | 0 | 2.707418 | 41.34380 | 18.84230 | 0.00000 | 0 | 2.077373 | 0.00000 | 0.000000 | ... | 0.000000 | 1.466740 | 3.186130 | 7.85690 | 0.589833 | 0 | 4.445910 | 0.196016 | 0 | 0.00000 |
N5_11 | 7.66172 | 0 | 0.000000 | 3.70781 | 0.00000 | 0.00000 | 0 | 0.000000 | 0.00000 | 0.000000 | ... | 1.940390 | 0.440097 | 0.705532 | 3.43100 | 2.220190 | 0 | 0.068773 | 1.743350 | 0 | 0.00000 |
CVN_15 | 91.43570 | 0 | 153.603786 | 3.49776 | 4.26198 | 0.00000 | 0 | 1.058168 | 0.00000 | 0.000000 | ... | 2.637697 | 0.280143 | 0.294435 | 4.65545 | 0.782298 | 0 | 0.000000 | 0.029957 | 0 | 0.00000 |
M1_08 | 125.98200 | 0 | 0.000000 | 43.97036 | 29.26900 | 0.00000 | 0 | 0.380472 | 5.31102 | 0.260485 | ... | 84.960700 | 2.539930 | 0.390424 | 6.25367 | 0.409578 | 0 | 1.356720 | 0.000000 | 0 | 5.25417 |
M1_11 | 2.51551 | 0 | 0.182660 | 4.81471 | 0.00000 | 0.85309 | 0 | 1.951130 | 3.12804 | 0.000000 | ... | 1.450610 | 2.317610 | 0.259031 | 5.35454 | 0.370591 | 0 | 2.623580 | 4.293600 | 0 | 29.78240 |
5 rows × 34608 columns
Well, tpm_genes
has fewer columns, only 34,608 instead of tpm_transcripts
' 119,207. Let's double-check to make sure it worked, using the ENSEMBL id "ENSG00000000419".
ensembl_id = 'ENSG00000000419'
tpm_transcripts[ensembl_id].head()
ENSG00000000419 | ENSG00000000419 | ENSG00000000419 | ENSG00000000419 | ENSG00000000419 | |
---|---|---|---|---|---|
M2nd_06 | 0.0000 | 0 | 0.586718 | 2.12070 | 0 |
N5_11 | 0.0000 | 0 | 0.000000 | 0.00000 | 0 |
CVN_15 | 79.6027 | 0 | 0.035286 | 73.96580 | 0 |
M1_08 | 0.0000 | 0 | 0.000000 | 0.00000 | 0 |
M1_11 | 0.0000 | 0 | 0.000000 | 0.18266 | 0 |
tpm_genes[ensembl_id].head()
M2nd_06 2.707418 N5_11 0.000000 CVN_15 153.603786 M1_08 0.000000 M1_11 0.182660 Name: ENSG00000000419, dtype: float64
Yes, for every transcript from the gene ENSG00000000419, we get one number for each sample.
Fantastic! Now we've made our gene expression and spike-in matrices, let's make our mapping stats matrix.
We used RNA-STAR to map our reads. Running RNA-STAR on all our samples is out of the scope of this notebook. We used Gencode v19 to create a genome with a splice junction database. We created the RNA-STAR splice junction database with the command,
STAR --runMode genomeGenerate --genomeDir star_sjdb --genomeFastaFiles /projects/ps-yeolab/genomes/hg19/chromosomes/all.fa --runThreadN 16 --sjdbGTFfile /projects/ps-yeolab/genomes/hg19/gencode/v19/gencode.v19.annotation.gtf --sjdbOverhang 200
Our genomeParameters.txt
file is,
cat /projects/ps-yeolab/genomes/hg19/star_sjdb/genomeParameters.txt
versionGenome 20201 genomeFastaFiles /projects/ps-yeolab/genomes/hg19/chromosomes/all.fa genomeSAindexNbases 14 genomeChrBinNbits 18 genomeSAsparseD 1 sjdbOverhang 200 sjdbFileChrStartEnd -
An example RNA-STAR command is this,
STAR --runMode alignReads --runThreadN 8 --genomeDir /projects/ps-yeolab/genomes/hg19/star_sjdb --genomeLoad LoadAndRemove --readFilesCommand zcat --readFilesIn M1_01_TAAGGCGA-TATCCTCT_L007_R1.fastq.gz M1_01_TAAGGCGA-TATCCTCT_L007_R2.fastq.gz --outFileNamePrefix ./M1_01. --outReadsUnmapped Fastx --outFilterMismatchNmax 5 --outFilterMismatchNoverLmax 0.3 --outFilterMultimapNmax 5 --outFilterScoreMin 10 --outFilterType BySJout --outSAMattributes All --outSAMstrandField intronMotif --clip5pNbases 0 --clip3pNbases 0
RNA-STAR outputs a final log file which indicates how many reads were input to the mapper, how many were finally mapped, and more. These files look like this:
cat /home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/M1_01.Log.final.out
Started job on | Apr 29 19:58:18 Started mapping on | Apr 29 19:58:40 Finished on | Apr 29 20:01:22 Mapping speed, Million of reads per hour | 274.58 Number of input reads | 12356210 Average input read length | 184 UNIQUE READS: Uniquely mapped reads number | 10684874 Uniquely mapped reads % | 86.47% Average mapped length | 181.36 Number of splices: Total | 4629200 Number of splices: Annotated (sjdb) | 4559773 Number of splices: GT/AG | 4594566 Number of splices: GC/AG | 23888 Number of splices: AT/AC | 5051 Number of splices: Non-canonical | 5695 Mismatch rate per base, % | 0.37% Deletion rate per base | 0.01% Deletion average length | 1.57 Insertion rate per base | 0.01% Insertion average length | 1.43 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 282305 % of reads mapped to multiple loci | 2.28% Number of reads mapped to too many loci | 19107 % of reads mapped to too many loci | 0.15% UNMAPPED READS: % of reads unmapped: too many mismatches | 0.00% % of reads unmapped: too short | 11.02% % of reads unmapped: other | 0.07%
This is more complicated to parse than the expression data, so I've written a few commands to make this easier.
from glob import glob # We use "glob" instead of "iglob" here because we want the whole list around
import numpy as np
def make_unique(seq, idfun=None):
'''
if an object appears more than once in a list, append a letter to it
Modified from: http://www.peterbe.com/plog/uniqifiers-benchmark
'''
if idfun is None:
def idfun(x): return x
seen = {}
result = []
for item in seq:
marker = idfun(item)
# in old Python versions:
# if seen.has_key(marker)
# but in new ones:
if marker in seen:
seen[marker] += 1
result.append(item + string.lowercase[seen[marker] - 2])
continue
seen[marker] = 1
result.append(item)
return result
def try_converting_to_float(x):
try:
return float(x)
except ValueError:
return x
def fix_duplicate_columns(mapping_stats):
'''
Given a dataframe of mapping stats with "duplicate" column names
(actually duplicate column names aren't allowed in pandas,
but this assumes you have columns like M1_01 and M1_01a, and the M1_01a
column was created from the second *.Log.final.out file from RNA-STAR),
this detects the duplicate columns and sends them to merge_mapping_stats,
which combines the duplicate colmns into single column in a reasonable way
'''
duplicate_columns = mapping_stats.filter(regex='[a-z]$')
mapping_stats.drop((x[:-1] for x in duplicate_columns), axis=1)
merged_columns = {}
for col in duplicate_columns:
original_col = col[:-1]
merged_columns[original_col] = merge_mapping_stats(
mapping_stats.ix[:, col], mapping_stats.ix[:, original_col])
# Remove duplicately-named columns, e.g. M1_01 and M1_01a
mapping_stats = mapping_stats.drop((x for x in duplicate_columns), axis=1)
mapping_stats = mapping_stats.drop((x[:-1] for x in duplicate_columns),
axis=1)
merged_df = pd.DataFrame(data=merged_columns, index=mapping_stats.index)
mapping_stats = pd.concat((mapping_stats, merged_df), axis=1)
return mapping_stats
def log_final_out(glob_command, ids_function):
"""
Given a glob command describing where all the Log.final.out files are from
STAR, return a pd.DataFrame with each sample (id) as its own column.
glob_command : str
Where to look for the files, will be passed to "glob"
ids_function : function
A function (could be an anonymous function like a
lambda) that specifies how to get the sample ID from the filename. Could
also be a list of IDs, but they must be in the exact order as in the
directories, which is why a function can be easier.
Example:
>>> glob_command = '/Users/olga/workspace-git/single_cell/analysis/mapping_stats/*.Log.final.out'
>>> mapping_stats = log_final_out(glob_command,
... lambda x: '_'.join(x.split('/')[-1].split('_')[:2]))
"""
series = []
filenames = glob(glob_command)
if isinstance(ids_function, list):
ids = ids_function
else:
ids = [ids_function(filename) for filename in filenames]
for filename in filenames:
s = pd.read_table(filename, header=None, index_col=0, squeeze=True)
converted = [try_converting_to_float(x.strip('%'))
if type(x) != float else x for x in s]
series.append(pd.Series(converted, index=s.index))
mapping_stats = pd.concat(series, keys=make_unique(ids), axis=1)
new_index = [str(x).strip().rstrip(' |') for x in mapping_stats.index]
mapping_stats.index = new_index
mapping_stats = mapping_stats.dropna(how='all')
mapping_stats = fix_duplicate_columns(mapping_stats)
# Turn all the number of splicing events into percentages for statistical
# testing
number_splicing_event_names = ['Number of splices: Annotated (sjdb)',
'Number of splices: GT/AG',
'Number of splices: GC/AG',
'Number of splices: AT/AC',
'Number of splices: Non-canonical']
percent_splicing_event_names = [x.replace('Number of', '%')
for x in number_splicing_event_names]
total_splicing_events = mapping_stats.ix['Number of splices: Total',
:].replace(0, np.nan).values.astype(float)
pieces = []
for num_events in zip(number_splicing_event_names):
pieces.append(100.0 * mapping_stats.ix[num_events,
:].values \
/ total_splicing_events)
pieces = [np.reshape(piece, len(mapping_stats.columns)) for piece in pieces]
percent_splicing = pd.DataFrame(pieces, index=percent_splicing_event_names,
columns=mapping_stats.columns)
return pd.concat((mapping_stats, percent_splicing))
The files are all in /home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/
, so we set our glob command here:
glob_command = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/*.Log.final.out'
To get the ID from the filename, we will use a cute feature of Python which is called "lambda" functions, or "anonymous" functions that do something small.
import os
ids_function = lambda x: os.path.basename(x).split('.')[0]
What we've done here is take the "basename" of the file, meaning the name of the file without all the folders before it:
x = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/mapping_stats/M1_01.Log.final.out'
os.path.basename(x)
'M1_01.Log.final.out'
And then we split on the period and get the first item
os.path.basename(x).split('.')[0]
'M1_01'
mapping_stats = log_final_out(glob_command, ids_function)
mapping_stats.head()
M2_02 | M2nd_06 | M3_4 | M2_03 | N5_3 | M2_09 | P1_11 | M2nd_27 | P7_12 | M3_6 | ... | MSA_09 | M4_10 | P4_5 | P7_10 | M2_10 | MSA_05 | P7_1 | N4_10 | P1_07 | MSA_37 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Started job on | Apr 29 18:26:01 | Apr 29 19:34:06 | Apr 29 19:59:10 | Apr 29 18:43:16 | Apr 29 19:00:06 | Apr 29 19:28:44 | Apr 29 19:33:53 | Apr 29 18:40:07 | Apr 29 18:59:50 | Apr 29 19:38:15 | ... | Apr 29 18:56:06 | Apr 30 00:32:04 | Apr 29 18:41:49 | Apr 29 19:24:10 | Apr 29 18:36:03 | Apr 29 19:04:08 | Apr 29 18:08:00 | Apr 29 19:22:36 | Apr 29 19:48:11 | Apr 29 18:46:46 |
Started mapping on | Apr 29 18:26:38 | Apr 29 19:35:23 | Apr 29 20:00:54 | Apr 29 18:43:33 | Apr 29 19:00:24 | Apr 29 19:29:05 | Apr 29 19:34:12 | Apr 29 18:41:01 | Apr 29 19:00:09 | Apr 29 19:38:33 | ... | Apr 29 18:56:25 | Apr 30 00:32:42 | Apr 29 18:42:15 | Apr 29 19:24:30 | Apr 29 18:36:25 | Apr 29 19:04:33 | Apr 29 18:09:47 | Apr 29 19:22:59 | Apr 29 19:48:54 | Apr 29 18:47:18 |
Finished on | Apr 29 18:29:59 | Apr 29 19:37:07 | Apr 29 20:03:33 | Apr 29 18:46:39 | Apr 29 19:07:51 | Apr 29 19:32:06 | Apr 29 19:38:07 | Apr 29 18:41:41 | Apr 29 19:04:10 | Apr 29 19:40:48 | ... | Apr 29 18:58:02 | Apr 30 00:41:16 | Apr 29 18:56:23 | Apr 29 19:28:38 | Apr 29 18:40:01 | Apr 29 19:06:09 | Apr 29 18:27:44 | Apr 29 19:38:59 | Apr 29 19:52:13 | Apr 29 18:48:58 |
Mapping speed, Million of reads per hour | 233.39 | 214.27 | 135.81 | 284.74 | 149.2 | 287.58 | 296.02 | 126.21 | 229.51 | 211.58 | ... | 211.62 | 207.09 | 198.79 | 213.37 | 307.66 | 196.75 | 200.54 | 158.97 | 270.91 | 126.15 |
Number of input reads | 1.303068e+07 | 6190057 | 5998270 | 1.47114e+07 | 1.852628e+07 | 1.445902e+07 | 1.93232e+07 | 1402360 | 1.536468e+07 | 7934410 | ... | 5701984 | 2.956736e+07 | 4.68265e+07 | 1.46987e+07 | 1.845972e+07 | 5246777 | 5.999607e+07 | 4.239144e+07 | 1.497556e+07 | 3504200 |
5 rows × 256 columns
Again, we want to have a samples×features matrix, so we will transpose.
mapping_stats = mapping_stats.T
Now let's read in the splicing data. This gets complicated because of the filtering. We use MISO (v0.5.1) to quantify gene expression on our data, for all available splice types:
This is a figure showing what the different splice types look like, from Wang ET, et al. Nature (2008) paper:
Again, running MISO is out of the scope of this notebook, and we will focus on reading in the final files.
Here is an example of a script to calculate the "percent spliced-in" (PSI or "Ψ") scores. I didn't say it was pretty. One note is that we treat all samples as if they are single-end reads, because I believe MISO's paired-end read filters to be too strict (with the mean and standard deviation of the insert size), and then a lot of our data is thrown out.
cat /home/obotvinnik/scratch/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam.miso.sh
#!/bin/bash # Finding all MISO splicing scores for sample: M1_01. Yay! READ_LEN=$(samtools view /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam | head -n 1 | cut -f 10 | awk '{ print length }') # calculate Psi scores for all SE events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/SE_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE echo "MISO psi failed on event type: SE" exit 1 fi # Summarize psi scores for all SE events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/SE echo 'MISO psi failed on event type: SE' exit 1 fi # calculate Psi scores for all MXE events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/MXE_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE echo "MISO psi failed on event type: MXE" exit 1 fi # Summarize psi scores for all MXE events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/MXE echo 'MISO psi failed on event type: MXE' exit 1 fi # calculate Psi scores for all AFE events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/AFE_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE echo "MISO psi failed on event type: AFE" exit 1 fi # Summarize psi scores for all AFE events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/AFE echo 'MISO psi failed on event type: AFE' exit 1 fi # calculate Psi scores for all ALE events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/ALE_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE echo "MISO psi failed on event type: ALE" exit 1 fi # Summarize psi scores for all ALE events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/ALE echo 'MISO psi failed on event type: ALE' exit 1 fi # calculate Psi scores for all A3SS events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/A3SS_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS echo "MISO psi failed on event type: A3SS" exit 1 fi # Summarize psi scores for all A3SS events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A3SS echo 'MISO psi failed on event type: A3SS' exit 1 fi # calculate Psi scores for all A5SS events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/A5SS_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS echo "MISO psi failed on event type: A5SS" exit 1 fi # Summarize psi scores for all A5SS events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/A5SS echo 'MISO psi failed on event type: A5SS' exit 1 fi # calculate Psi scores for all RI events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/RI_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI echo "MISO psi failed on event type: RI" exit 1 fi # Summarize psi scores for all RI events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/RI echo 'MISO psi failed on event type: RI' exit 1 fi # calculate Psi scores for all TANDEMUTR events mkdir -p /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR python /projects/ps-yeolab/software/bin/miso --run /projects/ps-yeolab/genomes/hg19/miso/TANDEMUTR_index /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/M1_01.Aligned.out.sam.bam.sorted.bam --output-dir /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR --read-len $READ_LEN --settings-filename /projects/ps-yeolab/genomes/hg19/miso_annotations/miso_settings_min_event_reads10.txt -p 16 > /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.out 2> /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.err # Check that the psi calculation jobs didn't fail. #'-z' returns true when a string is empty, so this is checking that grepping these files for the words 'failed' and 'shutdown' didn't find anything. iffailed=$(grep failed /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.out) ifshutdown=$(grep shutdown /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/psi.err) if [ ! -z "$iffailed" -o ! -z "$ifshutdown" ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR echo "MISO psi failed on event type: TANDEMUTR" exit 1 fi # Summarize psi scores for all TANDEMUTR events python /home/yeo-lab/software/bin/run_miso.py --summarize-samples /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR >/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.out 2>/oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.err # Check that the summary jobs didn't fail # '-s' returns true if file size is nonzero, and the error file should be empty. if [ -s /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR/summary.err ] ; then #rm -rf /oasis/tscc/scratch/obotvinnik/mn_diff_singlecell/fastq/miso/M1_01/TANDEMUTR echo 'MISO psi failed on event type: TANDEMUTR' exit 1 fi
To read in the summary files that are created, I wrote this function:
def read_miso_summary(filename):
'''Read a miso summary, with some additional columns
Reads a MISO summary file as a pandas dataframe, and adds these columns:
1. a copy-paste-able genome location at the end, based on the minimum
mRNA_starts and maximum mRNA_ends. (df.genome_location)
2. The difference between df.ci_high and df.ci_low (df.ci_diff)
3. The left and right halves of the confidence interval, e.g. the right
half is df.ci_high - df.miso_posterior_mean. (df.ci_left_half and
df.ci_right_half)
4. The max of the two left and right confidence interval halves
(df.ci_halves_max)
Parameters
----------
filename : str or file buffer
Name of the miso
Returns
-------
miso_summary : pandas.DataFrame
A DataFrame of the miso summary for this sample and splice type
'''
df = pd.read_table(filename)
genome_location = pd.DataFrame(
['%s:%d-%d' % (chrom, min_csv(starts), max_csv(stops))
for chrom, starts, stops in zip(df.chrom,
df.mRNA_starts,
df.mRNA_ends)],
columns=['genome_location'], index=df.index)
ci_diff = pd.DataFrame(df.ci_high - df.ci_low, columns=['ci_diff'],
index=df.index)
ci_halves = pd.DataFrame(
{'ci_left_half': (df.ci_high - df.miso_posterior_mean),
'ci_right_half': (df.miso_posterior_mean - df.ci_low)},
index=df.index)
ci_halves_max = pd.DataFrame(ci_halves.max(axis=1),
columns=['ci_halves_max'])
return pd.concat([df, genome_location, ci_diff, ci_halves,
ci_halves_max], axis=1)
def max_csv(x):
'''
Of integers separated by commas, take the max
e.g. 75112684,75112684 would return 75112684
or 75112684,75112689 would return 75112689
'''
return max(map(int, x.split(',')))
def min_csv(x):
'''
Of integers separated by commas, take the minimum
e.g. 75112684,75112684 would return 75112684
or 75112684,75112689 would return 75112684
'''
return min(map(int, x.split(',')))
The miso summaries are in /home/obotvinnik/scratch/mn_diff_singlecell/fastq/miso_v1/SAMPLE_ID/SPLICE_TYPE/summary/SPLICE_TYPE.miso_summary
Some of the AFE and ALE splicing events have the same name, so I will add a prefix of "AFE|
"/"ALE|
" to the splicing event id ("event_name
")
dfs = []
glob_command = '/home/obotvinnik/scratch/mn_diff_singlecell/fastq/miso/*/*/summary/*.miso_summary'
afe_ale = ('AFE', 'ALE')
for filename in iglob(glob_command):
# Try to read in the data, unless it doesn't work, in which case,
# just skip this file and continue on to the next
try:
df = read_miso_summary(filename)
except:
continue
# Add columns for the sample id and splice type
sample_id = filename.split('/')[7]
splice_type = os.path.basename(filename).split('.')[0]
df['sample_id'] = sample_id
df['splice_type'] = splice_type
if splice_type in afe_ale:
df.event_name = splice_type + '|' + df.event_name
dfs.append(df)
Let's see what one of these dataframes looks like.
dfs[0].head()
event_name | miso_posterior_mean | ci_low | ci_high | isoforms | counts | assigned_counts | chrom | strand | mRNA_starts | mRNA_ends | genome_location | ci_diff | ci_left_half | ci_right_half | ci_halves_max | sample_id | splice_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr20:17870855:17870999:-@chr20:17870242:17870... | 0.98 | 0.94 | 1.00 | 'chr20:17870855:17870999:-@chr20:17870242:1787... | (0,0):81,(1,0):174,(1,1):13 | 0:187 | chr20 | - | 17922242,17922855 | 17922999,17922999 | chr20:17922242-17922999 | 0.06 | 0.02 | 0.04 | 0.04 | M3_11 | TANDEMUTR |
1 | chr20:32559260:32559399:+@chr20:32559400:32562... | 1.00 | 0.99 | 1.00 | 'chr20:32559260:32559399:+@chr20:32559400:3256... | (1,0):704 | 0:704 | chr20 | + | 33095599,33095599 | 33099198,33095738 | chr20:33095599-33099198 | 0.01 | 0.00 | 0.01 | 0.01 | M3_11 | TANDEMUTR |
2 | chr20:30247080:30247914:-@chr20:30243970:30247... | 1.00 | 0.99 | 1.00 | 'chr20:30247080:30247914:-@chr20:30243970:3024... | (0,0):1,(1,0):287 | 0:287 | chr20 | - | 30780309,30783419 | 30784253,30784253 | chr20:30780309-30784253 | 0.01 | 0.00 | 0.01 | 0.01 | M3_11 | TANDEMUTR |
3 | chr20:48800432:48803519:+@chr20:48803520:48803... | 0.12 | 0.05 | 0.23 | 'chr20:48800432:48803519:+@chr20:48803520:4880... | (0,0):22,(1,0):4,(1,1):784 | 0:27,1:761 | chr20 | + | 49367025,49367025 | 49370278,49370112 | chr20:49367025-49370278 | 0.18 | 0.11 | 0.07 | 0.11 | M3_11 | TANDEMUTR |
4 | chr20:18116103:18116679:+@chr20:18116680:18117... | 0.08 | 0.00 | 0.27 | 'chr20:18116103:18116679:+@chr20:18116680:1811... | (0,0):10,(1,1):27 | 0:0,1:27 | chr20 | + | 18168103,18168103 | 18169016,18168679 | chr20:18168103-18169016 | 0.27 | 0.19 | 0.08 | 0.19 | M3_11 | TANDEMUTR |
Now let's stack these dataframes on top of each other to create a really "tall" dataframe using concat
(concatenate) from pandas
.
miso_summary = pd.concat(dfs)
miso_summary.head()
event_name | miso_posterior_mean | ci_low | ci_high | isoforms | counts | assigned_counts | chrom | strand | mRNA_starts | mRNA_ends | genome_location | ci_diff | ci_left_half | ci_right_half | ci_halves_max | sample_id | splice_type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chr20:17870855:17870999:-@chr20:17870242:17870... | 0.98 | 0.94 | 1.00 | 'chr20:17870855:17870999:-@chr20:17870242:1787... | (0,0):81,(1,0):174,(1,1):13 | 0:187 | chr20 | - | 17922242,17922855 | 17922999,17922999 | chr20:17922242-17922999 | 0.06 | 0.02 | 0.04 | 0.04 | M3_11 | TANDEMUTR |
1 | chr20:32559260:32559399:+@chr20:32559400:32562... | 1.00 | 0.99 | 1.00 | 'chr20:32559260:32559399:+@chr20:32559400:3256... | (1,0):704 | 0:704 | chr20 | + | 33095599,33095599 | 33099198,33095738 | chr20:33095599-33099198 | 0.01 | 0.00 | 0.01 | 0.01 | M3_11 | TANDEMUTR |
2 | chr20:30247080:30247914:-@chr20:30243970:30247... | 1.00 | 0.99 | 1.00 | 'chr20:30247080:30247914:-@chr20:30243970:3024... | (0,0):1,(1,0):287 | 0:287 | chr20 | - | 30780309,30783419 | 30784253,30784253 | chr20:30780309-30784253 | 0.01 | 0.00 | 0.01 | 0.01 | M3_11 | TANDEMUTR |
3 | chr20:48800432:48803519:+@chr20:48803520:48803... | 0.12 | 0.05 | 0.23 | 'chr20:48800432:48803519:+@chr20:48803520:4880... | (0,0):22,(1,0):4,(1,1):784 | 0:27,1:761 | chr20 | + | 49367025,49367025 | 49370278,49370112 | chr20:49367025-49370278 | 0.18 | 0.11 | 0.07 | 0.11 | M3_11 | TANDEMUTR |
4 | chr20:18116103:18116679:+@chr20:18116680:18117... | 0.08 | 0.00 | 0.27 | 'chr20:18116103:18116679:+@chr20:18116680:1811... | (0,0):10,(1,1):27 | 0:0,1:27 | chr20 | + | 18168103,18168103 | 18169016,18168679 | chr20:18168103-18169016 | 0.27 | 0.19 | 0.08 | 0.19 | M3_11 | TANDEMUTR |
miso_summary.shape
(4895279, 18)
Whoa! This is a HUGE DataFrame! Almost 5 million rows :)
Next, we will filter the miso events on the confidence interval, and the number of "junction reads", or the counts of reads that MISO deemed unique to only one isoform (but not both).
import re
import sys
def counts_pair_to_ints(x):
"""Convert an isoform:counts pair to integers
>>> x = '(1,0):5084,'
>>> counts_pair_to_ints(x)
((1, 0), 5084)
"""
x = x.split(':')
isoforms = tuple(map(int, x[0].strip('()').split(',')))
counts = int(x[1].rstrip(','))
return isoforms, counts
def counts_col_to_dict(counts):
"""Convert a miso counts column value into a dictionary
>>> counts = '(0,0):12,(1,0):5084,(1,1):2036'
>>> counts_col_to_dict(counts)
{(0, 0): 12, (1, 0): 5084, (1, 1): 2036}
"""
return dict(map(counts_pair_to_ints, re.findall('\(\d,\d\):\d+,?', counts)))
def filter_miso_summary(summary, ci_diff_thresh=0.5,
specific_isoform_counts_thresh=5):
"""Filter a MISO summary dataframe created by read_miso_summary on
the maximum confidence interval size, and number of
"junction reads" (reads that are specific to one isoform)
From the MISO documentation:
(1,0):28,(0,1):50,(1,1):12, where 0 and 1 correspond to the first and
second isoforms. This indicates that 28 reads support isoform 1 but not
2, while 50 reads support the second but not the first, and 12 reads
support both isoforms. The field (0,0):N, if outputted, indicates the
number of reads N that mapped to the region but were thrown out (e.g.
because they are not consistent with the isoforms in the annotation,
violate overhang constraints, or other related reasons.)
Summarized:
- (1,0) reads support isoform1 and not isoform2
- (0,1) reads support isoform2 and not isoform1
- (1,1) reads support both isoforms
- (0,0) reads support neither isoform
For "junction reads", we filter on the number of "(1,0)" and "(0,1)" reads.
Parameters
----------
summary : pandas.DataFrame
A "tall" dataframe of all samples and splice types
Returns
-------
filtered_summary : pandas.DataFrame
Only the splicing events passing the filters
"""
original_events = summary.shape[0]
summary = summary.ix[summary.ci_diff <= ci_diff_thresh]
after_ci_events = summary.shape[0]
isoform_counts = pd.DataFrame.from_dict(dict(zip(summary.index,
summary.counts.map(
counts_col_to_dict).values)),
orient='index')
# Get counts that support only one specific isoform "junction reads"
specific_isoform_counts = isoform_counts.ix[:, [(0, 1), (1, 0)]].sum(axis=1)
# Filter on at least 10 "junction reads"
summary = summary.ix[
specific_isoform_counts >= specific_isoform_counts_thresh]
after_counts_events = summary.shape[0]
# Set the index as just the range now that we've filtered everything out
summary.index = np.arange(after_counts_events)
sys.stdout.write(' {} events removed with poor confidence (ci >{:.2f})\n'
.format(after_ci_events - original_events, ci_diff_thresh))
sys.stdout.write(' {} events removed with low read counts which are unique'
' to individual isoforms (n < {})\n'.format(
after_counts_events - after_ci_events, specific_isoform_counts_thresh))
return summary
filtered_miso_summary = filter_miso_summary(miso_summary)
-713830 events removed with poor confidence (ci >0.50) -538692 events removed with low read counts which are unique to individual isoforms (n < 5)
MISO, published in 2010, was created for use with reads of length 36 bp. One of the assumptions was that none of the splicing event isoforms will be shorter than the read length. But, there are some cases where, for example in a skipped exon event, the "short" isoform of exon1 and exon3 is shorter than our read length of 100bp. And MISO starts estimating all splicing scores at 0.5 and then has to get convinced one way or another with the reads to push the percent spliced-in (Ψ) value towards 0 or 1. So, we get a bunch of splicing events whose Ψ score is about 0.5, with very high confidence (the confidence interval is tiny [<0.1] because MISO was never convinced to 0 or 1, and just bounced around 0.5)
To filter for these events, we need the lengths of each isoforms, which we will calculate using gffutils
for gff in glob('/projects/ps-yeolab/genomes/hg19/miso/*.hg19.gff3'):
print 'Creating gffutils database for {}...'.format(gff)
db_filename = gff + '.db'
try:
%time db = gffutils.create_db(gff, db_filename)
except:
# The database already exists
db = gffutils.FeatureDB(db_filename)
isoform_lengths = defaultdict(dict)
for mrna in db.features_of_type('mRNA'):
miso_id = mrna.id.rstrip('AB.')
isoform_id = mrna.id[-1]
isoform_lengths[isoform_id][miso_id] = sum(len(exon) for exon in db.children(mrna))
isoform_lengths = pd.DataFrame(isoform_lengths)
csv = gff.replace('.gff3', '.isoform_lengths.csv')
isoform_lengths.to_csv(csv)
Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/AFE.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/TandemUTR.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/TANDEMUTR.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/SE.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/MXE.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/A5SS.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/RI.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/ALE.hg19.gff3... Creating gffutils database for /projects/ps-yeolab/genomes/hg19/miso/A3SS.hg19.gff3...
Some of the AFE and ALE splicing events have the same name, so I will add a prefix of "AFE|
"/"ALE|
" to the splicing event id (the index)
glob_command = '/projects/ps-yeolab/genomes/hg19/miso/*isoform_lengths.csv'
dfs = []
for filename in glob(glob_command):
splice_type = os.path.basename(filename).split('.')[0]
df = pd.read_csv(filename, index_col=0)
if splice_type in afe_ale:
df.index = splice_type + '|' + df.index
dfs.append(df)
isoform_lengths = pd.concat(dfs)
isoform_lengths.head()
A | B | |
---|---|---|
ALE|10000@uc001iab.1@uc001hzz.1 | 2153 | 268 |
ALE|10001@uc001xmg.1@uc001xmf.1 | 478 | 704 |
ALE|10002@uc002ath.1@uc002ati.1 | 918 | 705 |
ALE|100036519@uc004acn.1@uc004afx.1 | 3109 | 3109 |
ALE|10003@uc001pdd.2@uc001pde.2 | 1193 | 666 |
Now we can remove the offending splicing events. Anything which has a single isoform length smaller than our read length (100bp), we will throw out.
read_length = 100
short_isoform_lengths = isoform_lengths[isoform_lengths < read_length]
short_miso_ids = short_isoform_lengths.index[short_isoform_lengths.T.any()]
filtered_miso_summary = filtered_miso_summary.ix[~filtered_miso_summary.event_name.isin(short_miso_ids)]
Now we will create a (n_samples
× n_events
) dataframe by using the pivot()
function in pandas
:
splicing = filtered_miso_summary.pivot(columns='event_name', index='sample_id', values='miso_posterior_mean')
splicing.head()
event_name | AFE|100093631@uc003txp.1@uc003ubw.1 | AFE|100101467@uc002kyl.2@uc002kym.1 | AFE|100101467@uc002kyl.2@uc002kyn.1 | AFE|10011@uc003lfz.1@uc003lga.1 | AFE|100128553@uc010lpe.1@uc010lpc.1 | AFE|100129405@uc009wrf.1@uc001flw.2 | AFE|100129405@uc009wrf.1@uc009wre.1 | AFE|100129583@uc003hjw.1uc010ijf.1@uc003hjv.1 | AFE|100129583@uc003hjx.1uc003hjy.1@uc003hjv.1 | AFE|100130742@uc003yew.1@uc010mal.1uc003yev.1 | ... | chrY:2803518:2803810:+@chrY:2821950:2822038:+@chrY:2843136:2843285:+ | chrY:2803518:2803810:+@chrY:2829115:2829687:+@chrY:2843136:2843285:+ | chrY:2821950:2822038:+@chrY:2829115:2829687:+@chrY:2843136:2843285:+ | chrY:2843552:2843695:+@chrY:2844711:2844863:+@chrY:2845981:2846121:+ | chrY:4868267:4868646:+@chrY:4870729:4870799:+@chrY:4899947:4900052:+@chrY:4900711:4900769:+ | chrY:4868267:4868646:+@chrY:4898940:4899108:+@chrY:4899947:4900052:+ | chrY:6780129:6780213:+@chrY:6846254:6846284:+@chrY:6863845:6863939:+@chrY:6889490:6889578:+ | chrY:6846254:6846284:+@chrY:6863845:6863939:+@chrY:6889490:6889578:+ | chrY:6889490:6889578:+@chrY:6893076:6893183:+@chrY:6911021:6911166:+ | chrY:6931938:6932190:+@chrY:6934736:6934869:+@chrY:6938237:6938369:+ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sample_id | |||||||||||||||||||||
CVN_01 | NaN | NaN | NaN | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CVN_02 | NaN | NaN | NaN | 0.01 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CVN_04 | NaN | NaN | NaN | 0.05 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CVN_05 | NaN | NaN | NaN | 0.00 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
CVN_09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 0.99 | 0.99 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 86511 columns
splicing.to_csv('/projects/ps-yeolab/obotvinnik/mn_diff_singlecell/data/splicing/splicing.csv')
splicing.columns.map(lambda x: x.startswith('ALE')).sum()
8913
splicing_filtered = splicing.dropna(axis=1, thresh=10)
splicing_filtered.shape
(231, 59152)
splicing_filtered.columns.map(lambda x: x.startswith('ALE')).sum()
6979
splicing.columns.map(lambda x: len(x.split('@')) == 2 and x.startswith('chr')).sum()
23107
splicing.columns.map(lambda x: len(x.split('@')) == 3).sum()
56971
splicing.columns.map(lambda x: len(x.split('@')) == 4).sum()
6146