#!/usr/bin/env python
# coding: utf-8

# SAM
# ===
# 
# This notebook explores parsing and understanding the SAM (Sequence Alignment/Map) and related BAM format.  SAM is an extremely common format for representing read alignments.  Most widely-used read alignment tools output SAM alignments.
# 
# SAM is a text format.  There is a closely related binary format called BAM.  There are two types of BAM files: unsorted or sorted.  Various tools, notably [SAMtools] and [Picard], can convert back and forth between SAM and BAM, and can sort an existing BAM file.  When we say a BAM file is sorted we almost always mean that the alignments are sorted left-to-right along the reference genome.  (It's also possible to sort a BAM file by read name, though that's only occassionally useful.)
# 
# Once you have an interesting set of read alignments that you would like to keep for a while and perhaps analyze further, it's a good idea to keep them in *sorted BAM* files.  This is because:
# 
# 1. They will be well compressed.  BAM files are smaller than corresponding SAM files, and *sorted* BAM files are smaller than corresponding *unsorted* BAM files.
# 2. From a sorted BAM file, it's easy to extract just the alignments that overlap a specified stretch of the genome, making it easy to convert from sorted BAM to many other useful formats.
# 
# That said, most alignment tools output SAM (not BAM), and the alignments come out in an arbitrary order -- not sorted.
# 
# An authoritative and complete document describing the SAM and BAM formats is the [SAM specification].  This document is thorough, but, being a specification, it does not describe is the various "dialects" of legal SAM emitted by popular tools.  I'll cover some of that here.
# 
# [SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf
# [SAMtools]: http://samtools.sourceforge.net
# [Picard]: http://picard.sourceforge.net

# In[1]:


# Here's a string representing a three-line SAM file.  I'm temporarily
# ignoring the fact that SAM files usually have several header lines at
# the beginning.
samStr = '''\
r1	0	gi|9626243|ref|NC_001416.1|	18401	42	122M	*	0	0	TGAATGCGAACTCCGGGACGCTCAGTAATGTGACGATAGCTGAAAACTGTACGATAAACNGTACGCTGAGGGCAGAAAAAATCGTCGGGGACATTNTAAAGGCGGCGAGCGCGGCTTTTCCG	+"@6<:27(F&5)9)"B:%B+A-%5A?2$HCB0B+0=D<7E/<.03#!.F77@6B==?C"7>;))%;,3-$.A06+<-1/@@?,26">=?*@'0;$:;??G+:#+(A?9+10!8!?()?7C>	AS:i:-5	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:59G13G21G26	YT:Z:UU
r2	0	gi|9626243|ref|NC_001416.1|	8886	42	275M	*	0	0	NTTNTGATGCGGGCTTGTGGAGTTCAGCCGATCTGACTTATGTCATTACCTATGAAATGTGAGGACGCTATGCCTGTACCAAATCCTACAATGCCGGTGAAAGGTGCCGGGATCACCCTGTGGGTTTATAAGGGGATCGGTGACCCCTACGCGAATCCGCTTTCAGACGTTGACTGGTCGCGTCTGGCAAAAGTTAAAGACCTGACGCCCGGCGAACTGACCGCTGAGNCCTATGACGACAGCTATCTCGATGATGAAGATGCAGACTGGACTGC	(#!!'+!$""%+(+)'%)%!+!(&++)''"#"#&#"!'!("%'""("+&%$%*%%#$%#%#!)*'(#")(($&$'&%+&#%*)*#*%*')(%+!%%*"$%"#+)$&&+)&)*+!"*)!*!("&&"*#+"&"'(%)*("'!$*!!%$&&&$!!&&"(*"$&"#&!$%'%"#)$#+%*+)!&*)+(""#!)!%*#"*)*')&")($+*%%)!*)!('(%""+%"$##"#+(('!*(($*'!"*('"+)&%#&$+('**$$&+*&!#%)')'(+(!%+	AS:i:-14	XN:i:0	XM:i:8	XO:i:0	XG:i:0	NM:i:8	MD:Z:0A0C0G0A108C23G9T81T46	YT:Z:UU
r3	16	gi|9626243|ref|NC_001416.1|	11599	42	338M	*	0	0	GGGCGCGTTACTGGGATGATCGTGAAAAGGCCCGTCTTGCGCTTGAAGCCGCCCGAAAGAAGGCTGAGCAGCAGACTCAAGAGGAGAAAAATGCGCAGCAGCGGAGCGATACCGAAGCGTCACGGCTGAAATATACCGAAGAGGCGCAGAAGGCTNACGAACGGCTGCAGACGCCGCTGCAGAAATATACCGCCCGTCAGGAAGAACTGANCAAGGCACNGAAAGACGGGAAAATCCTGCAGGCGGATTACAACACGCTGATGGCGGCGGCGAAAAAGGATTATGAAGCGACGCTGTAAAAGCCGAAACAGTCCAGCGTGAAGGTGTCTGCGGGCGAT	7F$%6=$:9B@/F'>=?!D?@0(:A*)7/>9C>6#1<6:C(.CC;#.;>;2'$4D:?&B!>689?(0(G7+0=@37F)GG=>?958.D2E04C<E,*AD%G0.%$+A:'H;?8<72:88?E6((CF)6DF#.)=>B>D-="C'B080E'5BH"77':"@70#4%A5=6.2/1>;9"&-H6)=$/0;5E:<8G!@::1?2DC7C*;@*#.1C0.D>H/20,!"C-#,6@%<+<D(AG-).?&#0.00'@)/F8?B!&"170,)>:?<A7#1(A@0E#&A.*DC.E")AH"+.,5,2>5"2?:G,F"D0B8D-6$65D<D!A/38860.*4;4B<*31?6	AS:i:-22	XN:i:0	XM:i:8	XO:i:0	XG:i:0	NM:i:8	MD:Z:80C4C16A52T23G30A8T76A41	YT:Z:UU'''


# In[2]:


# I'll read this string in line-by-line as though it were a file.
# I'll (lightly) parse the alignment records as I go.
import string
from io import StringIO  # reading from string rather than file
for ln in StringIO(samStr):
    qname, flag, rname, pos, mapq, cigar, rnext, \
    pnext, tlen, seq, qual, extras = str.split(ln, '\t', 11)
    print(qname, len(seq)) # print read name, length of read sequence


# ### SAM fields
# 
# As you can see from this last example, each SAM record is on a separate line, and it consists of several tab-delimited fields.  The [SAM specification] is the authoritative source for information on what all the fields mean exactly, but here's a brief summary of each of the fields (in order):
# 
# 1. `qname` is the name of the read
# 2. `flags` is a bit field encoding some yes/no pieces of information about whether and how the read aligned
# 3. `rname` is the name of the reference sequence that the read aligned to (if applicable).  E.g., might be "chr17" meaning "chromosome 17"
# 4. `pos` is the 1-based offset into the reference sequence where the read aligned.
# 5. `mapq` for an aligned read, this is a confidence value; high when we're very confident we've found the correct alignment, low when we're not confident
# 6. `cigar` indicates where any gaps occur in the alignment
# 7. `rnext` only relevant for paired-end reads; name of the reference sequence where other end aligned
# 8. `pnext` only relevant for paired-end reads; 1-based offset into the reference sequence where other end aligned
# 9. `tlen` only relevant for paired-end reads; fragment length inferred from alignment
# 10. `seq` read sequence. If read aligned to reference genome's reverse-complement, this is the reverse complement of the read sequence.
# 11. `qual` quality sequence. If read aligned to reference genome's reverse-complement, this is the reverse of the quality sequence.
# 12. `extras` tab-separated "extra" fields, usually optional and aligner-specific but often very important!
# 
# [SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf
# 

# Field 1, `qname` is the name of the read.  Read names often contain information about:
# 
# 1. The scientific study for which the read was sequenced.
# 2. The sequencing instrument, and the exact [part of the sequencing instrument](http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers), where the DNA was sequenced.

# Field 5, `mapq`, encodes the probability *p* that the alignment reported is *incorrect*.  The probability is encoded as an integer *Q* on the Phred scale:
# 
# $$ Q = -10 \cdot \log_{10}(p) $$

# Fields 7, 8 and 9 (`rnext`, `pnext` and `tlen`) are only relevant if the read is part of a *pair*.  By way of background, sequencers can be configured to report *pairs* of DNA snippets that appear *close to each other* in the genome.  To accomplish this, the sequencer sequences *both ends* of a longer *fragment* of DNA.  When this is the case `rnext` and `pnext` tell us where the *other* end of the pair aligned, and `tlen` tells us the length of the fragment, as inferred from the alignments of the two ends.

# Field 10, `seq`, is the nucleotide sequence of the read.  The nucleotide sequence is *reverse complemented* if the read aligned to the reverse complement of the reference genome.  (This is equivalent to the reverse complement of the read aligning to the genome.)  `seq` can contain the character "`N`".  `N` essentially means "no confidence." The sequencer knows there's a nucleotide there but doesn't know whether it's an A, C, G or T.

# Field 11, `qual`, is the quality sequence of the read.  Each nucleotide in `seq` has a corresponding quality value in this string.  A nucleotide's quality value encodes the probability that the nucleotide was *incorrectly called* by the sequencing instrument and its software.  For details on this encoding, see the [FASTQ notebook](http://nbviewer.ipython.org/github/BenLangmead/comp-genomics-class/blob/master/notebooks/FASTQ.ipynb#quality-values).

# ### Flags
# 
# The `flags` field is a bitfield.  Individual bits correspond to certain yes/no properties of the alignment.  Here are the most relevant ones:
# 
# * Bit 0 (least significant): 1 if read is paired-end, 0 otherwise
# * Bit 1: for paired-end reads only: 1 if the pair aligns concordantly, 0 otherwise
# * Bit 2: 1 if read failed to align, 0 otherwise
# * Bit 3: for paried-end reads only: 1 if the other end failed to align, 0 otherwise
# * Bit 4: 1 if read aligned to Crick strand, 0 if Watson strand
# * Bit 5: for paired-end reads only: 1 if the other end aligned to Crick strand, 0 if Watson strand
# * Bit 6: for paired-end reads only: 1 if this is the first (#1) end, 0 if this is the second (#2) end
# * Bit 7: for paired-end reads only: 0 if this is the first (#1) end, 1 if this is the second (#2) end
# 
# There are a few more that are used less often; see the [SAM specification] for details.

# ### Alignment score
# 
# How do we know how *good* an alignment is, i.e., how well the read sequence matches the corresponding referene sequence.  This information is spread across a few places:
# 
# 1. The `AS:i` extra field
# 2. The `cigar` field
# 3. The `MD:Z` extra field
# 
# While the `AS:i` extra field is not required by the specification, all the most popular tools output it.  The integer appearing in this field is an *alignment score*.  The higher the score, the more similar the read sequence is to the reference sequence.
# 
# Different tools use differen scales for `AS:i`.  Sometimes (e.g. in [Bowtie 2]'s `--end-to-end` alignment mode)
# 
# [Bowtie 2]: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

# ### End-to-end versus local alignment
# 
# Some alignment tools such as [Bowtie], [BWA] and [Bowtie 2] in `--end-to-end` mode, will attempt to align a sequencing read end-to-end.  In other words, the read will align such that *every* nucleotide of the read participates.
# 
# [Bowtie]: http://bowtie-bio.sourceforge.net/index.shtml
# [Bowtie 2]: http://bowtie-bio.sourceforge.net/bowtie2
# [BWA]: http://bio-bwa.sourceforge.net/

# ### Alignment shape

# We would like to know exactly how the read aligned to the reference, including where all the mismatches and gaps are, and which characters appear opposite the gaps.  This is not possible just by looking at the read sequence together with the CIGAR string.  That doesn't tell us what reference characters appear in the mismatched position, or in the positions involved in deletions.  Instead, we combine information from the (a) read sequence, (b) CIGAR string, and (c) `MD:Z` string.  The CIGAR and `MD:Z` strings are both described in the [SAM specification].
# 
# [SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf
# 
# First we construct a function to parse the CIGAR field:

# In[3]:


def cigarToList(cigar):
    ''' Parse CIGAR string into a list of CIGAR operations.  For more
        info on CIGAR operations, see SAM spec:
        http://samtools.sourceforge.net/SAMv1.pdf '''
    ret, i = [], 0
    op_map = {'M':0, # match or mismatch
              '=':0, # match
              'X':0, # mismatch
              'I':1, # insertion in read w/r/t reference
              'D':2, # deletion in read w/r/t reference
              'N':3, # long gap due e.g. to splice junction
              'S':4, # soft clipping due e.g. to local alignment
              'H':5, # hard clipping
              'P':6} # padding
    # Seems like = and X together are strictly more expressive than M.
    # Why not just have = and X and get rid of M?  Space efficiency,
    # mainly.  The titans discuss: http://www.biostars.org/p/17043/
    while i < len(cigar):
        run = 0
        while i < len(cigar) and cigar[i].isdigit():
            # parse one more digit of run length
            run *= 10
            run += int(cigar[i])
            i += 1
        assert i < len(cigar)
        # parse cigar operation
        op = cigar[i]
        i += 1
        assert op in op_map
        # append to result
        ret.append([op_map[op], run])
    return ret


# In[4]:


cigarToList('10=1X10=')


# Next we construct a function to parse the `MD:Z` extra field:

# In[5]:


def mdzToList(md):
    ''' Parse MD:Z string into a list of operations, where 0=match,
        1=read gap, 2=mismatch. '''
    i = 0;
    ret = [] # list of (op, run, str) tuples
    while i < len(md):
        if md[i].isdigit(): # stretch of matches
            run = 0
            while i < len(md) and md[i].isdigit():
                run *= 10
                run += int(md[i])
                i += 1 # skip over digit
            if run > 0:
                ret.append([0, run, ""])
        elif md[i].isalpha(): # stretch of mismatches
            mmstr = ""
            while i < len(md) and md[i].isalpha():
                mmstr += md[i]
                i += 1
            assert len(mmstr) > 0
            ret.append([1, len(mmstr), mmstr])
        elif md[i] == "^": # read gap
            i += 1 # skip over ^
            refstr = ""
            while i < len(md) and md[i].isalpha():
                refstr += md[i]
                i += 1 # skip over inserted character
            assert len(refstr) > 0
            ret.append([2, len(refstr), refstr])
        else:
            raise RuntimeError('Unexpected character in MD:Z: "%d"' % md[i])
    return ret


# In[6]:


# Each element in the list returned by this call is itself a list w/ 3
# elements.  Element 1 is the MD:Z operation (0=match, 1=mismatch,
# 2=deletion).  Element 2 is the length and element 3 is the relevant
# sequence of nucleotides from the reference.
mdzToList('10A5^AC6')


# Now we can write a fucntion that takes a read sequennce, a parsed CIGAR string, and a parse `MD:Z` string and combines information from all three to make what I call a "stacked alignment."

# In[7]:


def cigarMdzToStacked(seq, cgp, mdp_orig):
    ''' Takes parsed CIGAR and parsed MD:Z, generates a stacked alignment:
        a pair of strings with gap characters inserted (possibly) and where
        characters at at the same offsets are opposite each other in the
        alignment.  Only knows how to handle CIGAR ops M=XDINSH right now.
    '''
    mdp = mdp_orig[:]
    rds, rfs = [], []
    mdo, rdoff = 0, 0
    for c in cgp:
        op, run = c
        skipping = (op == 4 or op == 5)
        assert skipping or mdo < len(mdp)
        if op == 0: # CIGAR op M, = or X
            # Look for block matches and mismatches in MD:Z string
            mdrun = 0
            runleft = run
            while runleft > 0 and mdo < len(mdp):
                op_m, run_m, st_m = mdp[mdo]
                run_comb = min(runleft, run_m)
                runleft -= run_comb
                assert op_m == 0 or op_m == 1
                rds.append(seq[rdoff:rdoff + run_comb])
                if op_m == 0: # match from MD:Z string
                    rfs.append(seq[rdoff:rdoff + run_comb])
                else: # mismatch from MD:Z string
                    assert len(st_m) == run_comb
                    rfs.append(st_m)
                mdrun += run_comb
                rdoff += run_comb
                # Stretch of matches in MD:Z could span M and I CIGAR ops
                if run_comb < run_m:
                    assert op_m == 0
                    mdp[mdo][1] -= run_comb
                else:
                    mdo += 1
        elif op == 1: # CIGAR op I
            rds.append(seq[rdoff:rdoff + run])
            rfs.append("-" * run)
            rdoff += run
        elif op == 2: # D
            op_m, run_m, st_m = mdp[mdo]
            assert op_m == 2
            assert run == run_m
            assert len(st_m) == run
            mdo += 1
            rds.append("-" * run)
            rfs.append(st_m)
        elif op == 3: # N
            rds.append("-" * run)
            rfs.append("-" * run)
        elif op == 4: # S
            rds.append(seq[rdoff:rdoff + run].lower())
            rfs.append(' ' * run)
            rdoff += run
        elif op == 5: # H
            rds.append('!' * run)
            rfs.append(' ' * run)
        elif op == 6: # P
            raise RuntimeError("Don't know how to handle P in CIGAR")
        else:
            raise RuntimeError('Unexpected CIGAR op: %d' % op)
    assert mdo == len(mdp)
    return ''.join(rds), ''.join(rfs)


# In[8]:


# Following example includes gaps and mismatches
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGATAAACC', cigarToList('12M2D17M2I14M'), mdzToList('12^AT30G0'))


# In[9]:


# Following example also includes soft clipping (CIGAR: S)
# SAM spec: Soft clipping: "clipped sequences present in SEQ"
# We print them in lowercase to emphasize their clippedness
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S'), mdzToList('12^AT25'))


# In[10]:


# Following example also includes hard clipping (CIGAR: H)
# SAM spec: Hard clipping: "clipped sequences NOT present in SEQ"
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC', cigarToList('12M2D17M2I8M6S3H'), mdzToList('12^AT25'))
# Note: don't see hard clipping in practice much


# In[11]:


# Following example also includes skipping (CIGAR: N), as seen in
# TopHat alignments
cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
                  cigarToList('12M2D10M10N7M2I8M6S3H'),
                  mdzToList('12^AT25'))


# From the stacked alignment, it's easy to do other things.  E.g. we can turn a stacked alignment into a new CIGAR string that uses the `=` and `X` operations instead of the less specific `M` operation:

# In[12]:


def cigarize(rds, rfs):
    off = 0
    oplist = []
    lastc, cnt = '', 0
    for i in range(len(rds)):
        c = None
        if rfs[i] == ' ':
            c = 'S'
        elif rds[i] == '-' and rfs[i] == '-':
            c = 'N'
        elif rds[i] == '-':
            c = 'D'
        elif rfs[i] == '-':
            c = 'I'
        elif rds[i] != rfs[i]:
            c = 'X'
        else:
            c = '='
        if c == lastc:
            cnt += 1
        else:
            if len(lastc) > 0:
                oplist.append((lastc, cnt))
            lastc, cnt = c, 1
    if len(lastc) > 0:
        oplist.append((lastc, cnt))
    return ''.join(map(lambda x: str(x[1]) + x[0], oplist))


# In[13]:


x, y = cigarMdzToStacked('ACGTACGT', cigarToList('8M'), mdzToList('4G3'))


# In[14]:


cigarize(x, y)


# In[15]:


x, y = cigarMdzToStacked('GGACGCTCAGTAGTGACGATAGCTGAAAACCCTGTACGAGAAGCC',
                         cigarToList('12M2D10M10N7M2I8M6S3H'),
                         mdzToList('12^AT25'))


# In[16]:


cigarize(x, y)


# ### Other resources
# 
# * [SAM specification]
# * [SAMtools] (and associated [publication](http://bioinformatics.oxfordjournals.org/content/25/16/2078.full))
# * [pysam] (Python library)
# * [Picard] (Java library)
# * [Bamtools] (C++ library)
# * [Rsamtools] (R library)
# 
# [SAM specification]: http://samtools.sourceforge.net/SAMv1.pdf
# [SAMtools]: http://samtools.sourceforge.net/
# [pysam]: http://code.google.com/p/pysam/
# [Picard]: http://picard.sourceforge.net/
# [Bamtools]: http://sourceforge.net/projects/bamtools/
# [Rsamtools]: http://www.bioconductor.org/packages/2.14/bioc/html/Rsamtools.html
# 
# © Copyright [Ben Langmead](http://www.cs.jhu.edu/~langmea) 2014