# Plane-Sweep Coverage analysis

## Michael Schatz ([email protected])

In this notebook, we will consider the problem of computing the coverage across a sequence given a set of alignments. The alignments could come from reads aligned to a genome, or the layout of reads within a newly assembled contig. It could also be used to track other non-read features, such as the gene regions or other annotations.

For example, given these five read alignments, the goal is to compute the coverage vector below it:

pos:    1    10    20   30   40   50    60    70    80     90
r1:     [==========================]
r2:          [================]
r3:                 [================================]
r4:                      [=====================]
r5:                                     [===================]

cov:    11111222222233333444444333332222333333332222221111111



And here is an example with real data: blue and red lines represent PacBio reads, with the coverage profile above them in black, and green marking the position of maximum coverage:

## The main result is with a smart algorithm and smart data structure we can accelerate the basic brute force approaches by **over 300 fold!**

Note the algorithms and data structures presented are also generally useful for many other genomic analyses working with intervals.

Part I: Brute Force Analysis

Part II: Plane Sweep Analysis

# Part I: Brute force analyiss

## 1. Extract the PacBio read placements from the Celera Assembler¶

To get started, we will first load in the layout of PacBio reads within a contig assembled using the Celera Assemler. In this example, we will be using the read placements from our recent assembly of the SKBR3 Breast Cancer Cell Line. The information of how reads are placed is stored within the "tigStore", which is an on-disk binary database. Because it is a binary database, you cannot read it directly, but instead should use the 'tigStore' command to query or update the database.

These commands will display the unitig layouts:

$cd /seq/schatz/mnattest/skbr3/local_assembly/assembly_HLA_Feb17/CA/r1$ /sonas-hs/schatz/hpc/home/gurtowsk/sources/wgs.svn/Linux-amd64/bin/tigStore -g r1.gkpStore -t r1.tigStore/ 1 -d layout -U | head -20
unitig 0
len 0
cns
qlt
data.unitig_coverage_stat  1.000000
data.unitig_microhet_prob  1.000000
data.unitig_status         X
data.unitig_suggest_repeat F
data.unitig_suggest_unique F
data.unitig_force_repeat   F
data.unitig_force_unique   F
data.contig_status         U
data.num_frags             2884
data.num_unitigs           0
FRG type R ident     18318 container         0 parent     16010 hang   2962   1054 position  19674      0
FRG type R ident      2248 container     18318 parent     18318 hang    217  -4192 position  15480    218
FRG type R ident      8536 container      2248 parent      2248 hang    609  -4597 position    827  10881
FRG type R ident     10602 container      2248 parent      2248 hang    848   -420 position  15057   1067
FRG type R ident     16010 container         0 parent     18318 hang   1054   2962 position  22636   1089
FRG type R ident      4611 container     18318 parent     18318 hang   1155  -3014 position   1157  16657



This shows the first few reads (fragments, coded as FRG) within unitig 0. Note here we are looking at the "version 1" unitigs, meaning the raw output of the unitigger based on the overlapper before consensus has run. Consequently the unitig length is listed as 0 bp, although there are many reads (2884) reads in this unitig. Also note the unitig consensus and quality strings are also both empty for this reads.

Finding the longest reads in the assembly

For the coverage analysis, the most important values are the last two numbers for the FRG records which encodes the position of the read in the contig. These positions are based on the pair-wise overlap of each read with its immediate neighbors, so the positions may change a little during consensus generation but will show the overall trends. For example, the first read in this unitig is read 18318, and spans from position 19674 to position 0. Because the start position is greater than the end, we know this read has been reverse complemented and really starts at position 0.

For PacBio assemblies, we are often interested in how well the longest reads were assembled. This command will determine which unitig has the longest read in it by annotating each FRG record with the unitig id and the read length as fields 1 and 2:

$/sonas-hs/schatz/hpc/home/gurtowsk/sources/wgs.svn/Linux-amd64/bin/tigStore -g r1.gkpStore -t r1.tigStore/ 1 -d layout -U \ | awk '{if (/^unitig/){tig=$2}else if(/^FRG/){l=$15-$14;if(l<0){l*=-1;} print tig,l,$0}}' \ | sort -nrk2 | head -10 | column -t 3 37839 FRG type R ident 5708 container 0 parent 10683 hang 122 20199 position 416262 454101 3 36519 FRG type R ident 5022 container 0 parent 167 hang 1199 9553 position 402020 365501 4 35962 FRG type R ident 3652 container 0 parent 10438 hang 199 13564 position 15094 51056 0 34550 FRG type R ident 11392 container 0 parent 11328 hang 1010 18989 position 182086 216636 15 34302 FRG type R ident 2622 container 0 parent 5938 hang 6855 7174 position 948470 914168 15 33695 FRG type R ident 16040 container 0 parent 16226 hang 5741 9003 position 539963 573658 15 33671 FRG type R ident 5938 container 0 parent 11492 hang 3720 16069 position 941296 907625 17 33504 FRG type R ident 2341 container 0 parent 9030 hang 192 15632 position 382387 415891 3 33148 FRG type R ident 8132 container 0 parent 7619 hang 1171 14849 position 162425 129277 3 32963 FRG type R ident 5377 container 0 parent 14108 hang 1184 17010 position 102322 69359  This means that unitig 3 (column 1) has the longest read at 38,839bp which is read 5708. It also has the second longest read with 36,519bp, which is read 5022. Report the layout of a selected unitig This command will report the read layout for just unitig 3, and just save the readid, start position, and end position to a file: $ /sonas-hs/schatz/hpc/home/gurtowsk/sources/wgs.svn/Linux-amd64/bin/tigStore -g r1.gkpStore -t r1.tigStore/ 1 -d layout -u 3 \
| grep '^FRG' | awk '{print $5,$14,$15}' > ~/readid.start.stop.txt  With head and tail we can check on the output format $ head -3 ~/readid.start.stop.txt
1 0 19814
2 799 19947
3 1844 13454

1871 973590 965902
1872 966703 973521
1873 973632 966946



Now we can load this file and examine the coverage across the contig

The expected file format is: readid \t startpos \t endpos

If startpos > endpos, the read has been reverse complemented

It would also be very easy to update the code to process BED, GTF, GFF or even SAM files.

Here is the example readid.start.stop.txt file from the breast cancer assembly.

In [3]:
import matplotlib.pyplot as plt
from collections import deque
import heapq
import time

## Limit the number of reads to plot

## This will have a list of tuples containing (readid, start, end, rc)
## rc is 0 if the read was originally oriented forward, 1 for reverse

totallen = 0

for line in f:
line = line.rstrip()
fields = line.split()
start   = int(fields[1])
end     = int(fields[2])
rc      = 0

if (start > end):
start = int(fields[2])
end   = int(fields[1])
rc = 1

if (end > totallen):
totallen = end

break

for i in xrange(3):

print "  ..."

print "\n\n"

Loading reads from /Users/mschatz/build/planesweep/readid.start.stop.txt
0: read1 [0 - 19814] 0
1: read2 [799 - 19947] 0
2: read3 [1844 - 13454] 0
...
1870: read1871 [965902 - 973590] 1
1871: read1872 [966703 - 973521] 0
1872: read1873 [966946 - 973632] 1



## 3. Plot the read layouts¶

In [22]:
## Note to keep it tractable, we only plot the first MAX_READ_LAYOUT reads

plt.figure(figsize=(15,4), dpi=100)

print "Plotting layout"

## draw the layout of reads
start  = r[1]
end    = r[2]
rc     = r[3]
color  = "blue"

if (rc == 1):
color = "red"

plt.plot ([start,end], [-2*i, -2*i], lw=4, color=color)

plt.draw()
plt.show()

Plotting layout


## 4. Brute force the coverage profile¶

There are several techniques for computing the coverage profile. One basic approach would be to separately compute how many reads span each position in the genome. This would take O(G * N) time where G is the length of the genome, and N is the number of reads. This is easy to see since it requires O(1) time to detemine if a read spans a given position:

if ((startpos <= querypos) and (querypos <= endpos)):
print "This read spans the query position!"



Here we do something slightly smarter which is to initialize the coverage array with all zeros, and then process each read to increment the coverage array at each position in the read. This requires O(G + N*L) where L is the average length of the read. Since L << G, this can be a huge advantage.

In [9]:
print "Brute force computing coverage over %d bp" % (totallen)

starttime = time.time()
brutecov = [0] * totallen

# print " -- [%d, %d]" % (r[1], r[2])

for i in xrange(r[1], r[2]):
brutecov[i] += 1

brutetime = (time.time() - starttime) * 1000.0

print "  Brute force complete in %0.02f ms" % (brutetime)
print brutecov[0:10]

Brute force computing coverage over 973898 bp
Brute force complete in 4435.00 ms
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Notice that it took 4435 ms for this to complete

## 5. Compute the max coverage¶

With the coverage profile, it is trivial to compute the max coverage or other simple queries

In [11]:
maxcov = 0
lowcov = 0
LOW_COV_THRESHOLD = 10

for c in brutecov:
if c > maxcov:
maxcov = c
if c <= LOW_COV_THRESHOLD:
lowcov += 1

print "max cov: %d, there are %d low coverage bases (<= %d depth)" % (maxcov, lowcov, LOW_COV_THRESHOLD)

print "\n\n"

max cov: 49, there are 83209 low coverage bases (<= 10 depth)



## 6. Plot the read depth histogram¶

In [12]:
  plt.figure()
plt.hist(brutecov)
plt.show()


## 7. Delta encode the coverage profile¶

Imagine the coverage profile looks like this:

             ##                                    #####         ###
####                               ############### #########
############  #########  ###       ################### ###########
###############################   ###################### #############

cov: 11222234432222112222222221122210001112222333344444333333033444333322110

pos: 00000000001111111111222222222233333333334444444444555555555566666666667
01234567890123456789012345678901234567890123456789012345678901234567890



Notice that most consecutive positions have the same depth. This means we can recover exactly this plot if we just record the transitions

              #                                     #             #
## #                               #   #    #      # #  #
#   ## ##     #          #         #   #   #    #      # #  #   #
# #   ## ##   # #        # #  #   #  #   #   #    #      # #  #   # #

cov: 1 2   34 32   1 2        1 2  10  1  2   3   4    3      3 4  3   2 1 0

pos: 0 0   00 01   1 1        2 2  33  3  3   4   4    5     55 5  6   6 6 7
0 2   67 90   4 6        5 7  01  4  7   1   5    0     67 9  2   6 8 0



Here, instead of 70 depth values, we can record just 24 without loss of information!

In [13]:
print "Delta encoding coverage plot"

starttime = time.time()

deltacov = []
curcov = -1
for i in xrange(0, len(brutecov)):
if brutecov[i] != curcov:
curcov = brutecov[i]
delta = (i, curcov)
deltacov.append(delta)

## Finish up with the last position
deltacov.append((totallen, 0))

print "Delta encoding required only %d steps, saving %0.02f%% of the space in %0.02f ms" % (len(deltacov), (100.0*float(totallen-len(deltacov))/totallen), (time.time()-starttime) * 1000.0)

for i in xrange(3):
print "  %d: [%d,%d]" % (i, deltacov[i][0], deltacov[i][1])

print "  ..."

for i in xrange(len(deltacov)-3, len(deltacov)):
print "  %d: [%d,%d]" % (i, deltacov[i][0], deltacov[i][1])

print "\n\n"

Delta encoding coverage plot
Delta encoding required only 3697 steps, saving 99.62% of the space in 151.32 ms
0: [0,1]
1: [799,2]
2: [1844,3]
...
3694: [973770,2]
3695: [973779,1]
3696: [973898,0]



## 8. Plot the coverage profile¶

In theory you could plot the raw coverage profile, but python doesnt like to draw thousands and thousands of little lines. But it does fine for this representation.

In [35]:
  plt.figure(figsize=(15,4), dpi=100)

print "Plotting coverage profile"

## expand the coverage profile by this amount so that it is easier to see
YSCALE = 5

## draw the layout of reads
start  = r[1]
end    = r[2]
rc     = r[3]
color  = "blue"

if (rc == 1):
color = "red"

plt.plot ([start,end], [-2*i, -2*i], lw=4, color=color)

## draw the base of the coverage plot
plt.plot([0, totallen], [0,0], color="black")

## draw the coverage plot
for i in xrange(len(deltacov)-1):
x1 = deltacov[i][0]
x2 = deltacov[i+1][0]
y1  = YSCALE*deltacov[i][1]
y2  = YSCALE*deltacov[i+1][1]

## draw the horizonal line
plt.plot([x1, x2], [y1, y1], color="black")

## and now the right vertical to the new coverage level
plt.plot([x2, x2], [y1, y2], color="black")

plt.draw()
plt.show()

Plotting coverage profile


# Part II: Plane-sweep analysis

## 1. Plane-sweep coverage¶

Notice the coverage only changes at the beginning or end of a read so just need to walk from read to read and keep track of depth along the way

Returning to our example again:

pos:    1    10    20   30   40   50    60    70    80     90
r1:     [==========================]
r2:          [================]
r3:                 [================================]
r4:                      [=====================]
r5:                                     [===================]

cov:    11111222222233333444444333332222333333332222221111111



We can imagine scanning across this from left to right to keep track of how many reads span each position. This is a widely used approach in computational geometry called a plane-sweep algorithm.

The basic technique follows like this:

• Assume layout is in sorted order by start position (or explicitly sort by start position)
• walking from start position to start position
• check to see if we past any read ends
• coverage goes down by one when a read ends
• coverage goes up by one when new read is encountered
• use a list/deque/heap to track how many reads currently intersect the plane
• the number of elements in the list corresponds to the current depth

In this example, the plane sweeps like this:

arrive at 1

arrive at 10
10 (add 40): 40, 50 <- notice insert out of order

arrive at 20
20 (add 80): 40, 50, 80

arrive at 30
30 (add 70): 40, 50, 70, 80 <- out of order again

arrive at 60
40 (sub 40): 50, 70, 80
50 (sub 50): 70, 80
60 (add 90): 70, 80, 90

no more starts, flush the end points
70 (sub 70): 80, 90
80 (sub 80): 90
90 (sub 90): 0
In [34]:
print "Beginning list-based plane sweep over %d reads" % (len(reads))

starttime = time.time()

## record the delta encoded depth using a plane sweep
deltacovplane = []

## use a list to record the end positions of the elements currently in plane
planelist = []

## BEGIN SWEEP
startpos = r[1]
endpos   = r[2]

## clear out any positions from the plane that we have already moved past
while (len(planelist) > 0):

if (planelist[0] <= startpos):
## the coverage steps down, extract it from the front of the list
oldend = planelist.pop(0)
deltacovplane.append((oldend, len(planelist)))
else:
break

## Now insert the current endpos into the correct position into the list
insertpos = -1
for i in xrange(len(planelist)):
if (endpos < planelist[i]):
insertpos = i
break

if (insertpos > 0):
planelist.insert(insertpos, endpos)
else:
planelist.append(endpos)

## Finally record that the coverage has increased
deltacovplane.append((startpos, len(planelist)))

## Flush any remaining end positions
while (len(planelist) > 0):
oldend = planelist.pop(0)
deltacovplane.append((oldend, len(planelist)))

## Report statistics
planelisttime = (time.time() - starttime) * 1000.0
print "Plane sweep found %d steps, saving %0.02f%% of the space in %0.2f ms (%0.02f speedup)!" % (len(deltacovplane), (100.0*float(totallen-len(deltacovplane))/totallen), planelisttime, brutetime/planelisttime)

for i in xrange(3):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "  ..."

for i in xrange(len(deltacovplane)-3, len(deltacovplane)):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "\n\n"

Beginning list-based plane sweep over 1873 reads
Plane sweep found 3746 steps, saving 99.62% of the space in 48.90 ms (90.69 speedup)!
0: [0,1]
1: [799,2]
2: [1844,3]
...
3743: [973770,2]
3744: [973779,1]
3745: [973898,0]



## 2. Checking for errors¶

The above implementation is flawed because there may be multiple events (read starts and/or read ends) at the same position. This will leading to multiple entries at the same position, although there should only be one

In [17]:
print "Checking for duplicates:"

lastpos = -1
for i in xrange(len(deltacovplane)):
if deltacovplane[i][0] == lastpos:
print "  Found duplicate: %d: [%d,%d] == %d: [%d,%d]" % (i-1, deltacovplane[i-1][0], deltacovplane[i-1][1], i, deltacovplane[i][0], deltacovplane[i][1])
lastpos = deltacovplane[i][0]

print "\n\n"

Checking for duplicates:
Found duplicate: 45: [20763,38] == 46: [20763,39]
Found duplicate: 119: [36471,22] == 120: [36471,21]
Found duplicate: 158: [46334,37] == 159: [46334,36]
Found duplicate: 438: [99894,27] == 439: [99894,28]
Found duplicate: 459: [102773,30] == 460: [102773,31]
Found duplicate: 527: [118454,16] == 528: [118454,15]
Found duplicate: 529: [118455,14] == 530: [118455,13]
Found duplicate: 634: [140597,33] == 635: [140597,32]
Found duplicate: 650: [142846,29] == 651: [142846,30]
Found duplicate: 680: [148015,37] == 681: [148015,38]
Found duplicate: 734: [156928,39] == 735: [156928,40]
Found duplicate: 736: [157049,39] == 737: [157049,38]
Found duplicate: 776: [163850,35] == 777: [163850,34]
Found duplicate: 893: [192138,22] == 894: [192138,23]
Found duplicate: 925: [200780,26] == 926: [200780,27]
Found duplicate: 953: [207176,24] == 954: [207176,23]
Found duplicate: 956: [207608,21] == 957: [207608,20]
Found duplicate: 1183: [263441,28] == 1184: [263441,29]
Found duplicate: 1228: [272371,21] == 1229: [272371,20]
Found duplicate: 1274: [284943,19] == 1275: [284943,18]
Found duplicate: 1404: [321683,31] == 1405: [321683,32]
Found duplicate: 1451: [332607,12] == 1452: [332607,11]
Found duplicate: 1625: [379379,32] == 1626: [379379,33]
Found duplicate: 1679: [388495,34] == 1680: [388495,33]
Found duplicate: 1682: [388799,33] == 1683: [388799,34]
Found duplicate: 1781: [406457,30] == 1782: [406457,31]
Found duplicate: 1865: [421769,28] == 1866: [421769,27]
Found duplicate: 1957: [442758,22] == 1958: [442758,23]
Found duplicate: 1983: [448441,36] == 1984: [448441,37]
Found duplicate: 2031: [455780,36] == 2032: [455780,37]
Found duplicate: 2054: [459807,39] == 2055: [459807,38]
Found duplicate: 2101: [472150,8] == 2102: [472150,7]
Found duplicate: 2164: [493322,25] == 2165: [493322,26]
Found duplicate: 2196: [499862,21] == 2197: [499862,20]
Found duplicate: 2212: [504669,23] == 2213: [504669,24]
Found duplicate: 2233: [511527,22] == 2234: [511527,21]
Found duplicate: 2422: [598503,23] == 2423: [598503,24]
Found duplicate: 2580: [637735,23] == 2581: [637735,24]
Found duplicate: 2633: [648890,26] == 2634: [648890,27]
Found duplicate: 2640: [650223,27] == 2641: [650223,26]
Found duplicate: 2668: [656130,17] == 2669: [656130,16]
Found duplicate: 2722: [674166,25] == 2723: [674166,26]
Found duplicate: 2784: [688606,27] == 2785: [688606,26]
Found duplicate: 2909: [715460,36] == 2910: [715460,37]
Found duplicate: 3216: [826087,15] == 3217: [826087,16]
Found duplicate: 3328: [852933,31] == 3329: [852933,30]
Found duplicate: 3459: [890989,8] == 3460: [890989,7]
Found duplicate: 3642: [950674,15] == 3643: [950674,16]



## 3. Refine the plane-sweep update rule¶

Lets fix the implementation to check for multiple events at the same position by peeking ahead to check for multiple ends or multiple starts at the exact same position

Strictly speaking this may give slightly different results than the brute force coverage profile since this will create an event if a read ends at exactly the same place another begins, whereas the brute force approach will link together those regions because they have the same coverage

In [18]:
print "Beginning correct list-based plane sweep over %d reads" % (len(reads))

starttime = time.time()

## record the delta encoded depth using a plane sweep
deltacovplane = []

## use a list to record the end positions of the elements currently in plane
planelist = []

## BEGIN SWEEP (note change to index based so can peek ahead)
startpos = r[1]
endpos   = r[2]

## clear out any positions from the plane that we have already moved past
while (len(planelist) > 0):

if (planelist[0] <= startpos):
## the coverage steps down, extract it from the front of the list
oldend = planelist.pop(0)

nextend = -1
if (len(planelist) > 0):
nextend = planelist[0]

## only record this transition if it is not the same as a start pos
## and only if not the same as the next end point
if ((oldend != startpos) and (oldend != nextend)):
deltacovplane.append((oldend, len(planelist)))
else:
break

## Now insert the current endpos into the correct position into the list
insertpos = -1
for i in xrange(len(planelist)):
if (endpos < planelist[i]):
insertpos = i
break

if (insertpos > 0):
planelist.insert(insertpos, endpos)
else:
planelist.append(endpos)

## Finally record that the coverage has increased
## But make sure the current read does not start at the same position as the next
deltacovplane.append((startpos, len(planelist)))

## if it is at the same place, it will get reported in the next cycle

## Flush any remaining end positions
while (len(planelist) > 0):
oldend = planelist.pop(0)
deltacovplane.append((oldend, len(planelist)))

## Report statistics
planelisttime = (time.time() - starttime) * 1000.0
print "Plane sweep found %d steps, saving %0.02f%% of the space in %0.2f ms (%0.02f speedup)!" % (len(deltacovplane), (100.0*float(totallen-len(deltacovplane))/totallen), planelisttime, brutetime/planelisttime)

for i in xrange(3):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "  ..."

for i in xrange(len(deltacovplane)-3, len(deltacovplane)):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "\n\n"

print "Checking for duplicates in corrected version:"

lastpos = -1
for i in xrange(len(deltacovplane)):
if deltacovplane[i][0] == lastpos:
print "  Found duplicate: %d: [%d,%d] == %d: [%d,%d]" % (i-1, deltacovplane[i-1][0], deltacovplane[i-1][1], i, deltacovplane[i][0], deltacovplane[i][1])
lastpos = deltacovplane[i][0]

print "\n\n"

Beginning correct list-based plane sweep over 1873 reads
Plane sweep found 3698 steps, saving 99.62% of the space in 15.03 ms (295.17 speedup)!
0: [0,1]
1: [799,2]
2: [1844,3]
...
3695: [973770,2]
3696: [973779,1]
3697: [973898,0]

Checking for duplicates in corrected version:



## 4. Heap-based plane-sweep algorithm¶

Lists are bad for maintaining the plane because:

• removing from the front requires shifting everything over: O(N) time
• inserting into the middle (or end) also requires shifting and possibly resizing: O(N)

This can result in a worst case runtime of O(N^2) because for each read we may need to shift the list by all N positions. Fortunately, with real datasets the maximum length of the list is bounded by the maximum read depth (D), so will result in an O(N*D) runtime

If the reads (intervals) were all the same length, we would always append to the end and remove from the very front. A good data structure for this is called a deque and enables O(1) remove-from-front and O(1) append-to-end operations _ appendleft -> | | | | | | | | <-- append popleft <- |||||||___| --> pop

Deques are implemented as a doublely-linked-list, and can be used from the deque collection API

from collections import deque
d = deque()
d.append()  <- adds to the end
d.popleft() <- removes from the front



this estalishes a "queue" for the end positions and accelerates to O(N) runtime!

## Plane-sweep with a heap¶

However, in general, the intervals will not be of uniform length so this will not be sufficient. The data structure we really want should support very fast (O(1)) ability to find the minimum item of the plane, "fast" ability to remove it, and "fast" ability to add a new item to the plane

A good data structure for this is called a heap (aka min-heap, aka priority queue), and more specifically a binary heap

Binary heaps are complete binary trees such that the parent is smaller than both of the children. By construction, the height of the tree is completely balanced, except for the bottommost level which may be incomplete. This is made possible because the the relative ordering of the left and right children is arbitrary. (Note it is possible to maintain balanced binary trees in other ways, but wont support constant time look up of minimum values). In a heap it is trival to find the minimum value, since by construction it is always the top of the heap!

Removing an element

Removing the minimum is a bit more compplex and requires O(log n) steps. The first step is to shift a leaf value into the root position. If that value happens to be smaller than the children, then done. Otherwise, interatively swap it with the smaller of its two children. In the worst case, it will be reswapped down to the botton of the heap again in O(log n) steps. Note the deletions are always picked from the leaf nodes so the balance of tree will be guaranteed.

Here is a nice tutorial on removing an element

Inserting into a heap

Adding a new value into a heap is fast: the value is added as the first available leaf node and then repeatedly pushed up the heap if its value is smaller than its parents. In the worst case this could require O(log n) up-heap operations, but in average can be done in O(1) time.

Note if the values to be inserted are in sorted order, the values are easily added in O(1) time

Here is a nice tutorial on inserting an element

Representing a heap We can store a heap by embedding the tree inside of an array so that the children at node k are indexed at nodes 2k+1 and 2k+2. This is useful because then we dont have to incure the storage costs of the tree pointers. Up-heap or down-heap operations are a simple matter of swapping elements in the list, and finding the next free leaf is just the first free cell at the end of the list.

For example, the above heap can be represented as

idx    0   1   2   3   4   5   6   7   8
val    1   2   3  17  19  36   7  25 100



The next free cell is 9, which is a child of 4 (2*4+1 = 9)

In python, these techniques are implemented in the heapq package. This restores us to O(N log N) worst case runtime. However, the time to maintain the heap is bounded by the size of it, so in practice the runtime will be much faster: O(N log D) where D is the maximum depth of coverage. For more genomics project, this is effectively a constant value around log(100) = 8.

In [19]:
print "Beginning heap-based plane sweep over %d reads" % (len(reads))

starttime = time.time()

## record the delta encoded depth using a plane sweep
deltacovplane = []

## use a list to record the end positions of the elements currently in plane
planeheap = []

## BEGIN SWEEP (note change to index based so can peek ahead)
startpos = r[1]
endpos   = r[2]

## clear out any positions from the plane that we have already moved past
while (len(planeheap) > 0):

if (planeheap[0] <= startpos):
## the coverage steps down, extract it from the front of the list
## oldend = planelist.pop(0)
oldend = heapq.heappop(planeheap)

nextend = -1
if (len(planeheap) > 0):
nextend = planeheap[0]

## only record this transition if it is not the same as a start pos
## and only if not the same as the next end point
if ((oldend != startpos) and (oldend != nextend)):
deltacovplane.append((oldend, len(planeheap)))
else:
break

## Now insert the current endpos into the correct position into the list

##insertpos = -1
##for i in xrange(len(planelist)):
##  if (endpos < planelist[i]):
##      insertpos = i
##      break

##if (insertpos > 0):
##  planelist.insert(insertpos, endpos)
##else:
##  planelist.append(endpos)

heapq.heappush(planeheap, endpos)

## Finally record that the coverage has increased
## But make sure the current read does not start at the same position as the next
deltacovplane.append((startpos, len(planeheap)))

## if it is at the same place, it will get reported in the next cycle

## Flush any remaining end positions
while (len(planeheap) > 0):
##oldend = planelist.pop(0)
oldend = heapq.heappop(planeheap)
deltacovplane.append((oldend, len(planeheap)))

## Report statistics
planeheaptime = (time.time()-starttime) * 1000.0
print "Heap-Plane sweep found %d steps, saving %0.02f%% of the space in %0.02f ms (%0.02f speedup)!" % (len(deltacovplane), (100.0*float(totallen-len(deltacovplane))/totallen), planeheaptime, brutetime/planeheaptime)

for i in xrange(3):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "  ..."

for i in xrange(len(deltacovplane)-3, len(deltacovplane)):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "\n\n"

print "Checking for duplicates in heap-ified version:"

lastpos = -1
for i in xrange(len(deltacovplane)):
if deltacovplane[i][0] == lastpos:
print "  Found duplicate: %d: [%d,%d] == %d: [%d,%d]" % (i-1, deltacovplane[i-1][0], deltacovplane[i-1][1], i, deltacovplane[i][0], deltacovplane[i][1])
lastpos = deltacovplane[i][0]

print "\n\n"

Beginning heap-based plane sweep over 1873 reads
Heap-Plane sweep found 3698 steps, saving 99.62% of the space in 14.26 ms (311.08 speedup)!
0: [0,1]
1: [799,2]
2: [1844,3]
...
3695: [973770,2]
3696: [973779,1]
3697: [973898,0]

Checking for duplicates in heap-ified version:



# The heap-based plane-sweep algorithm is 311 times faster than the brute force approach!

Now lets use this framework to figure out which reads span the position of max coverage as we are scanning the alignments. The technique is to update the heap to store the read end position as the primary key, but also associate it with the read id. Then whenever we see we have established a new maximum coverage level, we record the ids of the reads in the heap.

In [20]:
print "Beginning heap-based plane sweep over %d reads to search for max reads" % (len(reads))

starttime = time.time()

## record the delta encoded depth using a plane sweep
deltacovplane = []

## use a list to record the end positions of the elements currently in plane
planeheap = []

## use a list to record the ids at max coverage
maxcov = -1
maxcovpos = -1

## BEGIN SWEEP (note change to index based so can peek ahead)
startpos = r[1]
endpos   = r[2]

## clear out any positions from the plane that we have already moved past
while (len(planeheap) > 0):

if (planeheap[0][0] <= startpos):
## the coverage steps down, extract it from the front of the list
oldend = heapq.heappop(planeheap)[0]

nextend = -1
if (len(planeheap) > 0):
nextend = planeheap[0][0]

## only record this transition if it is not the same as a start pos
## and only if not the same as the next end point
if ((oldend != startpos) and (oldend != nextend)):
deltacovplane.append((oldend, len(planeheap)))
else:
break

## Now insert the current endpos into the correct position into the list

## Finally record that the coverage has increased
## But make sure the current read does not start at the same position as the next
cov = len(planeheap)
if (cov > maxcov):
maxcov = cov
maxcovpos = startpos
for rr in planeheap:

deltacovplane.append((startpos, len(planeheap)))

## Flush any remaining end positions
while (len(planeheap) > 0):
oldend = heapq.heappop(planeheap)[0]
deltacovplane.append((oldend, len(planeheap)))

## Report statistics
planeheaptime = (time.time()-starttime) * 1000.0
print "Heap-Plane sweep found %d steps, saving %0.02f%% of the space in %0.02f ms (%0.02f speedup)!" % (len(deltacovplane), (100.0*float(totallen-len(deltacovplane))/totallen), planeheaptime, brutetime/planeheaptime)

print "The %d reads at the position of maximum depth (%d) are:" % (maxcov, maxcovpos)

print "\n\n"

for i in xrange(3):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "  ..."

for i in xrange(len(deltacovplane)-3, len(deltacovplane)):
print "  %d: [%d,%d]" % (i, deltacovplane[i][0], deltacovplane[i][1])

print "\n\n"

print "Checking for duplicates in heap-ified version:"

lastpos = -1
for i in xrange(len(deltacovplane)):
if deltacovplane[i][0] == lastpos:
print "  Found duplicate: %d: [%d,%d] == %d: [%d,%d]" % (i-1, deltacovplane[i-1][0], deltacovplane[i-1][1], i, deltacovplane[i][0], deltacovplane[i][1])
lastpos = deltacovplane[i][0]

print "\n\n"

Beginning heap-based plane sweep over 1873 reads to search for max reads
Heap-Plane sweep found 3698 steps, saving 99.62% of the space in 14.36 ms (308.74 speedup)!
The 49 reads at the position of maximum depth (62132) are:

0: [0,1]
1: [799,2]
2: [1844,3]
...
3695: [973770,2]
3696: [973779,1]
3697: [973898,0]

Checking for duplicates in heap-ified version:



## 6. Plot the max depth position¶

In [33]:
  f = plt.figure(figsize=(15,4), dpi=100)

print "Plotting max depth\n\n"

lastend = -1
lastidx = -1

## draw the layout of reads
start  = r[1]
end    = r[2]
rc     = r[3]
color  = "blue"

if (rc == 1):
color = "red"

plt.plot ([start,end], [-2*i, -2*i], lw=4, color=color)
lastend = end
lastidx = i

## draw the base of the coverage plot
plt.plot([0, lastend], [0,0], color="black")

## draw the coverage profile
for i in xrange(len(deltacov)-1):
x1 = deltacov[i][0]
x2 = deltacov[i+1][0]
y1  = YSCALE * deltacov[i][1]
y2  = YSCALE * deltacov[i+1][1]

## draw the horizonal line
plt.plot([x1, x2], [y1, y1], color="black")

## and now the right vertical to the new coverage level
plt.plot([x2, x2], [y1, y2], color="black")
if (x2 >= lastend):
break

## draw a vertical bar with the max coverage
plt.plot([maxcovpos, maxcovpos], [YSCALE*maxcov, -(2*lastidx+20)], color="green", lw=2)
plt.draw()
plt.show()

Plotting max depth



## 7. Summary¶

In this example, the heap-based plane sweep algorithm was 300 to 400 times faster than the basic brute force approach on my laptop! The advantage of the heap-based approach over the list based approach will become even more significant with deeper coverage as log(D) and D` become more significant differences.

In the next session we will explore the related question of how to index the alignments to be able to quickly determine which reads span a particular location. Plane-sweep algorithms are also widely used to solve many interesting problems in computational geometry