In this recipe we look at how to find the longest run of a specified set of characters in a Sequence
object. In this particular case we'll work with DNA
sequences, and look for runs of purines and pyrimidines. In this context, I use the term run to mean a series of contiguous characters of interest (e.g., purines). We'll obtain the start and end positions of the longest stretch of various character sets.
First, we'll configure the environment:
from skbio import DNA
Next, let's load a single sequence from a fasta file. In this case, this is a short 16S sequence. In practice, you might be loading a whole genome here, but the process will be the same.
dna = DNA.read('data/single_sequence1.fasta', seq_num=1)
dna
To find the longest run of purines, we use DNA.find_motifs('purine-run')
to find all purine runs. These come back as python slice
objects, so we can find the start and stop positions of each purine run and then Python's max
function to find the longest. We can also slice our sequence object directly using the resulting slices.
purine_runs = list(dna.find_motifs('purine-run', min_length=2))
longest_purine = max(purine_runs, key=lambda x: x.stop - x.start)
dna[longest_purine]
longest_purine
To find the longest run of pyrimidines:
pyrimidine_runs = list(dna.find_motifs('pyrimidine-run', min_length=2))
longest_pyrimidine = max(pyrimidine_runs, key=lambda x: x.stop - x.start)
dna[longest_pyrimidine]
longest_pyrimidine
To find the longest run of some other character or pattern that don't have built-in support in scikit-bio's sequence objects, we can use find_with_regex
. For example, to find the longest stretch of T
characters:
t_runs = list(dna.find_with_regex('([T]+)'))
longest_t = max(t_runs, key=lambda x: x.stop - x.start)
dna[longest_t]
longest_t