dna_string = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
We will use a dictionary to keep track of nucleotide counts. Using dictionaries effectively is one of the keys to getting the most out of Python.
counts = {'A': 0,
'C': 0,
'G': 0,
'T': 0,
}
Iterating over a string produces the characters of the string one at a time.
for c in dna_string:
counts[c] += 1
List comprehensions are a powerful way to construct lists by peforming some operation on each of the elements in an object that can be iterated over.
count_strings = [str(counts[base]) for base in 'ACGT']
print ' '.join(count_strings)
20 12 17 21
dna_string = 'GATGGAACTTGACTACGTAAAT'
We can't directly change the characters in the string - strings are immutable.
(If you aren't familiar with the enumerate
built-in function, check out it's documentation.)
for i, c in enumerate(dna_string):
if c == 'T':
dna_string[i] = 'U'
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-7-e1e72f21c172> in <module>() 1 for i, c in enumerate(dna_string): 2 if c == 'T': ----> 3 dna_string[i] = 'U' TypeError: 'str' object does not support item assignment
We could build the string up incrementally ...
rna_string = ''
for c in dna_string:
if c == 'T':
rna_string += 'U'
else:
rna_string += c
print rna_string
GAUGGAACUUGACUACGUAAAU
... but string objects have a replace()
method that does exactly what we want.
rna_string_2 = dna_string.replace('T', 'U')
print rna_string == rna_string_2
True
dna_string = 'AAAACCCGGT'
One approach is to use a dictionary to encode how each nucleotide gets complemented and to use the built-in function reversed() to iterate over the DNA string in reversed order.
complement = {'T': 'A',
'C': 'G',
'A': 'T',
'G': 'C',
}
rc_string = ''
for c in reversed(dna_string):
rc_string += complement[c]
print rc_string
ACCGGGTTTT
It turns out that there are built-in functions in the string
module that do exactly what we want.
import string
complement_table = string.maketrans('TCAG', 'AGTC')
c_string = dna_string.translate(complement_table)
rc_string_2 = c_string[::-1]
print rc_string == rc_string_2
True
Why does [::-1] work?
up_to_ten = range(10)
Taking a slice a:b returns the interval of the list starting at index a and going up to but not including index b.
up_to_ten[2:4]
[2, 3]
Leaving off part of the slice means 'start from the beginning'...
up_to_ten[:4]
[0, 1, 2, 3]
... or 'go to the end.'
up_to_ten[4:]
[4, 5, 6, 7, 8, 9]
An optional third slice argument gives a stride length to step along the interval with.
up_to_ten[2:9:3]
[2, 5, 8]
Stride length can be negative....
up_to_ten[9:2:-2]
[9, 7, 5, 3]
... so this idiom means 'start at the end and go through to the beginning stepping backwards one index at a time', which is a complicated way to say 'reverse'.
up_to_ten[::-1]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
zip()
is useful when you have several lists that you want to process element-wise together.
(Questions: Why do you think this is called zip? What built-in function that we have seen is equivalent to zip(range(len(seq)), seq)?)
numbers = range(4)
letters = 'ABCD'
for n, l in zip(numbers, letters):
print n, l
0 A 1 B 2 C 3 D
Lets use zip
in a filtered list comprehension to produce a compact, simple Hamming distance function.
def hamming_distance(first, second):
''' Returns the number of positions that differ between first and second. '''
distance = sum(1 for f, s in zip(first, second) if f != s)
return distance
file_name = 'data/rosalind_hamm.txt'
fh = open(file_name)
first = fh.readline().strip()
second = fh.readline().strip()
print hamming_distance(first, second)
482
The itertools
module is full of all kinds of cool stuff.
import itertools
how_many = 3
perms = itertools.permutations(range(1, how_many + 1))
In general, it doesn't make sense to ask for the length of an iterator.
print len(perms)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-26-bb2d7ce08cd2> in <module>() ----> 1 print len(perms) TypeError: object of type 'itertools.permutations' has no len()
For iterators we know to be finite, we can 'cast' to a list.
list_perms = list(perms)
print len(list_perms)
6
To produce the desired output format, we need to convert lists (actually, tuples) of numbers to lists of strings. Let's use a list comprehension.
for perm in list_perms:
strings = [str(value) for value in perm]
print ' '.join(strings)
1 2 3 1 3 2 2 1 3 2 3 1 3 1 2 3 2 1
One way to find all occurrences of substring in string_to_search is to check if the substring is a prefix of each suffix of string_to_search.
def find_locations(substring, string_to_search):
''' Returns a list of all positions in string where substring occurs. '''
locations = []
for i in range(len(string_to_search)):
if string_to_search[i:i + len(substring)] == substring:
locations.append(i)
return locations
file_name = 'data/rosalind_subs.txt'
fh = open(file_name)
string_to_search = fh.readline().strip()
substring = fh.readline().strip()
one_based = [str(l + 1) for l in find_locations(substring, string_to_search)]
print ' '.join(one_based)
2 63 74 92 99 126 143 158 166 174 199 253 315 332 339 363 403 410 426 433 500 526 535 542 582 589 661 733 752 768 775 808 815 822 872 950
When you are searching for matches to a pattern in a string, you should think 'regular expressions'. (Of course, there is this.)
import re
re.finditer(pattern, string_to_search)
will produce (an iterator over) objects representing matches of pattern in string_to_search. These match objects have a start()
method that gives the location of the match.
pattern = substring
for match in re.finditer(pattern, string_to_search):
print match.start(),
1 62 73 91 125 142 157 173 198 252 314 331 362 402 425 499 525 534 581 660 732 751 767 807 821 871 949
A gotcha here is that overlapping pattern matches will not be found. (By default, the re engine 'consumes' the string as it moves through, so regions that are part of one match never get looked at again to be part of a potential second match.)
matches = re.finditer(pattern, string_to_search)
re_one_based = [str(m.start() + 1) for m in matches]
print re_one_based == one_based
False
Adding a (cryptic to the uninitiated) lookahead flag to the re pattern changes this behavior.
pattern = '(?={0})'.format(substring)
matches = re.finditer(pattern, string_to_search)
re_one_based_2 = [str(m.start() + 1) for m in matches]
print re_one_based_2 == one_based
True
One of the strengths of Python is the existence of high-quality modules for solving common problems in different domains, such as Biopython.
from Bio.Seq import Seq
file_name = 'data/rosalind_prot.txt'
fh = open(file_name)
dna_string = fh.readline().strip()
Make a Seq object out of dna_string.
dna_seq = Seq(dna_string)
Translate to amino acid sequence, with Biopython taking care of knowing the genetic code.
aa_seq = dna_seq.translate(to_stop=True)
Convert from a Seq object to a string.
aa_string = str(aa_seq)
print aa_string
MGMREIQTLADLSSVLGIRYRVVRRRSCQRVIKTWVSYVIRTFHPHRVGLAGLEGRSLGTHMVGANGVMLTVTTLTSLGAIAAISSRPSQSIRGSVFSPRTIYGIQAYQHIHSPAYGKTWPSRERNSRTSSGQNCAAMLLELQHPRLVTCRSRNMFLGTACLPLSGPDSVSIEIMALVEFGGVSWKRIALSVYSRMGAGSTWALTQITKHIDTGTRLAGAHCLCWARGYKFVSGPTPPANWTLEYHSPGTSYNSTSCKRGMITRRMSWHLFGSLPNLSCKIHCCSRCDHWERSIRATCFRWSQCKAQECSDIDVDHICFAARGSRKTSLASWIVGYQWVVVLMETGALKAGLRPRAMRIQSPSSVSSFGTRVAHFRSRGPIGQPSSAESMIVAYWTIAYKLCRAPYLVVERLSNWSMTVYTFRLYKTNHGIPSNIKTPSRTARAFPTWKIPTDPVTAARFGGSRCNTKPKSPRVFEADIDRCSYSCDLKAQYQLSPDEITCLLSLVISVKHPRRRLLQSAFYCPMVGLGRSLKRGCQPFTACNDWTYRPSVKGSLACCAVLLQVSSVGRFIFPKRPKTIFLGERWPRARIISCRTIKAMKQYILIRRPTVAPEVCLRTSDPMHPLRLQPRGSCEYMLYSHLSRSRKARHAIETPRAPITKLLGSRTLPPQRVEVPIDSLVPLSPLQYGRQLWVFRATVWRGFLKTRLSQDLIVQSKMQNFYMLMSLQSTPEKNRPNPKNIRKFRTQCTSLLYCRNQHLRISFPRDGLVAVRGVCYLNRPRSLKVHRYRANIVRFKQFGVITVRYPSVTTGGGEKWSPDGRYMITRVNHRQMRARLPLNDRAHLLYPLGAELIFWYVGAARIHQMADDRNHMASTGLGTNCVREALAILARGRCMPRSHATFTSVRLHLTLTVPESHLCWKELLRAFSLATRNLQLVASIKPSTKHSHTINNLRRWVWLTRWMESYLPLYLVLRRCRVTLLLLLILTLSGQQIADGSDRGNHFTTLLHLYLDTKNHGKVPAVEPSSIIPLHSKPTHSGLLMSQYHRVRSCPVISRYAYCGRQPRVHNVIARNVFRPKVSEASYVKNTSTLRHSEYSNSCGLSALIDQNIFEAIYEGNVNTCENARSVKRRVISVILPYRLLMYICRAEAATPFLLRVNTRPDGDGANININKFVATSNIEAPDYKVNDKAATVGDAEVAIGCLELWSASTFCAVIGGTQCAILLMISDCALHISVRVPLSRAEHRCIGVTGGAHVDLPFARGIRHAGWHHARVDRGYDGVTLHFRQIIRKDFGSAGIIRVDGSHDFVSKDSYTNGVYSKNFTSTDPLNEITYDTERCYRVSITGLRLTYQCRFGSQVQMQTPLTNATIELYSRVNEPRTYDRFSRKLASNYIGKLVASWPNTAVGASRNTRRLGRTVASWGSESVEQTVLSCLLSLSVLQLIESARKPSVPKCYELRALSPMTCALHTSVITATVLDVNPQRVKVWLPGFLQSDARYDVVGPPFVKRSDGKSCPLRGPYPPAISDLSPWPLHPEPLFGTRTGTQPSMYSKRVFLLPTLRVGTSQADVTRMTAYCTLRLAHPTFTGASEQVQDTIPVIRSNGAHFLKAPYPLGHKFRCTSTASAGWHVDLITASGSFTTYVRLNGSTLTIVWRSPWSTHFDKASFSTVVLLRCSFSYCGSPRGANRGLALHQRYLLPWRQGAKKRILLFVRQLCTAMNFSFIETVRLEPNYSKHATVSTAGPARTCCYVFRANAKTGLVELCSPIAYPENLIGVLELKCFRHSPSSARACQPLPVAWRPHNITEPQTIYGYTQFANFSRYSPEYSTFTGVMASLYSPGSAPIAMREEICQLALADAVGPRKSPWLGQLTPLHSYVTSTSMSTDGRWRGPRQSSLCAPGARASQPTTLWAYPSSKPLRRRQIALTRQRYKSGQTMACVLLCLAIFIGLRCNQPKINSISLVTFYVSSVILHYTHGPVLGSRFFHLSGYHAMRPRDACSIKMICGTGPSPCLRVANNTLAVVTHHIYILALDLPSSRSCLTTGRDLRATLVYVYAFPFPNGAFRGGWLLRFNPLLKPSRLYREGYIHESELGPMHASGSGGTSCPIEAGVCIDRIPQERLIQVRLLMLFVLAASLTLGAQNIAIQRARSVVYAIPLYVHKRLRLGSMLHDSTLALGYLKTRFSVFRGRRLAAASIPDLAQLTDDVRLLPHVKQSTCINDDVLQPKSKSVSGITKRIELGPNKKGRTPGVNTLKQLVGDTRGVLRKTRGLSAVSTHPTRSSNCGVYDEGKPGGTSNNLLYKTILATALLGQLVSIAYMCLLCRGFKLANITERTSGSTYCLPVIASTTPAFQPTGNRTLGSTKPTRKGELLGGMPCFCNYRPSSRSRVTTLIRVSYVCSCEINKVGTSPNLPPERLASSYFGSRSLVGLIGVAIVESSDCSRVRKLLFVLCKQPIKPLFDKTPAEPLRAVVTSLCYASIQELPPGLRLATNIRKNAILGPICIRSDHLRSGLVLQNQITRTLATILALTQWEFNPARWLPEPRSDGDDLAASPKAYDPQLCNFFDLRRRISRTCKELSNGCGSFSSIIVLAGLIVSKSYVKEQRSLSGTRIFLSFEAPAVCLMLQIRKKWINGDAMEREAPVTSCIKQKRIINAFSFAVLQQSVLSAPAPKCHHYNRCPFESSPFNAVSVSLGHSRTPYNPCSQQAGDVMTFLEKSCDQVVGIVVRSQYCIATNHVESRCGWKVPSEAWLLNKLLRSISGCTGNNARNTAWSMTRTEVHSWKWQYANGGLRGEGMKHFRQWGHLHMLGFVRDRRSEHYLVPNNPSTCDWPQSSSRAIGELHIFGVSAVRTSGLLFSPLCSGDSGSNLEGALNEYTDHAFKPVTVLTSYKGPRHTRSDRWLHDVITTAIRRRCQFTAELSPRFNAAPHMGRTCVFRCVLPTSNASLNPPMSTRHRSGFQSYNVKIRSSGTYSPGRVESPVCLWQIPLRALTNSPAPDIMFILGKQHLFALELVLRPTNEAKDRDRISYSSPAPALLVTSGGSPVSYNTWPIMNSRCSEFASTVRLTFYLSHRRSRETRVTSIIQVPKSSQCALTERTVDRTVGREAISVRQLLSHSATAVFLIKTANQAPLIRGQPSCFNGFDASHVHHWHTITRRPLQLLTGGNLYIKRHHHLWACKVSVDFWIGCPVPGAIPDVFPGGALGSEVRLGARRFVINPPRPGPLHNPQLNLGSRLDLYCGSRRVLLEKGDGRRTV
Parsing a FASTA file is a little messy. After each line that starts with a >
, we need to read an indeterminant amount of lines until we encounter the next one that starts with a >
.
This function takes in a file name and return a list of tuples of (name, sequence).
def fasta_records(file_name):
''' Returns a list of (name, seq) pairs for fasta records in file_name. '''
records = []
fh = open(file_name)
name = fh.readline().strip().lstrip('>')
while name:
seq = ''
line = fh.readline().strip()
while line and not line.startswith('>'):
seq += line
line = fh.readline().strip()
records.append((name, seq))
name = line.lstrip('>')
return records
file_name = 'data/rosalind_gc.txt'
records = fasta_records(file_name)
Argument unpacking makes code more readable.
for name, seq in records:
print name
print seq
print
Rosalind_8336 GCTCACCCGTGTCGCCCGAACATGCTTAACTAACTTGGACAGGTTTGAAATAGCACTTGAAGAAAACGTTTCAGCGATGACCAGACGAGGATACGATCGCTCCTCGGAAGCCCTCTAAGCTCGTCGCAGTGAACAAGACTCTAAGAGTCTTCAGACGCTTGGCGCCGGTAGGTGGTTCTGTGCACTACTCCACGTCGTCCGGTCGAGTAATAAGCAGGGATCTTTCGGCGTTGTACTAACAGCTCATAACTCCAAAGCCAAAAACTCAACTGGGAAAGCGTAGTGTCAGATGACTGTACACTCACACAAGTACGGTCGGCAGGGGCACCGTTACGCCCCAGACCGTTGCGTATTTCTGCAGTCCGGCGTAAGTAGATTGTCCGTGGGCTTACTCTTGGAATTATATCATGTCTCTGGCTTGGAATTGGCGTTACATTGTGGGGACCCGCTGCGGCCGTCCCTATTCATATCATTTTTGAGCATAATGATTTTTTACAACAAAAATGTAGCACGGAATGCCGAATTTAAGTCAGTTTACATCAGTGAAGCGGTTTGCCCAACGGCCAGACTGCGTGGGGCACAGTTGGCCTATGTAGGACCACTCGTGCGCATTAGAACTTCAACCCGTTGACTACGCCGGAGCGGGACAACGTAATTCATTGATCACTGGACCCGACAGTACGGGTGGACACAATGTGTCAATGTAGGCCAGCATAGTTAAACATCGAGCCCCGTGTGACAGAAGTTTGAAGCGATGTAGTGATGGTCTTCACTTGTCATGACGGTGGGGCCTCGGGAAAGTTGGCATCCCTACAGCGTTTCAGATGCGGTCCAATGCTAGCGGACAATTCTAGCACGTTCTAGGCTGCCGCTAGGTCAACCCGCCACATGATCCACTGAAAGT Rosalind_6236 TATGCAAATTTATGGCGCGGCCATTGAGTAAGACTTCAGCAGCACCTGATGTTCGCATCTAAAACCTGTTAGGTACTGTATTGCAATCGAATATTGTGACCCGGTCCACACGCACGCATACTTGTGCTATGAACAATATCAACTACCCCACACAAGCGTTGGCGGTTGGGTACTCGAAACTATAAGATTCGCCCATGTGTGTAAAGGCGTTGTCATCTCACTGAAACCTTGCCTCGAGTCCTAGAGAAACTGTTTAGAATGAACCGCTATCGGCCAGCGATGCCCCTCCTCTGTCTGGCCGTAATTGGGGCCTAGACTCCGATGTATCTGATGTATGCACGGAAAAAGCTCCACGCAGGTTACGGAGGGTGCGCTAATAATGGTCACAGCGGGTCGATGGGGTTACGTTATACATTCCCAAGCAGTTCGACATCCCGCAGATCGAATACACACTGTCATGCAGCGTCACTGCCCACCTGCACCGCCGGTTAAACAAGAATTAGTTCCGGTCCTCCGCCTGGGTGATGGGTAACTTCGGCATGAGGTTACAACCACGAGATCGACAAATGCCTCACCGGAGGAGATTCGAACCGCGTGGGATGTTCGAATTGGCGCCCCGCATAGTTCCGTTCCAATATCATCTTTTGGCCACTGATCCCTGCGGGTTCTTGCCGGAGCCGGTTGAAATGTGGGCGCGACCTGGCAACATTTGAATGGCAGATCCAGACCACAACTGTAGGAAACCCTCCTATTCTAGTCTGGATTCAACTACCGTGCACCACAGTAGTTGGACTGAACGGACAGCTTCCGTCTGGGCCTTTCCTTCAAGGGTTGCGAATTAACGCCCTTCCCCTCTTCACTAGTCAGCGTTCGCGGACCCTAACGAAGACGAGAGATCATCATTGGTGTACTAGCGTC Rosalind_9301 GATATGAGGTGACCTTTCGGGTATCGAACAGCGAAACAACGGTGGCATAAGCTCTCTCTGCCTTTGATATTTCCCAAGTGCGTGTCTGTGTGGATGAGCCTTCGACCAGTCTACCACTCAACAGTGAGTTAAAAGCCAAATTACAAATTTTGCGGTTTCCCTTCTTGGTGCGCATGCCTATCAAGCTGCCGTACTTTAAGAGACGGCAAACCTCACAGTACGCCAAGGGGTTATTCGTCTTGTCCGCGGATAGGATATGTCAGCGCGCGATAACGAGCTCATCCAAAGATTCGCGTAAGCATTGCCAGGTCGTAAGGGCTTAGCAGTCGAACTTTATTAGTATCCAAATCAGTCAAGGAGACGGATTCCCTTAACGCCTACCTGATAAACTACCTAGCCCTGTTACGCCTGCGAGAGATACCCGCGGCGAGAGTCGGATGCTCGATGCCGGTAGCCTGTCTAACGGTATACTAGTAAAATCGGGTGTCAGCAAATGTAGGTTACACACGCGTTTAAGTAACGCTGACAAAGAACCTGTCCGTGCTTCTTGCCAACACTATCTTCATCGTATTCTACTTAAGCGTGCACACGGGTCCTCTTAGACCTAGTTATTTGTGTGAAGTAGACCGCTGGGCGAGTAAGCCGGTTTATTCTCCGCAGACTTGGCCAACTCGTAGTAGATCTCACAGAATCCCATCAGGTATAACACAGGCAACTGGGCGGCGGCTCAAGAAGCCGGGATAACTTAGGGGGCTATTCGGAGCTGTCTTGGGGAGCTGGTATCGAGTAGCGAGTTAATAATACTTGCAAA Rosalind_6955 TGACTTATCAAAGACCTAGGAAGTTTCAGGCCTCTCTTAGTATGCCACCCACGCTGGACGGCGCTGAATGCCTCAACCAGAACGACCCGTAAACAAAAACCGGAATCGGGCTGGGATGTTCTAGGCGGTAAGGTGCGATGCTGTAGCTTATTTCGCACACTCCCCTAATTCATAAGAAACGTAACGCCGTAGCCTAATTTACTCCGATCGAAACGCATGAGTCGTGCGTTTGGGGGAGCAAAGATCATCATTGAGGCGATGGGCCGCCAAGTAAGCATGGTGGGTGTGGGTCTTAGCGGCTGCTTGGGGCGTGGACGTGTCCGCGCTATACCCGAAATTTTGCGGCAAGCTCGGCCTGGGAAGTGAGCTCCAGCGAACGTAGTATCTGGTGCGGACAGACTTAGCAGTATCAAAAAACGTCGGTCCCATCTTACATAGGATCTTTAACTGTATGTGCCATCGCATGCGTAACTTCAGTCGCTCGGTTCCGACGGAGCGGACAGGTCCGTGAACGGTGACCCCATGCAAGCCAGAACTCCAGGGATCGCCGGCGGGCGTCTGAAGGAACAAACACCAATGTAACATGCTTAACCCGCGCGATGTTTAAAACTCTGCACCATGAGCTTAGATACTGGTGACCATTCCGAGCCGCCCTGATCCGAATCTCTTGTAAACTGTCGGAGGGTACTATTGGGGCATGTGTGGAGGACCGAGATGTACGAATGGATATAGCTTAACGCTGCTGCACGAAATCGACTGTTACGTCTCACCTGTTCGGCCTTAGCGTCAGACCTGAACCCAGAGGTGCGCGTTACATTGAGGACCGGTACGAAATTGGCTCTGATTATAAGTCATCGTAAGAGGATTACAAGCGCATCTGCGTGGCTCTCGTGCCGTTCGTTTTGGAACACCTCCGATTGGAGGTTATCAGTCAGCCCATCAATATCAGTGCTAACCAGCTCGTGGCGCCATGCCCGTTACCCGAAGATTCAACC Rosalind_0896 GCTGCAAAACGTACTCAACAGAATAGTAACCAGCAATCCGGATTCGCATGCTAAGTAGTATCAGGAGCCACGATTATTGCCTGCGCTTGCGTATGGCAGCCAGTTGTGAAGCGGTGAAGGGATATGATAGTAAGAAATCTTTAGAACCAGAGTGCATGACATGTCCGACCCTTTTGGGCCCATTATCCGCGACGTCCAGTCGTACCGTACCGGCGGCCCCTTAATTCTAGCTAAAGTTAGTCTCATAGACGTACGGTCAACCATCTGTCGTTGTCGCAGCAGATTAATCACGCTTATTAAGATCCGCTTCTACATCCACGTCTCTAGTACCTGCTGCCGACGCATCCGCTAACTATGCTTCTTGTGCAGTCGTCTATGCGCCGAGCCCATTATCTAATCTCTGCATACTTGCGTGTCCACGGATTTCCGTATTCTAACCCTCGGATCTACGTTTGCCGGCTGTCGACCGATCTTTAGGCTCACAGTCATCCGACCTGATCTGGGTCTAGTGCCGGTTATTACAGTTGACAGGCCATGCGTTCCACGGTGGAGACTTATACGTTGGGTGGAATGCTTCTTTTAACTGTGCATTTGATCCCTGCGTCAGTCTTTTTCAGACCTTAAAACCGACCGTGCTTGACGGCGAGTCACCGTCCATATAATTGTAGGACCCCGCTCTAGACTCGTGAGGTCATAGTGTAGGCCAAAGAGGTTGTTGTGCTCGTCTGTACTACCAACAGCTCTAGAAGGACGCGTGCAGTCATGTGGAGATGGGGGGGCCGCACGTTGAAAGGACCGGCAAGTCCACGACCCAACAAGCGTTTCGATACCCAAGCAGGCAGATTCATGTTACAATCGTGCGGCTCAGTACCAATTGCATAGTGTCACAGACATTGATATGTCTAAGTACTTATCCCGGGATTGCCACCTGACAGTTT
Conditional expressions can be used to filter list comprehensions.
test_seq = 'GCATATATGCTAG'
gcs = [char for char in test_seq if char == 'G' or char == 'C']
print gcs
['G', 'C', 'G', 'C', 'G']
gc_count = sum(1 for char in test_seq if char == 'G' or char == 'C')
print gc_count
5
def gc_content(seq):
''' Returns the GC content of seq as a percentage. '''
gc_count = sum(1 for char in seq if char == 'G' or char == 'C')
return 100 * float(gc_count) / len(seq)
lambda expressions can be used to create anonymous functions.
We will use a lambda expression to tell max
to sort a list of tuples by the second element in each tuple.
records_gc = [(name, gc_content(seq)) for name, seq in records]
name, gc = max(records_gc, key=lambda (name, gc): gc)
print name
print gc
Rosalind_6955 52.6633165829
I have placed a file gc_content.py
in the same folder that this notebook is in. Now I can import gc_content
and have access to all its guts in a namespace called 'gc_content'.
import gc_content
This import
statement evaluates all of the code in gc_content
. Sometimes this is not what you want to happen. The file being imported will have some stuff it is supposed to do on its own that you don't care about when you are importing it. The if __name__ = '__main__':
idiom takes care of this problem.
!cat gc_content.py
def fasta_records(fh): ''' Yields (name, seq) pairs for fasta records in the file-like object fh. ''' name = fh.readline().strip().lstrip('>') while name: seq = '' line = fh.readline().strip() while line and not line.startswith('>'): seq += line line = fh.readline().strip() yield name, seq name = line.lstrip('>') def gc_content(seq): ''' Returns the GC content of seq as a percentage. ''' gc_count = sum(1 for char in seq if char == 'G' or char == 'C') return 100 * float(gc_count) / len(seq) if __name__ == '__main__': file_name = '../data/rosalind_gc.txt' fh = open(file_name) records = [(name, gc_content(seq)) for name, seq in fasta_records(fh)] name, gc = max(records, key=lambda (name, gc): gc) print name print gc
Notice that the fasta_records
has been changed to accept a file-like object instead of a file name, for reasons that will be clear below.
If we write docstrings for our functions, we can get reminders about what they do.
This is what pops up in your browser if you were to evaluate gc_content.fasta_records?
.
gc_content.fasta_records.__doc__
' Yields (name, seq) pairs for fasta records in the file-like object fh. '
Notice the yield
keyword in the fasta_records
code. That makes it a special kind of function called a generator. Generators are a convenient way to define your own iterators.
def simple_generator(n):
for i in range(n):
yield i
my_gen = simple_generator(4)
print [number**2 for number in my_gen]
[0, 1, 4, 9]
urllib
provides a way to read files from the web as seamlessly as if they were local.
import urllib
opened_url = urllib.urlopen('http://www.uniprot.org/uniprot/A2Z669.fasta')
name, seq = gc_content.fasta_records(opened_url).next()
print name
print seq
sp|A2Z669|CSPLT_ORYSI CASP-like protein OsI_33147 OS=Oryza sativa subsp. indica GN=OsI_33147 PE=2 SV=1 MRASRPVVHPVEAPPPAALAVAAAAVAVEAGVGAGGGAAAHGGENAQPRGVRMKDPPGAPGTPGGLGLRLVQAFFAAAALAVMASTDDFPSVSAFCYLVAAAILQCLWSLSLAVVDIYALLVKRSLRNPQAVCIFTIGDGITGTLTLGAACASAGITVLIGNDLNICANNHCASFETATAMAFISWFALAPSCVLNFWSMASR
Regular expressions make it easy to look for the N-glycosylation motif.
import urllib
motif = re.compile(r'(?=(N[^P][ST][^P]))')
file_name = 'data/rosalind_mprt.txt'
fh = open(file_name)
for line in fh:
uniprot_id = line.strip()
url = 'http://www.uniprot.org/uniprot/{0}.fasta'.format(uniprot_id)
opened_url = urllib.urlopen(url)
for name, seq in gc_content.fasta_records(opened_url):
ps = [str(match.start() + 1) for match in motif.finditer(seq)]
if ps:
print uniprot_id
print ' '.join(ps)
Q3T0C9 15 38 Q32LI2 157 P00740_FA9_HUMAN 203 213 P04233_HG2A_HUMAN 130 136 256 270 P22891_PRTZ_HUMAN 99 225 233 306 332 P80195_MPP3_BOVIN 95 P02974_FMM1_NEIGO 67 68 121 B3CNE0 107 P98119_URT1_DESRO 153 398 P21810_PGS1_HUMAN 270 311 Q9D9T0 154 Q4FZD7 528 P19823_ITH2_HUMAN 118 445