Sometimes it is useful to group functions and other objects in different files. Sometimes you need to use that fancy function you wrote two years ago. This is where modules in Python come in handy.
More officially, a module allows you to share code in the form of libraries. You've seen one example: the sys module in the standard library. There are many other modules in the standard library, as we'll see soon.
Any Python script can in principle be imported as a module. We can import whenever we can write a valid Python statement, in a script or in an interpreter session.
If a script is called script.py, then we use import script. This gives us access to the objects defined in script.py by prefixing them with script and a dot.
Keep in mind that this is not the only way to import Python modules. Refer to the Python documentation to find out more ways to do imports.
seq_toolbox.py as a module
Open an interpreter and try importing your module:
import seq_toolbox
Does this work? Why?
During a module import, Python executes all the statements inside the module.
To make our script work as a module (in the intended way), we need to add a check whether the module is imported or not:
#!/usr/bin/env python
import sys

def calc_gc_percent(seq):
    """
    Calculates the GC percentage of the given sequence.

    Arguments:
        - seq - the input sequence (string).

    Returns:
        - GC percentage (float).

    The returned value is always <= 100.0
    """
    at_count, gc_count = 0, 0
    # Change input to all caps to allow for non-capital
    # input sequence.
    for char in seq.upper():
        if char in ('A', 'T'):
            at_count += 1
        elif char in ('G', 'C'):
            gc_count += 1
        else:
            raise ValueError(
                "Unexpected character found: {}. Only "
                "ACTGs are allowed.".format(char))
    # Corner case handling: empty input sequence.
    try:
        return gc_count * 100.0 / (gc_count + at_count)
    except ZeroDivisionError:
        return 0.0

if __name__ == '__main__':
    input_seq = sys.argv[1]
    print "The sequence '{}' has %GC of {:.2f}".format(
        input_seq, calc_gc_percent(input_seq))
Now try importing the module again. What happens? Can you still use the module as a script?
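The effect of the __name__ check can also be seen in isolation. The sketch below is only an illustration: guard_demo and its contents are hypothetical names that are not part of the course material. It writes a tiny module to a temporary directory, imports it, and records which statements ran.

```python
import importlib
import os
import shutil
import sys
import tempfile

# Hypothetical module source, used only for this illustration.
MODULE_SOURCE = (
    "executed = ['top-level statement']\n"
    "if __name__ == '__main__':\n"
    "    executed.append('guarded block')\n"
)


def import_demo_module():
    """Write guard_demo.py to a temporary directory and import it."""
    tmp_dir = tempfile.mkdtemp()
    handle = open(os.path.join(tmp_dir, 'guard_demo.py'), 'w')
    try:
        handle.write(MODULE_SOURCE)
    finally:
        handle.close()
    # Make the temporary directory visible to the import machinery.
    sys.path.insert(0, tmp_dir)
    try:
        return importlib.import_module('guard_demo')
    finally:
        sys.path.remove(tmp_dir)
        shutil.rmtree(tmp_dir)


demo = import_demo_module()
# On import, __name__ is the module name, so the guarded block is skipped.
print(demo.__name__)   # guard_demo
print(demo.executed)   # ['top-level statement']
```

The top-level assignment ran during the import, while the guarded block did not, because __name__ is only equal to '__main__' when the file is run as a script.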
When a module is imported, we can access the objects defined in it:
import seq_toolbox
seq_toolbox.calc_gc_percent
<function seq_toolbox.calc_gc_percent>
By the way, remember that we added a docstring to the calc_gc_percent function? After importing our module, we can read up on how to use the function in its docstring:
seq_toolbox.calc_gc_percent?
seq_toolbox.calc_gc_percent('ACTG')
50.0
We can also expose an object inside the module directly into our current namespace using the from ... import ... statement:
from seq_toolbox import calc_gc_percent
calc_gc_percent('AAAG')
25.0
Sometimes, we want to alias the imported object to reduce the chance of it overwriting any already-defined objects with the same name. This is accomplished using the from ... import ... as ... statement:
from seq_toolbox import calc_gc_percent as gc_calc
gc_calc('AAAG')
25.0
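The import forms shown so far all bind names to the same underlying objects; none of them makes a copy. A small sketch using the standard library math module (the alias names m and square_root are arbitrary choices for this illustration):

```python
import math                             # access as math.sqrt
import math as m                        # same module, shorter name
from math import sqrt                   # bring one object into our namespace
from math import sqrt as square_root    # same object under an alias

# Every form refers to the very same module and function objects.
same_module = m is math
same_function = sqrt is math.sqrt and square_root is math.sqrt

print(same_module)       # True
print(same_function)     # True
print(square_root(16))   # 4.0
```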
In our case, Python finds the module by checking whether it exists in the current directory. This is not the only place Python looks, however.
A complete list of paths where Python looks for modules is available via the sys module as sys.path. It is composed of (in order):
The directory containing the script being run (or the current directory in an interpreter session).
Paths listed in the PYTHONPATH environment variable.
Installation-dependent default paths.
Official Python documentation: The Python Standard Library
Just to improve our knowledge, let's go through some of the most often used standard library modules.
os module
The Python Standard Library: 15.1. os — Miscellaneous operating system interfaces
The os module provides a portable way of using various operating system-specific functionality. It is a large module, but one of its most frequently used parts is the set of file-related functions.
import os
os.getcwd() # Get current directory.
'/home/martijn/projects/programming-course'
os.environ['PATH'] # Get the value of the environment variable PATH.
'/home/martijn/.virtualenvs/programming-course/bin:/home/martijn/projects/vcftools_0.1.11/bin:/home/martijn/projects/vcftools_0.1.11/cpp:/home/martijn/projects/muscle/muscle3.8.31/src:/home/martijn/projects/bedtools/bin:/home/martijn/projects/bamtools/bamtools/bin:/home/martijn/projects/gvnl/concordance/tabix:/home/martijn/projects/samtools-trunk:/home/martijn/projects/samtools-trunk/bcftools:/home/martijn/.venvburrito/bin:/home/martijn/coq-8.3-rc1/bin:/home/martijn/projects/kiek/trunk:/home/martijn/bin:/home/martijn/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'
my_filename = 'input.fastq'
os.path.splitext(my_filename) # Split the extension and filename.
('input', '.fastq')
# Join the current directory and `my_filename` to create a file path.
os.path.join(os.getcwd(), my_filename)
'/home/martijn/projects/programming-course/input.fastq'
os.path.exists(my_filename) # Check whether `my_filename` exists or not.
False
os.path.isdir('/home') # Checks whether '/home' is a directory.
True
os.path.isfile('/home') # Checks whether '/home' is a file.
False
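These functions combine well. As a minimal sketch, the hypothetical helper below (replace_extension is not part of os) builds an output file name from an input file name by swapping the extension:

```python
import os


def replace_extension(path, new_ext):
    """Return `path` with its extension replaced by `new_ext`."""
    # splitext() splits off the last extension, keeping any directory part.
    root, _old_ext = os.path.splitext(path)
    return root + new_ext


print(replace_extension('input.fastq', '.fasta'))   # input.fasta
```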
sys module
The Python Standard Library: 27.1. sys — System-specific parameters and functions
This module has various runtime-related and interpreter-related functions. We've seen two of the most commonly used: sys.argv and sys.path.
import sys
sys.path # List of places where Python looks for modules when importing.
['', '/home/martijn/.venvburrito/lib/python/distribute-0.6.49-py2.7.egg', '/home/martijn/.venvburrito/lib/python/pip-1.4.1-py2.7.egg', '/home/martijn/.venvburrito/lib/python2.7/site-packages', '/home/martijn/.venvburrito/lib/python', '/usr/local/samba/lib/python2.6/site-packages', '/home/martijn/.virtualenvs/programming-course/lib/python2.7', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/plat-linux2', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-tk', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-old', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages', '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages/gtk-2.0', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/site-packages', '/home/martijn/.virtualenvs/programming-course/lib/python2.7/site-packages/gtk-2.0', '/usr/local/samba/lib/python2.6/site-packages', '/usr/local/samba/lib/python2.6/site-packages', '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages/IPython/extensions']
sys.executable # Path to the current interpreter's executable.
'/home/martijn/.virtualenvs/programming-course/bin/python'
sys.version_info # Information about our Python version.
sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
sys.version_info.major # It also provides more granular access.
2
math module
The Python Standard Library: 9.2. math — Mathematical functions
Useful math-related functions can be found here. Other, more comprehensive modules exist (such as numpy, the subject of tomorrow's lesson), but math is still useful.
import math
math.log(10) # Natural log of 10.
2.302585092994046
math.log(100, 10) # Log base 10 of 100.
2.0
math.pow(3, 4) # 3 raised to the 4th power.
81.0
math.sqrt(2) # Square root of 2.
1.4142135623730951
math.pi # The value of pi.
3.141592653589793
random module
The Python Standard Library: 9.6. random — Generate pseudo-random numbers
The random module contains useful functions for generating pseudo-random numbers.
import random
random.random() # Random float x, such that 0.0 <= x < 1.0.
0.05941901356497081
random.randint(2, 17) # Random integer between 2 and 17, inclusive.
13
# Random choice of any items in the given list.
random.choice(['apple', 'banana', 'grape', 'kiwi', 'orange'])
'grape'
# Random sampling of 3 items from the given list.
random.sample(['apple', 'banana', 'grape', 'kiwi', 'orange'], 3)
['orange', 'apple', 'banana']
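Because the numbers are pseudo-random, the sequence of draws is fully determined by the generator's starting state. A short sketch, assuming we want repeatable results (for example in tests): calling random.seed() with the same value reproduces the same draws. The seed value 42 is arbitrary.

```python
import random

fruits = ['apple', 'banana', 'grape', 'kiwi', 'orange']

random.seed(42)   # Fix the starting state of the generator.
first_run = [random.random(), random.choice(fruits)]

random.seed(42)   # Same seed, so the same draws come out again.
second_run = [random.random(), random.choice(fruits)]

print(first_run == second_run)   # True
```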
re module
The Python Standard Library: 7.2. re — Regular expression operations
Regular expression-related functions are in the re module.
import re
my_seq = 'CAGTCAGT'
results1 = re.search(r'CA.+CA', my_seq)
results1.group(0)
'CAGTCA'
results2 = re.search(r'CCC..', my_seq)
print results2
None
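Since re.search() returns None when nothing matches, its result can be used directly in a condition. A minimal sketch, in which has_motif is a hypothetical helper name:

```python
import re


def has_motif(seq, pattern):
    """Return True if `pattern` matches anywhere in `seq`."""
    # re.search() gives a match object on success and None otherwise.
    return re.search(pattern, seq) is not None


my_seq = 'CAGTCAGT'
print(has_motif(my_seq, r'CA.+CA'))   # True
print(has_motif(my_seq, r'CCC..'))    # False
```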
argparse module
The Python Standard Library: 15.4. argparse — Parser for command-line options, arguments and sub-commands
Using sys.argv is neat for small scripts, but as our script gets larger and more complex, we want to be able to handle complex arguments too. The argparse module has handy functionality for creating command-line scripts.
argparse
Open your script/module in a text editor and replace import sys with import argparse. Remove all lines / blocks referencing sys.argv.
Change the if __name__ == '__main__' block to be the following:
if __name__ == '__main__':
    # Create our argument parser object.
    parser = argparse.ArgumentParser()
    # Add the expected argument.
    parser.add_argument('input_seq', type=str,
                        help="Input sequence")
    # Do the actual parsing.
    args = parser.parse_args()
    # And show the output.
    print "The sequence '{}' has %GC of {:.2f}".format(
        args.input_seq,
        calc_gc_percent(args.input_seq))
The code does look a little more verbose, but we get something better in return.
Go back to the shell and execute your script without any arguments. What happens?
Try executing the following command in the shell. What happens?
$ python seq_toolbox.py --help
We're just getting started on argparse. There are other useful bits that we'll see shortly after a small intro on file I/O.
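While experimenting, it can be handy to try a parser without going back to the shell. parse_args() accepts an optional list of strings to use instead of the real command line. A sketch that repeats the parser from the script; build_parser is a hypothetical helper name:

```python
import argparse


def build_parser():
    """Create the argument parser used by our script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('input_seq', type=str,
                        help="Input sequence")
    return parser


# Pass the arguments as a list instead of reading them from sys.argv.
args = build_parser().parse_args(['ACTG'])
print(args.input_seq)   # ACTG
```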
Opening files for reading or writing is done using the open function. It is commonly used with two arguments, name and mode.
These are some of the common file modes:
r: open file for reading (default).
w: open file for writing.
a: open file for appending content.
open?
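The difference between the three modes can be sketched as follows. The file name modes_demo.txt and the temporary directory are assumptions made for this illustration, so that no existing file is touched:

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')   # hypothetical file name

fw = open(path, 'w')    # 'w' creates the file, or empties an existing one.
fw.write('first line\n')
fw.close()

fa = open(path, 'a')    # 'a' keeps the content and adds to the end.
fa.write('second line\n')
fa.close()

fr = open(path, 'r')    # 'r' is the default mode and only allows reading.
content = fr.read()
fr.close()

shutil.rmtree(tmp_dir)  # Clean up the temporary directory.
print(repr(content))    # 'first line\nsecond line\n'
```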
Let's go through some ways of reading from a file.
fh = open('data/short_file.txt')
fh is a file handle object which we can use to retrieve the file contents. One simple way would be to read the whole file contents:
fh.read()
'this short file has two lines\nit is used in the example code\n'
Executing fh.read() a second time gives an empty string. This is because we have "walked" through the file to its end.
fh.read()
''
We can reset the handle to the beginning of the file again using the seek() function. Here, we use 0 as the argument since we want to move the handle to position 0 (beginning of the file):
fh.seek(0)
fh.read()
'this short file has two lines\nit is used in the example code\n'
In practice, reading the whole file into memory is not always a good idea. It is practical for small files, but not if our file is big (e.g., bigger than our memory). In this case, the alternative is to use the readline() function.
fh.seek(0)
fh.readline()
'this short file has two lines\n'
fh.readline()
'it is used in the example code\n'
fh.readline()
''
More common in Python is to use the for loop with the file handle itself. Python will automatically iterate over each line.
fh.seek(0)
for line in fh:
    print line
this short file has two lines

it is used in the example code
We can see that iteration exhausts the handle since we are at the end of the file after the loop.
fh.readline()
''
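Each line yielded by the loop still ends with its newline character, which is why printing it directly produces an empty line after every line. A common remedy is to strip the line ending first. A sketch that recreates the example file in a temporary location (the path is an assumption for this illustration):

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'short_file.txt')

fw = open(path, 'w')
fw.write('this short file has two lines\n'
         'it is used in the example code\n')
fw.close()

lines = []
fh = open(path)
for line in fh:
    # rstrip('\n') drops the trailing newline but keeps other whitespace.
    lines.append(line.rstrip('\n'))
fh.close()
shutil.rmtree(tmp_dir)

print(lines)
# ['this short file has two lines', 'it is used in the example code']
```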
We can also check the file handle position using the tell() function. If tell() returns a nonzero number, then we are not at the beginning of the file.
fh.tell()
61
Now that we're done with the file handle, we can call the close() method to free up any system resources still being used to keep the file open. After closing the file, we cannot use the file object anymore.
fh.close()
fh.readline()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-4e86183cf03e> in <module>()
----> 1 fh.readline()

ValueError: I/O operation on closed file
When writing files, we supply the w file mode explicitly:
fw = open('data/my_file.txt', 'w')
fw is a file handle similar to the fh that we've seen previously. It is used only for writing and not reading, however.
fw.read()
---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-59-73497a15302b> in <module>()
----> 1 fw.read()

IOError: File not open for reading
To write to the file, we use its write() method. Remember that Python does not add newline characters here (as opposed to when you use the print statement), so to move to a new line we have to add \n ourselves.
fw.write('This is my first line ')
fw.write('Still on my first line\n')
fw.write('Now on my second line')
As with the r mode, we can close the handle when we're done with it. The file can then be reopened with the r mode and we can check its contents.
fw.close()
fr = open('data/my_file.txt') # Remember to use the same file we wrote to.
for line in fr:
    print line
fr.close()
This is my first line Still on my first line

Now on my second line
And finally, to remove the file, we can use the remove() function from the os module.
os.remove('data/my_file.txt')
When reading / writing files, we are interacting with external resources that may or may not behave as expected. For example, we don't always have permission to read / write a file, the file itself may not exist, or we have a completely wrong idea of what's in the file. In situations like these, you are encouraged to use the try ... finally block.
The syntax is similar to the try ... except block that we've seen earlier (in fact they are part of the same block, as we'll see later). Unlike try ... except, the finally block in try ... finally is always executed regardless of any raised exceptions.
Let's take a look at some examples. First, the not recommended one:
f = open('data/short_file.txt')
for line in f:
    print int(line)
f.close()
print 'We closed our filehandle'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-f293b9e3578f> in <module>()
      1 f = open('data/short_file.txt')
      2 for line in f:
----> 3     print int(line)
      4 f.close()
      5 print 'We closed our filehandle'

ValueError: invalid literal for int() with base 10: 'this short file has two lines\n'
Apart from our erroneous conversion of a line of text to an integer, the exception raised because of it prevents the f.close() statement from being executed. At this point we have a stale open file handle.
Stubbornly trying to do the same thing again, this time we use a finally clause:
try:
    f = open('data/short_file.txt')
    for line in f:
        print int(line)
finally:
    f.close()
    print 'We closed our file handle'
We closed our file handle
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-71128226da53> in <module>()
      2     f = open('data/short_file.txt')
      3     for line in f:
----> 4         print int(line)
      5 finally:
      6     f.close()

ValueError: invalid literal for int() with base 10: 'this short file has two lines\n'
As you can see, this way the file handle still got closed.
Now, an even better way would be to also use the except block, to handle the exception we might get if we try it a third time.
try:
    f = open('data/short_file.txt')
    for line in f:
        print int(line)
except ValueError:
    print 'Seems there was a line we could not handle'
finally:
    f.close()
    print 'We closed our file handle'
Seems there was a line we could not handle
We closed our file handle
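One detail in these examples deserves care: open() is called inside the try block. If open() itself fails (say, because the file does not exist), f is never assigned, and the f.close() in the finally block fails in turn with a NameError. A possible variation, sketched here with the hypothetical function name read_integers, opens the file before entering try, so that finally only runs once there is a handle to close:

```python
import os
import tempfile


def read_integers(path):
    """Return the integers read from `path` up to the first bad line."""
    values = []
    f = open(path)      # A failure here leaves nothing to clean up.
    try:
        for line in f:
            values.append(int(line))
    except ValueError:
        pass            # Stop at the first line we could not handle.
    finally:
        f.close()       # Runs whether or not an exception was raised.
    return values


# Try it on a temporary file whose third line is not a number.
fd, tmp_path = tempfile.mkstemp()
os.write(fd, b'1\n2\nthree\n4\n')
os.close(fd)
result = read_integers(tmp_path)
os.remove(tmp_path)
print(result)   # [1, 2]
```

With this layout, a missing file surfaces as the original error from open() rather than as a confusing NameError.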
sys.stdout, sys.stderr, and sys.stdin
We've seen that the sys module provides some useful runtime functions. Now that we know about file handles, we can use three sys objects that are essentially file handles: sys.stdout, sys.stderr, and sys.stdin.
Together, they provide access to the standard output, standard error, and standard input streams. We can use them appropriately by writing to sys.stdout and sys.stderr, and reading from sys.stdin.
Unlike regular file handles, you don't need to close them after using (in fact you should not). The assumption is that these handles are always open to write to or to read from.
sys.stdout.write("I'm writing to stdout!\n")
I'm writing to stdout!
sys.stderr.write("Now to stderr.\n")
Now to stderr.
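Because these objects behave like file handles, a function can take the handle to write to as a parameter. In this sketch (report_gc and its out and err parameters are hypothetical names), results go to standard output and warnings go to standard error by default, which keeps the two apart when the output is redirected in the shell:

```python
import sys


def report_gc(seq, gc, out=None, err=None):
    """Write a result line to `out` and a warning for empty input to `err`."""
    # Fall back to the standard streams when no handle is given.
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    if not seq:
        err.write("Warning: empty sequence.\n")
    out.write("The sequence '{}' has %GC of {:.2f}\n".format(seq, gc))


report_gc('ACTG', 50.0)   # writes: The sequence 'ACTG' has %GC of 50.00
```

Any object with a write() method can be passed instead, such as a file opened in w mode.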
Before we go on to the exercise, let's do a final improvement on our script/module.
We want to add some extra functionality: the script should accept as its argument a path to a file containing sequences. It will then compute the GC percentage for each sequence in this file.
There are at least two things we need to do:
Open the script in your text editor, and change the if __name__ == '__main__' block to the following:
if __name__ == '__main__':
    # Create our argument parser object.
    parser = argparse.ArgumentParser()
    # Add argument for the input type.
    parser.add_argument(
        'mode', type=str, choices=['file', 'text'],
        help='Input type of the script')
    # Add argument for the input value.
    parser.add_argument(
        'value', type=str,
        help='Input value of the script')
    # Do the actual parsing.
    args = parser.parse_args()
    message = "The sequence '{}' has a %GC of {:.2f}"
    if args.mode == 'file':
        try:
            f = open(args.value, 'r')
            for line in f:
                seq = line.strip()
                gc = calc_gc_percent(seq)
                print message.format(seq, gc)
        finally:
            f.close()
    else:
        seq = args.value
        gc = calc_gc_percent(seq)
        print message.format(seq, gc)
Note the things we've done here:
Save the script, and try running it. What do you see? Is running
$ python seq_toolbox.py --help
helpful to resolve this?
Try running the script with the following command. What do you see?
$ python seq_toolbox.py file data/seq.txt
Feel free to look into data/seq.txt.
Write a script to print out the most common 7-mer and its GC percentage from all the sequences in data/records.fa. You are free to reuse your existing toolbox.
The example FASTA file was adapted from: Genome Biology DNA60 Bioinformatics Challenge
A FASTA file has header lines that start with a > character and sequence lines. We are only concerned with the sequence lines.
Find out how to change your script so that it can read from data/challenge.fa.gz without unzipping the file first (hint: standard library).
Can you change the parser so that there is an option flag to tell the program whether the input file is gzipped or not?
Can you change your script so that it works for any N-mers instead of for just 7-mers?