More Python Goodness (2)¶

Table of contents¶

Working with scripts
The standard library
String methods
Comments and docstrings
Detour: PEP8 and other PEPs
Errors and exceptions
Working with modules
Examples from the standard library
Reading and writing files
Assignment: Finding the most common 7-mer in a FASTA file
Further reading

Working with modules¶

Sometimes it is useful to group functions and other objects in different files. Sometimes you need to reuse that fancy function you wrote two years ago. This is where modules in Python come in handy.

More officially, a module allows you to share code in the form of libraries. You've seen one example: the sys module in the standard library. There are many other modules in the standard library, as we'll see soon.

What modules look like¶

Any Python script can in principle be imported as a module. We can import wherever we can write a valid Python statement, in a script or in an interpreter session.

If a script is called script.py, then we use import script. This gives us access to the objects defined in script.py by prefixing them with script and a dot.

Keep in mind that this is not the only way to import Python modules. Refer to the Python documentation to find out more ways to do imports.

Using `seq_toolbox.py` as a module¶

Open an interpreter and try importing your module:

import seq_toolbox

Does this work? Why?

Improving our script for importing¶

During a module import, Python executes all the statements inside the module.

To make our script work as a module (in the intended way), we need to add a check whether the module is imported or not:

#!/usr/bin/env python
import sys

def calc_gc_percent(seq):
    """
    Calculates the GC percentage of the given sequence.

    Arguments:
        - seq - the input sequence (string).

    Returns:
        - GC percentage (float).

    The returned value is always <= 100.0
    """
    at_count, gc_count = 0, 0
    # Change input to all caps to allow for non-capital
    # input sequence.
    for char in seq.upper():
        if char in ('A', 'T'):
            at_count += 1
        elif char in ('G', 'C'):
            gc_count += 1
        else:
            raise ValueError(
                "Unexpected character found: {}. Only "
                "ACTGs are allowed.".format(char))

    # Corner case handling: empty input sequence.
    try:
        return gc_count * 100.0 / (gc_count + at_count)
    except ZeroDivisionError:
        return 0.0

if __name__ == '__main__':
    input_seq = sys.argv[1]
    print "The sequence '{}' has %GC of {:.2f}".format(
              input_seq, calc_gc_percent(input_seq))
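To see the effect of the __name__ check in isolation, here is a minimal sketch. It writes a tiny throwaway file (name_demo.py, a hypothetical name that is not part of the course files) and executes it twice with runpy.run_path from the standard library: once with __name__ set to '__main__', which mimics running the file as a script, and once under its own module name, which mimics what happens during an import.

```python
import os
import runpy
import shutil
import tempfile

# Source of a hypothetical throwaway module (not part of the course files).
source = (
    "messages = ['top-level statement executed']\n"
    "if __name__ == '__main__':\n"
    "    messages.append('main block executed')\n"
)

# Write the source to a file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, 'name_demo.py')
handle = open(module_path, 'w')
try:
    handle.write(source)
finally:
    handle.close()

# Execute the file with __name__ == '__main__', mimicking a script run.
as_script = runpy.run_path(module_path, run_name='__main__')
# Execute the file under its module name, mimicking an import.
as_module = runpy.run_path(module_path, run_name='name_demo')

print(as_script['messages'])    # Two entries: the main block ran.
print(as_module['messages'])    # One entry: the main block was skipped.

# Clean up the temporary directory.
shutil.rmtree(tmp_dir)
```

In both cases the top-level statement runs, but the guarded block only runs when __name__ is '__main__'.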

Now try importing the module again. What happens? Can you still use the module as a script?

Using modules¶

When a module is imported, we can access the objects defined in it:

In [1]:

import seq_toolbox

In [2]:

seq_toolbox.calc_gc_percent

Out[2]:

<function seq_toolbox.calc_gc_percent>

By the way, remember that we added a docstring to the calc_gc_percent function? After importing our module, we can read up on how to use the function in its docstring:

In [3]:

seq_toolbox.calc_gc_percent?

In [4]:

seq_toolbox.calc_gc_percent('ACTG')

Out[4]:

50.0

We can also expose an object inside the module directly into our current namespace using the from ... import ... statement:

In [5]:

from seq_toolbox import calc_gc_percent

In [6]:

calc_gc_percent('AAAG')

Out[6]:

25.0

Sometimes, we want to alias the imported object to reduce the chance of it overwriting any already-defined objects with the same name. This is accomplished using the from ... import ... as ... statement:

In [7]:

from seq_toolbox import calc_gc_percent as gc_calc

In [8]:

gc_calc('AAAG')

Out[8]:

25.0
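The three import forms can also be tried without seq_toolbox. The sketch below uses the math module from the standard library as a stand-in; the alias square_root is just a name chosen for this illustration.

```python
# Form 1: import the module and use the module name as a prefix.
import math
print(math.sqrt(16))

# Form 2: expose one object directly in the current namespace.
from math import sqrt
print(sqrt(16))

# Form 3: the same, but under an alias of our choosing.
from math import sqrt as square_root
print(square_root(16))

# All three names refer to the very same function object.
print(math.sqrt is sqrt is square_root)
```

Whichever form we use, the module itself is only loaded once; the forms differ only in which names end up in our namespace.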

(A simple guide on) How modules are discovered¶

In our case, Python imports by checking whether the module exists in the current directory. This is not the only place Python looks, however.

A complete list of paths where Python looks for modules is available via the sys module as sys.path. It is composed of (in order):

The directory containing the script being run (or the current directory in an interactive session).
The PYTHONPATH environment variable.
Installation-dependent defaults.
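As a rough illustration of how this list drives module discovery, the sketch below creates a module in a temporary directory that is not on sys.path, shows that importing it fails, and then appends the directory to sys.path so that the import succeeds. The module name path_demo and its GREETING variable are made up for this example.

```python
import os
import shutil
import sys
import tempfile

# Create a hypothetical module in a directory Python does not know about.
tmp_dir = tempfile.mkdtemp()
handle = open(os.path.join(tmp_dir, 'path_demo.py'), 'w')
try:
    handle.write("GREETING = 'found me'\n")
finally:
    handle.close()

# The directory is not in sys.path yet, so the import fails.
try:
    import path_demo
    found_before = True
except ImportError:
    found_before = False
print(found_before)

# After adding the directory to sys.path, the import succeeds.
sys.path.append(tmp_dir)
import path_demo
print(path_demo.GREETING)

# Clean up: restore sys.path and remove the temporary directory.
sys.path.remove(tmp_dir)
shutil.rmtree(tmp_dir)
```

Changing sys.path at runtime is mostly useful for experiments like this one; for real projects, setting PYTHONPATH or installing the module is usually the tidier option.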

Examples from the standard library¶

Official Python documentation: The Python Standard Library

Just to improve our knowledge, let's go through some of the most often used standard library modules.

The standard library: `os` module¶

The Python Standard Library: 15.1. os — Miscellaneous operating system interfaces

The os module provides a portable way of using operating system-specific functionality. It is a large module, but some of its most frequently used parts are the file-related functions.

In [9]:

import os

In [10]:

os.getcwd()    # Get current directory.

Out[10]:

'/home/martijn/projects/programming-course'

In [11]:

os.environ['PATH']    # Get the value of the environment variable PATH.

Out[11]:

'/home/martijn/.virtualenvs/programming-course/bin:/home/martijn/projects/vcftools_0.1.11/bin:/home/martijn/projects/vcftools_0.1.11/cpp:/home/martijn/projects/muscle/muscle3.8.31/src:/home/martijn/projects/bedtools/bin:/home/martijn/projects/bamtools/bamtools/bin:/home/martijn/projects/gvnl/concordance/tabix:/home/martijn/projects/samtools-trunk:/home/martijn/projects/samtools-trunk/bcftools:/home/martijn/.venvburrito/bin:/home/martijn/coq-8.3-rc1/bin:/home/martijn/projects/kiek/trunk:/home/martijn/bin:/home/martijn/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games'

In [12]:

my_filename = 'input.fastq'

In [13]:

os.path.splitext(my_filename)    # Split the extension and filename.

Out[13]:

('input', '.fastq')

In [14]:

# Join the current directory and `my_filename` to create a file path.
os.path.join(os.getcwd(), my_filename)

Out[14]:

'/home/martijn/projects/programming-course/input.fastq'

In [15]:

os.path.exists(my_filename)    # Check whether `my_filename` exists or not.

Out[15]:

False

In [16]:

os.path.isdir('/home')    # Check whether '/home' is a directory.

Out[16]:

True

In [17]:

os.path.isfile('/home')    # Check whether '/home' is a file.

Out[17]:

False
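These os.path functions are often combined. As a small sketch, the two helpers below (swap_extension and output_path, both hypothetical names invented for this example) derive an output file path from an input file path.

```python
import os


def swap_extension(path, new_ext):
    """Return `path` with its extension replaced by `new_ext`."""
    root, old_ext = os.path.splitext(path)
    return root + new_ext


def output_path(input_path, out_dir, new_ext):
    """Build a path in `out_dir` named after `input_path`, with `new_ext`."""
    # basename() drops the directory part and keeps only the file name.
    base = os.path.basename(input_path)
    return os.path.join(out_dir, swap_extension(base, new_ext))


print(swap_extension('input.fastq', '.fasta'))
print(output_path('/data/run1/input.fastq', 'results', '.fasta'))
```

Using os.path.join instead of gluing strings together with '/' keeps the code portable across operating systems.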

The standard library: `sys` module¶

The Python Standard Library: 27.1. sys — System-specific parameters and functions

This module has various runtime-related and interpreter-related functions. We've seen two of the most commonly used: sys.argv and sys.path.

In [18]:

import sys

In [19]:

sys.path    # List of places where Python looks for modules when importing.

Out[19]:

['',
 '/home/martijn/.venvburrito/lib/python/distribute-0.6.49-py2.7.egg',
 '/home/martijn/.venvburrito/lib/python/pip-1.4.1-py2.7.egg',
 '/home/martijn/.venvburrito/lib/python2.7/site-packages',
 '/home/martijn/.venvburrito/lib/python',
 '/usr/local/samba/lib/python2.6/site-packages',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/plat-linux2',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-tk',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-old',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/lib-dynload',
 '/usr/lib/python2.7',
 '/usr/lib/python2.7/plat-linux2',
 '/usr/lib/python2.7/lib-tk',
 '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages',
 '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages/gtk-2.0',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/site-packages',
 '/home/martijn/.virtualenvs/programming-course/lib/python2.7/site-packages/gtk-2.0',
 '/usr/local/samba/lib/python2.6/site-packages',
 '/usr/local/samba/lib/python2.6/site-packages',
 '/home/martijn/.virtualenvs/programming-course/local/lib/python2.7/site-packages/IPython/extensions']

In [20]:

sys.executable    # Path to the current interpreter's executable.

Out[20]:

'/home/martijn/.virtualenvs/programming-course/bin/python'

In [21]:

sys.version_info     # Information about our Python version.

Out[21]:

sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)

In [22]:

sys.version_info.major    # It also provides more granular access.

Out[22]:

2
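One practical use of sys.version_info is checking the interpreter version at the start of a script. The following is only a sketch; the minimum version (2, 7) and the variable name is_python3 are choices made for this illustration.

```python
import sys

# version_info compares like a tuple, so we can test against (major, minor).
if sys.version_info < (2, 7):
    sys.stderr.write('This script needs Python 2.7 or newer.\n')
    sys.exit(1)

# The named fields give access to the individual parts.
is_python3 = sys.version_info.major >= 3
print(is_python3)
```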

The standard library: `math` module¶

The Python Standard Library: 9.2. math — Mathematical functions

Useful math-related functions can be found here. More comprehensive modules exist (such as numpy, the subject of tomorrow's lesson), but math is still useful.

In [23]:

import math

In [24]:

math.log(10)    # Natural log of 10.

Out[24]:

2.302585092994046

In [25]:

math.log(100, 10)    # Log base 10 of 100.

Out[25]:

2.0

In [26]:

math.pow(3, 4)    # 3 raised to the 4th power.

Out[26]:

81.0

In [27]:

math.sqrt(2)    # Square root of 2.

Out[27]:

1.4142135623730951

In [28]:

math.pi    # The value of pi.

Out[28]:

3.141592653589793

The standard library: `random` module¶

The Python Standard Library: 9.6. random — Generate pseudo-random numbers

The random module contains useful functions for generating pseudo-random numbers.

In [29]:

import random

In [30]:

random.random()    # Random float x, such that 0.0 <= x < 1.0.

Out[30]:

0.05941901356497081

In [31]:

random.randint(2, 17)    # Random integer between 2 and 17, inclusive.

Out[31]:

In [32]:

# Random choice of one item from the given list.
random.choice(['apple', 'banana', 'grape', 'kiwi', 'orange'])

Out[32]:

'grape'

In [33]:

# Random sampling of 3 items from the given list.
random.sample(['apple', 'banana', 'grape', 'kiwi', 'orange'], 3)

Out[33]:

['orange', 'apple', 'banana']
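Because the numbers are pseudo-random, they can be made reproducible by supplying a seed, which is not shown in the cells above. The sketch below generates a random DNA sequence; random_dna is a hypothetical helper, and it uses its own random.Random instance so that the seed does not affect other code.

```python
import random


def random_dna(length, seed=None):
    """Return a random DNA sequence of the given length.

    With seed=None the result differs between calls; with a fixed
    seed the same sequence is produced every time.
    """
    rng = random.Random(seed)
    return ''.join(rng.choice('ACGT') for _ in range(length))


first = random_dna(20, seed=1)
second = random_dna(20, seed=1)
print(first)
print(first == second)    # Same seed, same sequence.
```

Fixing the seed is handy when we want test data that stays the same between runs.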

The standard library: `re` module¶

The Python Standard Library: 7.2. re — Regular expression operations

Regular expression-related functions are in the re module.

In [34]:

import re

In [35]:

my_seq = 'CAGTCAGT'

In [36]:

results1 = re.search(r'CA.+CA', my_seq)

In [37]:

results1.group(0)

Out[37]:

'CAGTCA'

In [38]:

results2 = re.search(r'CCC..', my_seq)

In [39]:

print results2

None
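Since re.search returns None when nothing matches, calling group() on the result without checking would raise an AttributeError. A minimal sketch of a defensive wrapper follows; find_motif is a hypothetical helper name.

```python
import re


def find_motif(pattern, seq):
    """Return (matched_text, start_position), or None if there is no match."""
    match = re.search(pattern, seq)
    if match is None:
        return None
    return match.group(0), match.start()


print(find_motif(r'CA.+CA', 'CAGTCAGT'))
print(find_motif(r'CCC..', 'CAGTCAGT'))
```

The first call finds 'CAGTCA' starting at position 0; the second finds nothing and returns None.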

The standard library: `argparse` module¶

The Python Standard Library: 15.4. argparse — Parser for command-line options, arguments and sub-commands

Using sys.argv is neat for small scripts, but as our script gets larger and more complex, we want to be able to handle complex arguments too. The argparse module provides handy functionality for building command-line scripts.

Improving our script with `argparse`¶

Open your script/module in a text editor and replace import sys with import argparse. Remove all lines / blocks referencing sys.argv.

Change the if __name__ == '__main__' block to be the following:

if __name__ == '__main__':
    # Create our argument parser object.
    parser = argparse.ArgumentParser()
    # Add the expected argument.
    parser.add_argument('input_seq', type=str,
                        help="Input sequence")
    # Do the actual parsing.
    args = parser.parse_args()
    # And show the output.
    print "The sequence '{}' has %GC of {:.2f}".format(
              args.input_seq,
              calc_gc_percent(args.input_seq))

The code does look a little more verbose, but we get something better in return.

Go back to the shell and execute your script without any arguments. What happens?

Try executing the following command in the shell. What happens?

$ python seq_toolbox.py --help

We're just getting started on argparse. There are other useful bits that we'll see shortly after a small intro on file I/O.
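While experimenting, it can be convenient to try a parser from the interpreter instead of the shell. The parse_args method accepts an explicit list of strings that stands in for the command line, as in this small sketch:

```python
import argparse

# The same parser as in our script.
parser = argparse.ArgumentParser()
parser.add_argument('input_seq', type=str, help="Input sequence")

# Passing a list replaces the arguments normally taken from sys.argv.
args = parser.parse_args(['ACTG'])
print(args.input_seq)
```

When parse_args is called without arguments, as in our script, it reads from sys.argv instead.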

Reading and writing files¶

Opening files for reading or writing is done using the open function. It is commonly used with two arguments, name and mode:

name is the name of the file to open.
mode specifies how the file should be handled.

These are some of the common file modes:

r: open file for reading (default).
w: open file for writing.
a: open file for appending content.

In [40]:

open?

Reading files¶

Let's go through some ways of reading from a file.

In [41]:

fh = open('data/short_file.txt')

fh is a file handle object which we can use to retrieve the file contents. One simple way would be to read the whole file contents:

In [42]:

fh.read()

Out[42]:

'this short file has two lines\nit is used in the example code\n'

Executing fh.read() a second time gives an empty string. This is because we have "walked" through the file to its end.

In [43]:

fh.read()

Out[43]:

''

We can reset the handle to the beginning of the file again using the seek() method. Here, we use 0 as the argument since we want to move the handle to position 0 (beginning of the file):

In [44]:

fh.seek(0)

In [45]:

fh.read()

Out[45]:

'this short file has two lines\nit is used in the example code\n'

In practice, reading the whole file into memory is not always a good idea. It is practical for small files, but not if our file is big (e.g., bigger than our memory). In this case, an alternative is to read one line at a time with the readline() method.

In [46]:

fh.seek(0)

In [47]:

fh.readline()

Out[47]:

'this short file has two lines\n'

In [48]:

fh.readline()

Out[48]:

'it is used in the example code\n'

In [49]:

fh.readline()

Out[49]:

''

More common in Python is to use the for loop with the file handle itself. Python will automatically iterate over each line.

In [50]:

fh.seek(0)

In [51]:

for line in fh:
    print line

this short file has two lines

it is used in the example code

We can see that iteration exhausts the handle since we are at the end of the file after the loop.

In [52]:

fh.readline()

Out[52]:

''
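The blank lines in the loop output above appear because each line still ends with its own newline character, and print adds another one. A common remedy is to strip the trailing newline while iterating. The sketch below first creates a small throwaway file so that it can run on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
fw = open(path, 'w')
fw.write('this short file has two lines\n'
         'it is used in the example code\n')
fw.close()

# Iterate over the file and strip the trailing newline from each line.
fh = open(path)
lines = []
for line in fh:
    lines.append(line.rstrip('\n'))
fh.close()

print(lines)

# Remove the throwaway file again.
os.remove(path)
```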

We can also check the file handle position using the tell() method. If tell() returns a nonzero number, then we are not at the beginning of the file.

In [53]:

fh.tell()

Out[53]:

Now that we're done with the file handle, we can call the close() method to free up any system resources still being used to keep the file open. After closing the file, we cannot use the file object anymore.

In [54]:

fh.close()

In [55]:

fh.readline()

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-4e86183cf03e> in <module>()
----> 1 fh.readline()

ValueError: I/O operation on closed file

Writing files¶

When writing files, we supply the w file mode explicitly:

In [58]:

fw = open('data/my_file.txt', 'w')

fw is a file handle similar to the fh that we've seen previously. However, it can only be used for writing, not for reading.

In [59]:

fw.read()

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-59-73497a15302b> in <module>()
----> 1 fw.read()

IOError: File not open for reading

To write to the file, we use its write() method. Remember that Python does not add newline characters here (as opposed to when you use the print statement), so to move to a new line we have to add \n ourselves.

In [60]:

fw.write('This is my first line ')

In [61]:

fw.write('Still on my first line\n')

In [62]:

fw.write('Now on my second line')

As with the r mode, we can close the handle when we're done with it. The file can then be reopened with the r mode and we can check its contents.

In [63]:

fw.close()

In [64]:

fr = open('data/my_file.txt')    # Remember to use the same file we wrote to.
for line in fr:
    print line
fr.close()

This is my first line Still on my first line

Now on my second line

And finally, to remove the file, we can use the remove() function from the os module.

In [65]:

os.remove('data/my_file.txt')
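The a mode from the list of file modes was not used above. The sketch below contrasts it with w on a throwaway file: appending keeps the existing content, whereas reopening with w starts from an empty file.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# Write a first line using the 'w' mode.
fw = open(path, 'w')
fw.write('first line\n')
fw.close()

# The 'a' mode adds to the end and keeps the existing content.
fa = open(path, 'a')
fa.write('second line\n')
fa.close()

fr = open(path)
after_append = fr.read()
fr.close()
print(repr(after_append))

# Reopening with 'w' discards whatever was in the file before.
fw = open(path, 'w')
fw.write('replaced\n')
fw.close()

fr = open(path)
after_rewrite = fr.read()
fr.close()
print(repr(after_rewrite))

os.remove(path)
```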

Be cautious when using file handles¶

When reading / writing files, we are interacting with external resources that may or may not behave as expected. For example, we don't always have permission to read / write a file, the file itself may not exist, or we have a completely wrong idea of what's in the file. In situations like these, you are encouraged to use the try ... finally block.

The syntax is similar to the try ... except block that we've seen earlier (in fact, they are part of the same construct, as we'll see later). Unlike an except block, the finally block is always executed, whether or not an exception was raised.

Let's take a look at some examples. First, the not recommended one:

In [66]:

f = open('data/short_file.txt')
for line in f:
    print int(line)
f.close()
print 'We closed our filehandle'

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-f293b9e3578f> in <module>()
      1 f = open('data/short_file.txt')
      2 for line in f:
----> 3     print int(line)
      4 f.close()
      5 print 'We closed our filehandle'

ValueError: invalid literal for int() with base 10: 'this short file has two lines\n'

Our erroneous conversion of a line of text to an integer raises an exception, which means the f.close() statement is never executed. At this point we have a stale open file handle.

Stubbornly trying to do the same thing again, this time we use a finally clause:

In [67]:

try:
    f = open('data/short_file.txt')
    for line in f:
        print int(line)
finally:
    f.close()
    print 'We closed our file handle'

We closed our file handle

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-67-71128226da53> in <module>()
      2     f = open('data/short_file.txt')
      3     for line in f:
----> 4         print int(line)
      5 finally:
      6     f.close()

ValueError: invalid literal for int() with base 10: 'this short file has two lines\n'

As you can see, this way the file handle still got closed.

Now, an even better way would be to also use an except block, to handle the exception we might get if we try it a third time.

In [68]:

try:
    f = open('data/short_file.txt')
    for line in f:
        print int(line)
except ValueError:
    print 'Seems there was a line we could not handle'
finally:
    f.close()
    print 'We closed our file handle'

Seems there was a line we could not handle
We closed our file handle
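One caveat about the examples above: open() sits inside the try block. If opening the file itself fails, f is never assigned, and the f.close() call in the finally block then raises a NameError of its own. A possible way around this, sketched below, is to open the file before entering the try ... finally block. The helper sum_numbers and its choice to return None on failure are assumptions made for this illustration.

```python
import os
import tempfile


def sum_numbers(path):
    """Sum the integers in a file with one number per line.

    Returns None if the file cannot be opened or if a line is not a
    number. (Hypothetical helper, written only for this illustration.)
    """
    # Opening happens outside the try ... finally block: if it fails,
    # there is no file handle that needs closing.
    try:
        f = open(path)
    except IOError:
        return None

    try:
        total = 0
        for line in f:
            total += int(line)
        return total
    except ValueError:
        return None
    finally:
        # Runs on success, on ValueError, and on any other exception.
        f.close()


# Small demonstration with a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
fw = open(demo_path, 'w')
try:
    fw.write('1\n2\n3\n')
finally:
    fw.close()

print(sum_numbers(demo_path))                 # The file holds 1, 2 and 3.
print(sum_numbers(demo_path + '.missing'))    # This file does not exist.
os.remove(demo_path)
```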

Intermezzo: `sys.stdout`, `sys.stderr`, and `sys.stdin`¶

We've seen that the sys module provides some useful runtime functions. Now that we know about file handles, we can use three sys objects that are essentially file handles: sys.stdout, sys.stderr, and sys.stdin.

Together, they provide access to the standard output, standard error, and standard input streams. We can use them appropriately by writing to sys.stdout and sys.stderr, and reading from sys.stdin.

Unlike regular file handles, you don't need to close them after using them (in fact, you should not). The assumption is that these handles are always open to write to or to read from.

In [69]:

sys.stdout.write("I'm writing to stdout!\n")

I'm writing to stdout!

In [70]:

sys.stderr.write("Now to stderr.\n")

Now to stderr.
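Because these objects behave like file handles, a function that accepts handles as arguments works with them just as well as with regular files. In the sketch below, echo_upper is a hypothetical helper; a plain list of lines stands in for the input so that the example does not wait for keyboard input, but in a real script the first argument could be sys.stdin.

```python
import sys


def echo_upper(in_handle, out_handle, err_handle):
    """Write each input line to out_handle in upper case.

    A short summary goes to err_handle, so that it does not mix with
    the actual output. Returns the number of lines processed.
    """
    count = 0
    for line in in_handle:
        out_handle.write(line.upper())
        count += 1
    err_handle.write('Processed {0} line(s).\n'.format(count))
    return count


# A list of lines stands in for sys.stdin here.
n_lines = echo_upper(['acgt\n', 'ttga\n'], sys.stdout, sys.stderr)
```

Keeping results on standard output and messages on standard error means the output can be redirected to a file without the messages ending up in it.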

Improving our script to allow input from a file¶

Before we go on to the exercise, let's do a final improvement on our script/module.

We want to add some extra functionality: the script should accept as its argument a path to a file containing sequences. It will then compute the GC percentage for each sequence in this file.

There are at least two things we need to do:

Change the argument parser so that it deals with a new execution mode.
Add some statements to read from a file.

Open the script in your text editor, and change the if __name__ == '__main__' block to the following:

if __name__ == '__main__':
    # Create our argument parser object.
    parser = argparse.ArgumentParser()
    # Add argument for the input type.
    parser.add_argument(
        'mode', type=str, choices=['file', 'text'],
        help='Input type of the script')
    # Add argument for the input value.
    parser.add_argument(
        'value', type=str,
        help='Input value of the script')
    # Do the actual parsing.
    args = parser.parse_args()

    message = "The sequence '{}' has a %GC of {:.2f}"

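    if args.mode == 'file':
        # Open the file before the try block: if open() fails, there is
        # no handle to close, so the finally block should not run.
        f = open(args.value, 'r')
        try:
            for line in f:
                seq = line.strip()
                gc = calc_gc_percent(seq)
                print message.format(seq, gc)
        finally:
            f.close()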
    else:
        seq = args.value
        gc = calc_gc_percent(seq)
        print message.format(seq, gc)

Note the things we've done here:

We've added a new argument to our parser to specify the input type.
Correspondingly, we've expanded our script to handle both input types.

Save the script, and try running it. What do you see? Is running

$ python seq_toolbox.py --help

helpful to resolve this?

Try running the script with the following command. What do you see?

$ python seq_toolbox.py file data/seq.txt

Feel free to look into data/seq.txt.

Assignment: Finding the most common 7-mer in a FASTA file¶

Your task¶

Write a script to print out the most common 7-mer and its GC percentage from all the sequences in data/records.fa. You are free to reuse your existing toolbox.

The example FASTA file was adapted from: Genome Biology DNA60 Bioinformatics Challenge

Hints¶

FASTA files have two types of lines: header lines starting with a > character and sequence lines. We are only concerned with the sequence lines.
Read the string functions documentation.
Read the documentation for built-in functions.

Challenges¶

Find out how to change your script so that it can read from data/challenge.fa.gz without unzipping the file first (hint: standard library).
Can you change the parser so that there is an option flag to tell the program whether the input file is gzipped or not?
Can you change your script so that it works for any N-mers instead of for just 7-mers?

Acknowledgements¶

Wibowo Arindrarto

Martijn Vermaat

Jeroen Laros

Based on¶

Python Scientific Lecture Notes

License¶

Creative Commons Attribution 3.0 License (CC-by)