#!/usr/bin/env python
# coding: utf-8

# # More Python Goodness (2)
# ***
# ## Table of contents
#
# 1. Working with scripts
# 2. The standard library
# 3. String methods
# 4. Comments and docstrings
# 5. Detour: PEP8 and other PEPs
# 6. Errors and exceptions
# 7. [Working with modules](#modules)
# 8. [Examples from the standard library](#stdlib-examples)
# 9. [Reading and writing files](#io)
# 10. [Assignment: Finding the most common 7-mer in a FASTA file](#assignment)
# 11. [Further reading](#further)

#
# ## Working with modules
#
# Sometimes it is useful to group functions and other objects in different files. Sometimes you need to reuse that fancy function you wrote two years ago. This is where modules in Python come in handy.
#
# More officially, a **module** allows you to share code in the form of libraries. You've seen one example: the `sys` module in the standard library. There are many other modules in the standard library, as we'll see soon.

# ### What modules look like
#
# Any Python script can in principle be imported as a module. We can import wherever we can write a valid Python statement, in a script or in an interpreter session.
#
# If a script is called `script.py`, then we use `import script`. This gives us access to the objects defined in `script.py` by prefixing them with `script` and a dot.
#
# Keep in mind that this is not the only way to import Python modules. Refer to the Python documentation to find out more ways to do imports.

# ### Using `seq_toolbox.py` as a module
#
# Open an interpreter and try importing your module:
#
# ```python
# import seq_toolbox
# ```
# Does this work? Why?

# ### Improving our script for importing
#
# During a module import, Python executes all the statements inside the module.
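#
# A minimal sketch of this behaviour, assuming a throwaway module with the hypothetical name `demo_toolbox` (it is not part of the course material) that we write to a temporary directory. Its top-level statements run at import time, while a block guarded by a test on `__name__` is skipped, because during an import `__name__` holds the module name rather than `'__main__'`. The sketch only uses constructs that behave the same in Python 2.7 and Python 3:

```python
import os
import shutil
import sys
import tempfile

# Source code of the throwaway module (hypothetical name: demo_toolbox).
MODULE_SOURCE = (
    "executed = ['top-level statement ran']\n"
    "module_name = __name__\n"
    "if __name__ == '__main__':\n"
    "    executed.append('main block ran')\n"
)

# Write the module to a temporary directory and make that directory
# visible to the import machinery.
tmp_dir = tempfile.mkdtemp()
module_path = os.path.join(tmp_dir, 'demo_toolbox.py')
handle = open(module_path, 'w')
handle.write(MODULE_SOURCE)
handle.close()
sys.path.insert(0, tmp_dir)

import demo_toolbox

# The top-level assignments were executed during the import, but the
# guarded block was skipped.
print(demo_toolbox.executed)
# During an import, __name__ is the module name, not '__main__'.
print(demo_toolbox.module_name)

# Clean up the temporary directory and the path entry we added.
sys.path.remove(tmp_dir)
shutil.rmtree(tmp_dir)
```

# Running the same file directly (`python demo_toolbox.py`) would set `__name__` to `'__main__'` instead, so the guarded block would run as well.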
#
# To make our script work as a module (in the intended way), we need to add a check for whether the module is being imported or run as a script:
# ```python
# #!/usr/bin/env python
# import sys
#
# def calc_gc_percent(seq):
#     """
#     Calculates the GC percentage of the given sequence.
#
#     Arguments:
#         - seq - the input sequence (string).
#
#     Returns:
#         - GC percentage (float).
#
#     The returned value is always <= 100.0
#     """
#     at_count, gc_count = 0, 0
#     # Change input to all caps to allow for non-capital
#     # input sequence.
#     for char in seq.upper():
#         if char in ('A', 'T'):
#             at_count += 1
#         elif char in ('G', 'C'):
#             gc_count += 1
#         else:
#             raise ValueError(
#                 "Unexpected character found: {}. Only "
#                 "ACTGs are allowed.".format(char))
#
#     # Corner case handling: empty input sequence.
#     try:
#         return gc_count * 100.0 / (gc_count + at_count)
#     except ZeroDivisionError:
#         return 0.0
#
# if __name__ == '__main__':
#     input_seq = sys.argv[1]
#     print "The sequence '{}' has %GC of {:.2f}".format(
#         input_seq, calc_gc_percent(input_seq))
# ```
# Now try importing the module again. What happens? Can you still use the module as a script?

# ### Using modules
#
# When a module is imported, we can access the objects defined in it:

# In[1]:

import seq_toolbox

# In[2]:

seq_toolbox.calc_gc_percent

# By the way, remember we added a docstring to the `calc_gc_percent` function? After importing our module, we can read up on how to use the function in its docstring:

# In[3]:

get_ipython().run_line_magic('pinfo', 'seq_toolbox.calc_gc_percent')

# In[4]:

seq_toolbox.calc_gc_percent('ACTG')

# We can also expose an object inside the module directly into our current namespace using the `from ... import ...` statement:

# In[5]:

from seq_toolbox import calc_gc_percent

# In[6]:

calc_gc_percent('AAAG')

# Sometimes, we want to alias the imported object to reduce the chance of it overwriting any already-defined objects with the same name. This is accomplished using the `from ... import ...
as ...` statement:

# In[7]:

from seq_toolbox import calc_gc_percent as gc_calc

# In[8]:

gc_calc('AAAG')

# ### (A simple guide on) How modules are discovered
#
# In our case, Python finds the module because it exists in the current directory. This is not the only place Python looks, however.
#
# A complete list of paths where Python looks for modules is available via the `sys` module as `sys.path`. It is composed of (in order):
#
# 1. The directory containing the script being run (or the current directory in an interactive session).
# 2. The `PYTHONPATH` environment variable.
# 3. Installation-dependent defaults.

#
# ## Examples from the standard library
#
# > Official Python documentation: [The Python Standard Library](http://docs.python.org/2/library/index.html)
#
# Just to improve our knowledge, let's go through some of the most often used standard library modules.

# ### The standard library: `os` module
#
# > The Python Standard Library: [15.1. os — Miscellaneous operating system interfaces](http://docs.python.org/2/library/os.html)
#
# The `os` module provides a portable way of using various operating system-specific functionality. It is a large module, but some of its most frequently used parts are the file-related functions.

# In[9]:

import os

# In[10]:

os.getcwd()  # Get current directory.

# In[11]:

os.environ['PATH']  # Get the value of the environment variable PATH.

# In[12]:

my_filename = 'input.fastq'

# In[13]:

os.path.splitext(my_filename)  # Split the extension and filename.

# In[14]:

# Join the current directory and `my_filename` to create a file path.
os.path.join(os.getcwd(), my_filename)

# In[15]:

os.path.exists(my_filename)  # Check whether `my_filename` exists or not.

# In[16]:

os.path.isdir('/home')  # Check whether '/home' is a directory.

# In[17]:

os.path.isfile('/home')  # Check whether '/home' is a file.

# ### The standard library: `sys` module
#
# > The Python Standard Library: [27.1.
sys — System-specific parameters and functions](http://docs.python.org/2/library/sys.html)
#
# This module has various runtime-related and interpreter-related functions. We've seen two of the most commonly used: `sys.argv` and `sys.path`.

# In[18]:

import sys

# In[19]:

sys.path  # List of places where Python looks for modules when importing.

# In[20]:

sys.executable  # Path to the current interpreter's executable.

# In[21]:

sys.version_info  # Information about our Python version.

# In[22]:

sys.version_info.major  # It also provides more granular access.

# ### The standard library: `math` module
#
# > The Python Standard Library: [9.2. math — Mathematical functions](http://docs.python.org/2/library/math.html)
#
# Useful math-related functions can be found here. More comprehensive modules exist (such as `numpy`, the subject of tomorrow's lesson), but `math` is still useful.

# In[23]:

import math

# In[24]:

math.log(10)  # Natural log of 10.

# In[25]:

math.log(100, 10)  # Log base 10 of 100.

# In[26]:

math.pow(3, 4)  # 3 raised to the 4th power.

# In[27]:

math.sqrt(2)  # Square root of 2.

# In[28]:

math.pi  # The value of pi.

# ### The standard library: `random` module
#
# > The Python Standard Library: [9.6. random — Generate pseudo-random numbers](http://docs.python.org/2/library/random.html)
#
# The `random` module contains useful functions for generating pseudo-random numbers.

# In[29]:

import random

# In[30]:

random.random()  # Random float x, such that 0.0 <= x < 1.0.

# In[31]:

random.randint(2, 17)  # Random integer between 2 and 17, inclusive.

# In[32]:

# Random choice of one item from the given list.
random.choice(['apple', 'banana', 'grape', 'kiwi', 'orange'])

# In[33]:

# Random sampling of 3 items from the given list.
random.sample(['apple', 'banana', 'grape', 'kiwi', 'orange'], 3)

# ### The standard library: `re` module
#
# > The Python Standard Library: [7.2.
re — Regular expression operations](http://docs.python.org/2/library/re.html)
#
# Regular expression-related functions are in the `re` module.

# In[34]:

import re

# In[35]:

my_seq = 'CAGTCAGT'

# In[36]:

results1 = re.search(r'CA.+CA', my_seq)

# In[37]:

results1.group(0)

# In[38]:

results2 = re.search(r'CCC..', my_seq)

# In[39]:

print results2

# ### The standard library: `argparse` module
#
# > The Python Standard Library: [15.4. argparse — Parser for command-line options, arguments and sub-commands](http://docs.python.org/2/library/argparse.html)
#
# Using `sys.argv` is neat for small scripts, but as our script gets larger and more complex, we want to be able to handle complex arguments too. The `argparse` module offers handy functionality for building command-line scripts.

# #### Improving our script with `argparse`
#
# Open your script/module in a text editor and replace `import sys` with `import argparse`. Remove all lines and blocks referencing `sys.argv`.
#
# Change the `if __name__ == '__main__'` block to be the following:
#
# ```python
# if __name__ == '__main__':
#     # Create our argument parser object.
#     parser = argparse.ArgumentParser()
#     # Add the expected argument.
#     parser.add_argument('input_seq', type=str,
#                         help="Input sequence")
#     # Do the actual parsing.
#     args = parser.parse_args()
#     # And show the output.
#     print "The sequence '{}' has %GC of {:.2f}".format(
#         args.input_seq,
#         calc_gc_percent(args.input_seq))
# ```
# The code does look a little more verbose, but we get something better in return.
#
# Go back to the shell and execute your script without any arguments. What happens?
#
# Try executing the following command in the shell. What happens?
#
#     $ python seq_toolbox.py --help
#
# We're just getting started on `argparse`. There are other useful bits that we'll see shortly after a small intro on file I/O.

#
# ## Reading and writing files
#
# Opening files for reading or writing is done using the `open` function.
It is commonly used with two arguments, *name* and *mode*:
#
# * *name* is the name of the file to open.
# * *mode* specifies how the file should be handled.
#
# These are some of the common file modes:
#
# * `r`: open file for reading (default).
# * `w`: open file for writing.
# * `a`: open file for appending content.

# In[40]:

get_ipython().run_line_magic('pinfo', 'open')

# ### Reading files
#
# Let's go through some ways of reading from a file.

# In[41]:

fh = open('data/short_file.txt')

# `fh` is a file handle object which we can use to retrieve the file contents. One simple way would be to read the whole file contents:

# In[42]:

fh.read()

# Executing `fh.read()` a second time gives an empty string. This is because we have "walked" through the file to its end.

# In[43]:

fh.read()

# We can reset the handle to the beginning of the file again using the `seek()` method. Here, we use 0 as the argument since we want to move the handle to position 0 (beginning of the file):

# In[44]:

fh.seek(0)

# In[45]:

fh.read()

# In practice, reading the whole file into memory is not always a good idea. It is practical for small files, but not if our file is big (e.g., bigger than our memory). In this case, an alternative is to read one line at a time using the `readline()` method.

# In[46]:

fh.seek(0)

# In[47]:

fh.readline()

# In[48]:

fh.readline()

# In[49]:

fh.readline()

# More common in Python is to use the `for` loop with the file handle itself. Python will automatically iterate over each line.

# In[50]:

fh.seek(0)

# In[51]:

for line in fh:
    print line

# We can see that iteration exhausts the handle since we are at the end of the file after the loop.

# In[52]:

fh.readline()

# We can also check the file handle position using the `tell()` method. If `tell()` returns a nonzero number, then we are not at the beginning of the file.

# In[53]:

fh.tell()

# Now that we're done with the file handle, we can call the `close()` method to free up any system resources still being used to keep the file open.
After we close the file, we cannot use the file object anymore.

# In[54]:

fh.close()

# In[55]:

fh.readline()

# ### Writing files
#
# When writing files, we supply the `w` file mode explicitly:

# In[58]:

fw = open('data/my_file.txt', 'w')

# `fw` is a file handle similar to the `fh` that we've seen previously. It is used only for writing and not reading, however.

# In[59]:

fw.read()

# To write to the file, we use its `write()` method. Remember that Python *does not* add newline characters here (as opposed to when you use the `print` statement), so to move to a new line we have to add `\n` ourselves.

# In[60]:

fw.write('This is my first line ')

# In[61]:

fw.write('Still on my first line\n')

# In[62]:

fw.write('Now on my second line')

# As with the `r` mode, we can close the handle when we're done with it. The file can then be reopened with the `r` mode and we can check its contents.

# In[63]:

fw.close()

# In[64]:

fr = open('data/my_file.txt')  # Remember to use the same file we wrote to.
for line in fr:
    print line
fr.close()

# And finally, to remove the file, we can use the `remove()` function from the `os` module.

# In[65]:

os.remove('data/my_file.txt')

# ### Be cautious when using file handles
#
# When reading / writing files, we are interacting with external resources that may or may not behave as expected. For example, we don't always have permission to read / write a file, the file itself may not exist, or we have a completely wrong idea of what's in the file. In situations like these, you are encouraged to use the `try ... finally` block.
#
# The syntax is similar to the `try ... except` block that we've seen earlier (in fact they are part of the same block, as we'll see later). Unlike the `except` block, the `finally` block in `try ... finally` is always executed, whether or not an exception was raised.
#
# Let's take a look at some examples.
First, the approach that is not recommended:

# In[66]:

f = open('data/short_file.txt')
for line in f:
    print int(line)
f.close()
print 'We closed our file handle'

# Apart from our erroneous conversion of a line of text to an integer, the exception it raises prevents the `f.close()` statement from being executed. At this point we have a stale open file handle.
#
# Stubbornly trying to do the same thing again, this time we use a `finally` clause:

# In[67]:

# Open the file before the `try` block, so that `f` is guaranteed to
# exist by the time the `finally` block closes it.
f = open('data/short_file.txt')
try:
    for line in f:
        print int(line)
finally:
    f.close()
    print 'We closed our file handle'

# As you can see, this way the file handle still got closed.
#
# Now, an even better way would be to also use the `except` block, to handle the exception we might get if we try it a third time.

# In[68]:

f = open('data/short_file.txt')
try:
    for line in f:
        print int(line)
except ValueError:
    print 'Seems there was a line we could not handle'
finally:
    f.close()
    print 'We closed our file handle'

# ### Intermezzo: `sys.stdout`, `sys.stderr`, and `sys.stdin`
#
# We've seen that the `sys` module provides some useful runtime functions. Now that we know about file handles, we can use three `sys` objects that are essentially file handles: `sys.stdout`, `sys.stderr`, and `sys.stdin`.
#
# Together, they provide access to the standard output, standard error, and standard input streams. We can use them appropriately by writing to `sys.stdout` and `sys.stderr`, and reading from `sys.stdin`.
#
# Unlike regular file handles, you don't need to close them after use (in fact you should not). The assumption is that these handles are always open to write to or to read from.

# In[69]:

sys.stdout.write("I'm writing to stdout!\n")

# In[70]:

sys.stderr.write("Now to stderr.\n")

# ### Improving our script to allow input from a file
#
# Before we go on to the exercise, let's do a final improvement on our script/module.
#
# We want to add some extra functionality: the script should accept as its argument a path to a file containing sequences. It will then compute the GC percentage for each sequence in this file.
#
# There are at least two things we need to do:
#
# 1. Change the argument parser so that it deals with a new execution mode.
# 2. Add some statements to read from a file.
#
# Open the script in your text editor, and change the `if __name__ == '__main__'` block to the following:
# ```python
# if __name__ == '__main__':
#     # Create our argument parser object.
#     parser = argparse.ArgumentParser()
#     # Add argument for the input type.
#     parser.add_argument(
#         'mode', type=str, choices=['file', 'text'],
#         help='Input type of the script')
#     # Add argument for the input value.
#     parser.add_argument(
#         'value', type=str,
#         help='Input value of the script')
#     # Do the actual parsing.
#     args = parser.parse_args()
#
#     message = "The sequence '{}' has a %GC of {:.2f}"
#
#     if args.mode == 'file':
#         # Open the file before the `try` block, so that `f` always
#         # exists when the `finally` block closes it.
#         f = open(args.value, 'r')
#         try:
#             for line in f:
#                 seq = line.strip()
#                 gc = calc_gc_percent(seq)
#                 print message.format(seq, gc)
#         finally:
#             f.close()
#     else:
#         seq = args.value
#         gc = calc_gc_percent(seq)
#         print message.format(seq, gc)
# ```
# Note the things we've done here:
#
# 1. We've added a new argument to our parser to specify the input type.
# 2. Correspondingly, we've expanded the `__main__` block to handle both input types.
#
# Save the script, and try running it. What do you see? Is running
#
#     $ python seq_toolbox.py --help
#
# helpful to resolve this?
#
# Try running the script with the following command. What do you see?
#
#     $ python seq_toolbox.py file data/seq.txt
#
# Feel free to look into `data/seq.txt`.

#
# ## Assignment: Finding the most common 7-mer in a FASTA file
#
# ### Your task
#
# Write a script to print out the most common 7-mer and its GC percentage from all the sequences in `data/records.fa`. You are free to reuse your existing toolbox.
#
# > The example FASTA file was adapted from: [Genome Biology DNA60 Bioinformatics Challenge](http://genomebiology.com/about/update/DNA60_STEPONE)
#
# ### Hints
#
# 1. FASTA files have two types of lines: header lines starting with a `>` character and sequence lines. We are only concerned with the sequence lines.
# 2. Read the string functions documentation.
# 3. Read the documentation for built-in functions.
#
# ### Challenges
#
# 1. Find out how to change your script so that it can read from `data/challenge.fa.gz` without unzipping the file first (hint: standard library).
#
# 2. Can you change the parser so that there is an option flag to tell the program whether the input file is gzipped or not?
#
# 3. Can you change your script so that it works for any N-mer instead of just 7-mers?

#
# ## Further reading
#
# > Python standard library by examples: [Python Module of the Week](http://pymotw.com/2/contents.html)
#
# > [PEP8: Style Guide for Python Code](http://www.python.org/dev/peps/pep-0008/)
#
# > [PEP20: The Zen of Python](http://www.python.org/dev/peps/pep-0020/)

# Acknowledgements
# ========
#
# [Wibowo Arindrarto](mailto:w.arindrarto@lumc.nl)
#
# Martijn Vermaat
#
# [Jeroen Laros](mailto:j.f.j.laros@lumc.nl)
#
# Based on
# ---------
# [Python Scientific Lecture Notes](http://scipy-lectures.github.io/)
#
# License
# --------
# [Creative Commons Attribution 3.0 License (CC-by)](http://creativecommons.org/licenses/by/3.0)