Notebook

Biopython¶

biopython logo

A quick overview¶

Sander Bollen ¶

What is Biopython?¶

'Python Tools for Computational Molecular Biology'¶

Fully open-source
Actively developed
Large community

What can it do?¶

Modules, classes and functions for manipulating biological data

File parsers and writers.
- Sequence files: fasta, fastq, genbank, abi, sff, etc.
- Alignment files: clustal, emboss, phylip, nexus, etc.
- Sequence search outputs: BLAST, HMMER, BLAT, etc.
- Phylogenetic trees: newick, nexus, phyloxml, etc.
- Sequence motifs: AlignAce, TRANSFAC, etc.
- Others: PDB files, etc.
Access to remote resources (e.g., Entrez, NCBI BLAST).
Application wrappers.
A simple graphing tool.
Simple algorithms (e.g., pairwise alignment, cluster analysis).
References such as codon tables and IUPAC sequences.

Where can I find more information?¶

Biopython Homepage
Biopython development repository
Biopython mailing list
Biopython 'cookbook' (essential reading!)

Manipulating sequence data¶

Seq and SeqRecord objects¶

Seq and SeqRecord objects are the basis of all sequence manipulation in Biopython.

Seq is a raw sequence with an alphabet (e.g. DNA or RNA).
SeqRecord is a sequence with metadata (e.g. names, ids, etc). This contains a Seq object.

In [1]:

from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

# create a sequence, and store it in a variable

my_sequence = Seq("ATGGCCCTGTGGATGCGCCTCCTGCCCCTG", generic_dna)
print(my_sequence)

ATGGCCCTGTGGATGCGCCTCCTGCCCCTG

There are lots of built in methods that can be used to manipulate the sequence

The sequence acts like a string in many ways

In [2]:

# get the length of a sequence
print("length: {0}".format(len(my_sequence)))

length: 30

In [3]:

# slice and dice
print(my_sequence[:10])

ATGGCCCTGT

In [4]:

# change the case
print(my_sequence.lower())

atggccctgtggatgcgcctcctgcccctg

In [5]:

# concatenate the first and last 10 nucleotides
print(my_sequence[:10] + my_sequence[-10:])

ATGGCCCTGTCCTGCCCCTG

But also has more sequence-specific methods

In [6]:

# complement
print(my_sequence.complement())

TACCGGGACACCTACGCGGAGGACGGGGAC

In [7]:

# reverse complement
print(my_sequence.reverse_complement())

CAGGGGCAGGAGGCGCATCCACAGGGCCAT

In [8]:

# transcribe from DNA to RNA
rna = my_sequence.transcribe()
print(rna)

AUGGCCCUGUGGAUGCGCCUCCUGCCCCUG

In [9]:

# Translate from nucleotide to protein
protein = my_sequence.translate()
print(protein)

MALWMRLLPL

Manipulating Sequence Data¶

Bio.SeqIO¶

Input and output of sequence files.

SeqIO.read
- Read a file containing a single sequence
SeqIO.parse
- Iterate over all sequences in a sequence file
SeqIO.write
- write sequences to a file

In [10]:

from Bio import SeqIO

# read the first sequence
# returns SeqRecord objects
for record in SeqIO.parse("../data/records.fa", "fasta"):
    dna = record
    break

print(dna)

ID: 1
Name: 1
Description: 1
Number of features: 0
Seq('TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTC...TCT', SingleLetterAlphabet())

Each record is an object with several fields, including:

record.id
- the sequence id
record.name
- sequence name, usually the same as the id
record.description
- sequence description

The actual sequence is a separate object contained within the record which can be accessed using record.seq

The sequence has an 'alphabet' associated with it which defines which letters are allowed.

Different alphabets are used for DNA, RNA, protein etc.

In [11]:

print(dna.seq)

TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTCACATGCGGTAGACTACCCAAGGTGTGACTACTCGCATGCCTGATCT

In [12]:

# we can then do our sequence manipulations on the `.seq` attribute of the record

print(dna.seq.reverse_complement())

AGATCAGGCATGCGAGTAGTCACACCTTGGGTAGTCTACCGCATGTGACGATCAACTGAAAAAATCTGCTAGCAAGAAGAAGCTAGCGGGACATGTTCCA

Sequence records can easily be written to a file.

Specifying the file type allows conversion between different formats.

For example, to convert from a fastq file to fasta format:

In [13]:

records = SeqIO.parse("../data/easy.fastq", "fastq")
SeqIO.write(records, "tmp.fasta", "fasta")

Out[13]:

Sequence alignment¶

It is possible align sequences using biopython with various methods. Some of these depend on external tools (e.g. clustalw), but simple pair-wise alignment is supported out of the box.

In [14]:

from Bio.pairwise2 import align, format_alignment

# load fasta with insulin for several species as a handle
ins_handle = SeqIO.parse("../data/ins.fa", "fasta")

In [15]:

# make a list of records 
ins_records = []
for item in ins_handle:
    ins_records.append(item)

In [16]:

# extract a little of the sequence for human and chimp

human_ins_bit = ins_records[0][-45:]
chimp_ins_bit = ins_records[1][-45:]

In [17]:

# get the alignments with smith-waterman, without gap penalties
alignments = align.localxx(human_ins_bit.seq, chimp_ins_bit.seq)


# print the best alignment 
best = alignments[0]
print(format_alignment(*best))

CTCCTGC-A---C--C-G-AGAG--AGATGGAATAAAGCCCTTGAACCAGCAAAA
   |||  |   |  | | | |   ||||||||||||||||||||||||||
---CTG-GAGAACTACTGCA-A-CTAGATGGAATAAAGCCCTTGAACCAGC----
  Score=35

In [18]:

# get alignments with specified scores and penalties:
# 2 for match, 4 for mismatch, 
#-2 for gap open, -0.5 for gap extend
gap_alignments = align.localms(human_ins_bit.seq, 
                               chimp_ins_bit.seq, 
                               2, -4, -2, -0.5)

print(format_alignment(*gap_alignments[0]))

CTCCTGCACCGAGA------G-----AGATGGAATAAAGCCCTTGAACCAGCAAAA
   |||    ||||      |     ||||||||||||||||||||||||||
---CTG----GAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC----
  Score=56

Motis¶

We can also get a consensus sequence given multiple sequences, and visualize the result.

In [19]:

motif_handle = SeqIO.parse("../data/motif.fa", "fasta")
motif_records = []
for item in motif_handle:
    motif_records.append(item.seq[:9].upper())

In [20]:

from Bio import motifs

# create a motif object
motif = motifs.create(motif_records)

In [21]:

print(motif.counts)

        0      1      2      3      4      5      6      7      8
A:  30.00   0.00  17.00  27.00   0.00  25.00  26.00   0.00  19.00
C:  52.00  50.00   0.00  57.00  50.00   0.00  58.00  50.00   0.00
G:   0.00  50.00  61.00   0.00  50.00  58.00   0.00  50.00  53.00
T:  18.00   0.00  22.00  16.00   0.00  17.00  16.00   0.00  28.00

In [22]:

print(motif.consensus)

CGGCGGCGG

In [23]:

motif.weblogo("tmp.svg", format="SVG")

Remote files¶

NCBI allow for remote querying of their Entrez database, and Biopython allows us to use their services from within python.

We can use the Entrez.efetch utility to retrieve various records from one of NCBI's databases.

A full list of these services and their documentation can be found on the Entrez utilities help page

In [24]:

from Bio import Entrez

IMPORTANT:

To monitor potential excessive use of their services, NCBI requests you to specify your email address with each request.

With Biopython, you can set it once for your session like this:

In [25]:

Entrez.email = 'python@lumc.nl'

Now we can make a query of the database.

The Entrez.efetch function returns a file-like handle that instead of pointing to a local file, points to a remote resource.

In [26]:

efetch_handle = Entrez.efetch(db="nucleotide", id="NM_005804",
                              rettype="gb", retmode="text")

We can use the handle as if it were a normal file handle opened with open("filename", "r"), and read from it using SeqIO.read()

In [27]:

ncbi_record = SeqIO.read(efetch_handle, 'genbank')

print(ncbi_record)

ID: NM_005804.3
Name: NM_005804
Description: Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
Number of features: 25
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=20-OCT-2018
/accessions=['NM_005804']
/sequence_version=3
/keywords=['RefSeq']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='The RNA helicase DDX39B and its paralog DDX39A regulate androgen receptor splice variant AR-V7 generation', ...), Reference(title='Identification of DDX39A as a Potential Biomarker for Unfavorable Neuroblastoma Using a Proteomic Approach', ...), Reference(title='Up-regulation of DDX39 in human malignant pleural mesothelioma cell lines compared to normal pleural mesothelial cells', ...), Reference(title='The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts', ...), Reference(title='Clinical proteomics identified ATP-dependent RNA helicase DDX39 as a novel biomarker to predict poor prognosis of patients with gastrointestinal stromal tumor', ...), Reference(title='The closely related RNA helicases, UAP56 and URH49, preferentially form distinct mRNA export machineries and coordinately regulate mitotic progression', ...), Reference(title='Hcc-1 is a novel component of the nuclear matrix with growth inhibitory function', ...), Reference(title='Growth-regulated expression and G0-specific turnover of the mRNA that encodes URH49, a mammalian DExH/D box protein that is highly related to the mRNA export protein UAP56', ...), Reference(title='Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region', ...), Reference(title='The BAT1 gene in the MHC encodes an evolutionarily conserved putative nuclear RNA helicase of the DEAD family', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from DA432925.1, BC001009.2 and
BM792110.1.
This sequence is a reference standard in the RefSeqGene project.
On Oct 14, 2010 this sequence version replaced NM_005804.2.
Summary: This gene encodes a member of the DEAD box protein family.
These proteins are characterized by the conserved motif
Asp-Glu-Ala-Asp (DEAD) and are putative RNA helicases. They are
implicated in a number of cellular processes involving alteration
of RNA secondary structure, such as translation initiation, nuclear
and mitochondrial splicing, and ribosome and spliceosome assembly.
Based on their distribution patterns, some members of the DEAD box
protein family are believed to be involved in embryogenesis,
spermatogenesis, and cellular growth and division. This gene is
thought to play a role in the prognosis of patients with
gastrointestinal stromal tumors. A pseudogene of this gene is
present on chromosome 13. Alternate splicing results in multiple
transcript variants. Additional alternatively spliced transcript
variants of this gene have been described, but their full-length
nature is not known. [provided by RefSeq, Sep 2013].
Transcript Variant: This variant (1) represents the longer
transcript.
Publication Note:  This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Gene
record to access additional publications.
                               SRR1163655.274234.1 [ECO:0000332]
                               SAMEA1965299, SAMEA1966682
                               [ECO:0000350]
COMPLETENESS: complete on the 3' end.
/structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'SRR1163655.176131.1,'), ('RNAseq introns', 'mixed/partial sample support')]))])
Seq('AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCC...AAA', IUPACAmbiguousDNA())

It is also possible to query for multiple records

In [28]:

efetch_handle = Entrez.efetch(db="nucleotide", id=["NM_005804","NM_000967"],
                              rettype="gb", retmode="text")

Which can then be iterated over using SeqIO.parse

In [29]:

for record in SeqIO.parse(efetch_handle, 'genbank'):
    print(record.id, record.description)

NM_005804.3 Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
NM_000967.3 Homo sapiens ribosomal protein L3 (RPL3), transcript variant 1, mRNA

Remote Tools¶

It is possible to use Biopyton with remote tools.

For example, we can submit a BLAST search to the NCBI service. (Documentation here)

We will use qblast function in the Bio.Blast.NCBIWWW module to perform a BLAST search using the record we retrieved earlier.

NOTE: It can take some time for the search results to become available

In [30]:

from Bio.Blast.NCBIWWW import qblast
blast_handle = qblast('blastn', 'nt', ncbi_record.seq)

print(ncbi_record.seq)

AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCCGGCGGAAAACCGAAGTTGGAAGTGTCTCTTAGCAGCGCGCGGAGAAGAACGGGGAGCCAGCATCATGGCAGAACAGGATGTGGAAAACGATCTTTTGGATTACGATGAAGAGGAAGAGCCCCAGGCTCCTCAAGAGAGCACACCAGCTCCCCCTAAGAAAGACATCAAGGGATCCTACGTTTCCATCCACAGCTCTGGCTTCCGGGACTTTCTGCTGAAGCCGGAGCTCCTGCGGGCCATCGTGGACTGTGGCTTTGAGCATCCTTCTGAGGTCCAGCATGAGTGCATTCCCCAGGCCATCCTGGGCATGGACGTCCTGTGCCAGGCCAAGTCCGGGATGGGCAAGACAGCGGTCTTCGTGCTGGCCACCCTACAGCAGATTGAGCCTGTCAACGGACAGGTGACGGTCCTGGTCATGTGCCACACGAGGGAGCTGGCCTTCCAGATCAGCAAGGAATATGAGCGCTTTTCCAAGTACATGCCCAGCGTCAAGGTGTCTGTGTTCTTCGGTGGTCTCTCCATCAAGAAGGATGAAGAAGTGTTGAAGAAGAACTGTCCCCATGTCGTGGTGGGGACCCCGGGCCGCATCCTGGCGCTCGTGCGGAATAGGAGCTTCAGCCTAAAGAATGTGAAGCACTTTGTGCTGGACGAGTGTGACAAGATGCTGGAGCAGCTGGACATGCGGCGGGATGTGCAGGAGATCTTCCGCCTGACACCACACGAGAAGCAGTGCATGATGTTCAGCGCCACCCTGAGCAAGGACATCCGGCCTGTGTGCAGGAAGTTCATGCAGGATCCCATGGAGGTGTTTGTGGACGACGAGACCAAGCTCACGCTGCACGGCCTGCAGCAGTACTACGTCAAACTCAAAGACAGTGAGAAGAACCGCAAGCTCTTTGATCTCTTGGATGTGCTGGAGTTTAACCAGGTGATAATCTTCGTCAAGTCAGTGCAGCGCTGCATGGCCCTGGCCCAGCTCCTCGTGGAGCAGAACTTCCCGGCCATCGCCATCCACCGGGGCATGGCCCAGGAGGAGCGCCTGTCACGCTATCAGCAGTTCAAGGATTTCCAGCGGCGGATCCTGGTGGCCACCAATCTGTTTGGCCGGGGGATGGACATCGAGCGAGTCAACATCGTCTTTAACTACGACATGCCTGAGGACTCGGACACCTACCTGCACCGGGTGGCCCGGGCGGGTCGCTTTGGCACCAAAGGCCTAGCCATCACTTTTGTGTCTGACGAGAATGATGCCAAAATCCTCAATGACGTCCAGGACCGGTTTGAAGTTAATGTGGCAGAACTTCCAGAGGAAATCGACATCTCCACATACATCGAGCAGAGCCGGTAACCACCACGTGCCAGAGCCGCCCACCCGGAGCCGCCCGCATGCAGCTTCACCTCCCCTTTCCAGGCGCCACTGTTGAGAAGCTAGAGATTGTATGAGAATAAACTTGTTATTATGGAAGCCTGGCTCCCACCCCATCTAAAAAAAAAAAAAAAAAAA

We can the read from the file handle using the Bio.SearchIO module.

In [31]:

from Bio import SearchIO
qresult = SearchIO.read(blast_handle, 'blast-xml')
print(qresult)

Program: blastn (2.8.1+)
  Query: No (1558)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|308522777|ref|NM_005804.3|  Homo sapiens DExD-box he...
            1      1  gi|1367219251|ref|XM_016935303.2|  PREDICTED: Pan trogl...
            2      1  gi|675689963|ref|XM_003807080.2|  PREDICTED: Pan panisc...
            3      1  gi|1099186172|ref|XM_004060164.2|  PREDICTED: Gorilla g...
            4      1  gi|1351474314|ref|XM_002828787.4|  PREDICTED: Pongo abe...
            5      1  gi|1905997|gb|U90426.1|HSU90426  Human nuclear RNA heli...
            6      1  gi|33875869|gb|BC001009.2|  Homo sapiens DEAD (Asp-Glu-...
            7      1  gi|10439504|dbj|AK026614.1|  Homo sapiens cDNA: FLJ2296...
            8      1  gi|795239725|ref|XM_011953207.1|  PREDICTED: Colobus an...
            9      1  gi|1411128774|ref|XM_025367317.1|  PREDICTED: Theropith...
           10      1  gi|1059109912|ref|XM_017851234.1|  PREDICTED: Rhinopith...
           11      1  gi|724815869|ref|XM_010361572.1|  PREDICTED: Rhinopithe...
           12      1  gi|1220191829|ref|XM_009193704.2|  PREDICTED: Papio anu...
           13      1  gi|982311930|ref|XM_005588244.2|  PREDICTED: Macaca fas...
           14      1  gi|1297694799|ref|XM_023229571.1|  PREDICTED: Piliocolo...
           15      1  gi|967496221|ref|XM_015123082.1|  PREDICTED: Macaca mul...
           16      1  gi|768000518|ref|XM_011527620.1|  PREDICTED: Homo sapie...
           17      1  gi|635036575|ref|XM_007995524.1|  PREDICTED: Chlorocebu...
           18      1  gi|795271240|ref|XM_011768647.1|  PREDICTED: Macaca nem...
           19      1  gi|795144436|ref|XM_011981779.1|  PREDICTED: Mandrillus...
           20      1  gi|194377853|dbj|AK301847.1|  Homo sapiens cDNA FLJ5548...
           21      1  gi|795433285|ref|XM_012094155.1|  PREDICTED: Cercocebus...
           22      1  gi|1367219254|ref|XM_016935304.2|  PREDICTED: Pan trogl...
           23      1  gi|795433280|ref|XM_012094154.1|  PREDICTED: Cercocebus...
           24      1  gi|1297694797|ref|XM_023229570.1|  PREDICTED: Piliocolo...
           25      1  gi|795239720|ref|XM_011953206.1|  PREDICTED: Colobus an...
           26      1  gi|1059109914|ref|XM_017851235.1|  PREDICTED: Rhinopith...
           27      1  gi|1220191830|ref|XM_021930788.1|  PREDICTED: Papio anu...
           28      1  gi|685606530|ref|XM_009193706.1|  PREDICTED: Papio anub...
           29      1  gi|1220191832|ref|XM_017952263.2|  PREDICTED: Papio anu...
           ~~~
           47      1  gi|1044402864|ref|XM_017497619.1|  PREDICTED: Cebus cap...
           48      1  gi|1044402866|ref|XM_017497620.1|  PREDICTED: Cebus cap...
           49      1  gi|1044402868|ref|XM_017497621.1|  PREDICTED: Cebus cap...

In [32]:

print(qresult[0])

Query: No
       definition line
  Hit: gi|308522777|ref|NM_005804.3| (1558)
       Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    2810.93    1558         [0:1558]               [0:1558]

In [33]:

print(qresult[1])

Query: No
       definition line
  Hit: gi|1367219251|ref|XM_016935303.2| (1530)
       PREDICTED: Pan troglodytes DExD-box helicase 39A (DDX39A), transcript...
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         0    2695.52    1519         [0:1519]              [11:1530]

That was just an overview¶

This lesson was just a small taste of what can be done with Biopython.

I strongly recommend looking at the Biopython 'cookbook' to get an idea of the wide range of things that you can do with it.

The lesson was based on previous material by Guy Allard, Wibowo Arindrarto and Martijn Vermaat.

License: Creative Commons Attribution 3.0 License (CC-by)

In [ ]: