Modules, classes and functions for manipulating biological data
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
# create a sequence, and store it in a variable
my_sequence = Seq("ATGGCCCTGTGGATGCGCCTCCTGCCCCTG", generic_dna)
print(my_sequence)
ATGGCCCTGTGGATGCGCCTCCTGCCCCTG
There are lots of built in methods that can be used to manipulate the sequence
The sequence acts like a string in many ways
# get the length of a sequence
print("length: {0}".format(len(my_sequence)))
length: 30
# slice and dice
print(my_sequence[:10])
ATGGCCCTGT
# change the case
print(my_sequence.lower())
atggccctgtggatgcgcctcctgcccctg
# concatenate the first and last 10 nucleotides
print(my_sequence[:10] + my_sequence[-10:])
ATGGCCCTGTCCTGCCCCTG
But also has more sequence-specific methods
# complement
print(my_sequence.complement())
TACCGGGACACCTACGCGGAGGACGGGGAC
# reverse complement
print(my_sequence.reverse_complement())
CAGGGGCAGGAGGCGCATCCACAGGGCCAT
# transcribe from DNA to RNA
rna = my_sequence.transcribe()
print(rna)
AUGGCCCUGUGGAUGCGCCUCCUGCCCCUG
# Translate from nucleotide to protein
protein = my_sequence.translate()
print(protein)
MALWMRLLPL
from Bio import SeqIO
# read the first sequence
# returns SeqRecord objects
for record in SeqIO.parse("../data/records.fa", "fasta"):
dna = record
break
print(dna)
ID: 1 Name: 1 Description: 1 Number of features: 0 Seq('TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTC...TCT', SingleLetterAlphabet())
Each record is an object with several fields, including:
record.id
record.name
record.description
The actual sequence is a separate object contained within the record which can be accessed using record.seq
The sequence has an 'alphabet' associated with it which defines which letters are allowed.
Different alphabets are used for DNA, RNA, protein etc.
print(dna.seq)
TGGAACATGTCCCGCTAGCTTCTTCTTGCTAGCAGATTTTTTCAGTTGATCGTCACATGCGGTAGACTACCCAAGGTGTGACTACTCGCATGCCTGATCT
# we can then do our sequence manipulations on the `.seq` attribute of the record
print(dna.seq.reverse_complement())
AGATCAGGCATGCGAGTAGTCACACCTTGGGTAGTCTACCGCATGTGACGATCAACTGAAAAAATCTGCTAGCAAGAAGAAGCTAGCGGGACATGTTCCA
Sequence records can easily be written to a file.
Specifying the file type allows conversion between different formats.
For example, to convert from a fastq file to fasta format:
records = SeqIO.parse("../data/easy.fastq", "fastq")
SeqIO.write(records, "tmp.fasta", "fasta")
1
It is possible align sequences using biopython with various methods.
Some of these depend on external tools (e.g. clustalw
), but simple pair-wise alignment is supported out of the box.
from Bio.pairwise2 import align, format_alignment
# load fasta with insulin for several species as a handle
ins_handle = SeqIO.parse("../data/ins.fa", "fasta")
# make a list of records
ins_records = []
for item in ins_handle:
ins_records.append(item)
# extract a little of the sequence for human and chimp
human_ins_bit = ins_records[0][-45:]
chimp_ins_bit = ins_records[1][-45:]
# get the alignments with smith-waterman, without gap penalties
alignments = align.localxx(human_ins_bit.seq, chimp_ins_bit.seq)
# print the best alignment
best = alignments[0]
print(format_alignment(*best))
CTCCTGC-A---C--C-G-AGAG--AGATGGAATAAAGCCCTTGAACCAGCAAAA ||| | | | | | | |||||||||||||||||||||||||| ---CTG-GAGAACTACTGCA-A-CTAGATGGAATAAAGCCCTTGAACCAGC---- Score=35
# get alignments with specified scores and penalties:
# 2 for match, 4 for mismatch,
#-2 for gap open, -0.5 for gap extend
gap_alignments = align.localms(human_ins_bit.seq,
chimp_ins_bit.seq,
2, -4, -2, -0.5)
print(format_alignment(*gap_alignments[0]))
CTCCTGCACCGAGA------G-----AGATGGAATAAAGCCCTTGAACCAGCAAAA ||| |||| | |||||||||||||||||||||||||| ---CTG----GAGAACTACTGCAACTAGATGGAATAAAGCCCTTGAACCAGC---- Score=56
We can also get a consensus sequence given multiple sequences, and visualize the result.
motif_handle = SeqIO.parse("../data/motif.fa", "fasta")
motif_records = []
for item in motif_handle:
motif_records.append(item.seq[:9].upper())
from Bio import motifs
# create a motif object
motif = motifs.create(motif_records)
print(motif.counts)
0 1 2 3 4 5 6 7 8 A: 30.00 0.00 17.00 27.00 0.00 25.00 26.00 0.00 19.00 C: 52.00 50.00 0.00 57.00 50.00 0.00 58.00 50.00 0.00 G: 0.00 50.00 61.00 0.00 50.00 58.00 0.00 50.00 53.00 T: 18.00 0.00 22.00 16.00 0.00 17.00 16.00 0.00 28.00
print(motif.consensus)
CGGCGGCGG
motif.weblogo("tmp.svg", format="SVG")
NCBI allow for remote querying of their Entrez database, and Biopython allows us to use their services from within python.
We can use the Entrez.efetch utility to retrieve various records from one of NCBI's databases.
A full list of these services and their documentation can be found on the Entrez utilities help page
from Bio import Entrez
IMPORTANT:
To monitor potential excessive use of their services, NCBI requests you to specify your email address with each request.
With Biopython, you can set it once for your session like this:
Entrez.email = 'python@lumc.nl'
Now we can make a query of the database.
The Entrez.efetch function returns a file-like handle that instead of pointing to a local file, points to a remote resource.
efetch_handle = Entrez.efetch(db="nucleotide", id="NM_005804",
rettype="gb", retmode="text")
We can use the handle as if it were a normal file handle opened with open("filename", "r")
, and read from it using SeqIO.read()
ncbi_record = SeqIO.read(efetch_handle, 'genbank')
print(ncbi_record)
ID: NM_005804.3 Name: NM_005804 Description: Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA Number of features: 25 /molecule_type=mRNA /topology=linear /data_file_division=PRI /date=20-OCT-2018 /accessions=['NM_005804'] /sequence_version=3 /keywords=['RefSeq'] /source=Homo sapiens (human) /organism=Homo sapiens /taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo'] /references=[Reference(title='The RNA helicase DDX39B and its paralog DDX39A regulate androgen receptor splice variant AR-V7 generation', ...), Reference(title='Identification of DDX39A as a Potential Biomarker for Unfavorable Neuroblastoma Using a Proteomic Approach', ...), Reference(title='Up-regulation of DDX39 in human malignant pleural mesothelioma cell lines compared to normal pleural mesothelial cells', ...), Reference(title='The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts', ...), Reference(title='Clinical proteomics identified ATP-dependent RNA helicase DDX39 as a novel biomarker to predict poor prognosis of patients with gastrointestinal stromal tumor', ...), Reference(title='The closely related RNA helicases, UAP56 and URH49, preferentially form distinct mRNA export machineries and coordinately regulate mitotic progression', ...), Reference(title='Hcc-1 is a novel component of the nuclear matrix with growth inhibitory function', ...), Reference(title='Growth-regulated expression and G0-specific turnover of the mRNA that encodes URH49, a mammalian DExH/D box protein that is highly related to the mRNA export protein UAP56', ...), Reference(title='Analysis of a high-throughput yeast two-hybrid system and its use to predict the function of intracellular proteins encoded within the human MHC class III region', ...), Reference(title='The BAT1 gene in the MHC encodes an evolutionarily conserved putative nuclear RNA helicase of the DEAD family', ...)] /comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from DA432925.1, BC001009.2 and BM792110.1. This sequence is a reference standard in the RefSeqGene project. On Oct 14, 2010 this sequence version replaced NM_005804.2. Summary: This gene encodes a member of the DEAD box protein family. These proteins are characterized by the conserved motif Asp-Glu-Ala-Asp (DEAD) and are putative RNA helicases. They are implicated in a number of cellular processes involving alteration of RNA secondary structure, such as translation initiation, nuclear and mitochondrial splicing, and ribosome and spliceosome assembly. Based on their distribution patterns, some members of the DEAD box protein family are believed to be involved in embryogenesis, spermatogenesis, and cellular growth and division. This gene is thought to play a role in the prognosis of patients with gastrointestinal stromal tumors. A pseudogene of this gene is present on chromosome 13. Alternate splicing results in multiple transcript variants. Additional alternatively spliced transcript variants of this gene have been described, but their full-length nature is not known. [provided by RefSeq, Sep 2013]. Transcript Variant: This variant (1) represents the longer transcript. Publication Note: This RefSeq record includes a subset of the publications that are available for this gene. Please see the Gene record to access additional publications. SRR1163655.274234.1 [ECO:0000332] SAMEA1965299, SAMEA1966682 [ECO:0000350] COMPLETENESS: complete on the 3' end. /structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'SRR1163655.176131.1,'), ('RNAseq introns', 'mixed/partial sample support')]))]) Seq('AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCC...AAA', IUPACAmbiguousDNA())
It is also possible to query for multiple records
efetch_handle = Entrez.efetch(db="nucleotide", id=["NM_005804","NM_000967"],
rettype="gb", retmode="text")
Which can then be iterated over using SeqIO.parse
for record in SeqIO.parse(efetch_handle, 'genbank'):
print(record.id, record.description)
NM_005804.3 Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA NM_000967.3 Homo sapiens ribosomal protein L3 (RPL3), transcript variant 1, mRNA
It is possible to use Biopyton with remote tools.
For example, we can submit a BLAST search to the NCBI service. (Documentation here)
We will use qblast function in the Bio.Blast.NCBIWWW module to perform a BLAST search using the record we retrieved earlier.
NOTE: It can take some time for the search results to become available
from Bio.Blast.NCBIWWW import qblast
blast_handle = qblast('blastn', 'nt', ncbi_record.seq)
print(ncbi_record.seq)
AGCAGCAGCCCGACGCAAGAGGCAGGAAGCGCAGCAACTCGTGTCTGAGCGCCCGGCGGAAAACCGAAGTTGGAAGTGTCTCTTAGCAGCGCGCGGAGAAGAACGGGGAGCCAGCATCATGGCAGAACAGGATGTGGAAAACGATCTTTTGGATTACGATGAAGAGGAAGAGCCCCAGGCTCCTCAAGAGAGCACACCAGCTCCCCCTAAGAAAGACATCAAGGGATCCTACGTTTCCATCCACAGCTCTGGCTTCCGGGACTTTCTGCTGAAGCCGGAGCTCCTGCGGGCCATCGTGGACTGTGGCTTTGAGCATCCTTCTGAGGTCCAGCATGAGTGCATTCCCCAGGCCATCCTGGGCATGGACGTCCTGTGCCAGGCCAAGTCCGGGATGGGCAAGACAGCGGTCTTCGTGCTGGCCACCCTACAGCAGATTGAGCCTGTCAACGGACAGGTGACGGTCCTGGTCATGTGCCACACGAGGGAGCTGGCCTTCCAGATCAGCAAGGAATATGAGCGCTTTTCCAAGTACATGCCCAGCGTCAAGGTGTCTGTGTTCTTCGGTGGTCTCTCCATCAAGAAGGATGAAGAAGTGTTGAAGAAGAACTGTCCCCATGTCGTGGTGGGGACCCCGGGCCGCATCCTGGCGCTCGTGCGGAATAGGAGCTTCAGCCTAAAGAATGTGAAGCACTTTGTGCTGGACGAGTGTGACAAGATGCTGGAGCAGCTGGACATGCGGCGGGATGTGCAGGAGATCTTCCGCCTGACACCACACGAGAAGCAGTGCATGATGTTCAGCGCCACCCTGAGCAAGGACATCCGGCCTGTGTGCAGGAAGTTCATGCAGGATCCCATGGAGGTGTTTGTGGACGACGAGACCAAGCTCACGCTGCACGGCCTGCAGCAGTACTACGTCAAACTCAAAGACAGTGAGAAGAACCGCAAGCTCTTTGATCTCTTGGATGTGCTGGAGTTTAACCAGGTGATAATCTTCGTCAAGTCAGTGCAGCGCTGCATGGCCCTGGCCCAGCTCCTCGTGGAGCAGAACTTCCCGGCCATCGCCATCCACCGGGGCATGGCCCAGGAGGAGCGCCTGTCACGCTATCAGCAGTTCAAGGATTTCCAGCGGCGGATCCTGGTGGCCACCAATCTGTTTGGCCGGGGGATGGACATCGAGCGAGTCAACATCGTCTTTAACTACGACATGCCTGAGGACTCGGACACCTACCTGCACCGGGTGGCCCGGGCGGGTCGCTTTGGCACCAAAGGCCTAGCCATCACTTTTGTGTCTGACGAGAATGATGCCAAAATCCTCAATGACGTCCAGGACCGGTTTGAAGTTAATGTGGCAGAACTTCCAGAGGAAATCGACATCTCCACATACATCGAGCAGAGCCGGTAACCACCACGTGCCAGAGCCGCCCACCCGGAGCCGCCCGCATGCAGCTTCACCTCCCCTTTCCAGGCGCCACTGTTGAGAAGCTAGAGATTGTATGAGAATAAACTTGTTATTATGGAAGCCTGGCTCCCACCCCATCTAAAAAAAAAAAAAAAAAAA
We can the read from the file handle using the Bio.SearchIO
module.
from Bio import SearchIO
qresult = SearchIO.read(blast_handle, 'blast-xml')
print(qresult)
Program: blastn (2.8.1+) Query: No (1558) definition line Target: nt Hits: ---- ----- ---------------------------------------------------------- # # HSP ID + description ---- ----- ---------------------------------------------------------- 0 1 gi|308522777|ref|NM_005804.3| Homo sapiens DExD-box he... 1 1 gi|1367219251|ref|XM_016935303.2| PREDICTED: Pan trogl... 2 1 gi|675689963|ref|XM_003807080.2| PREDICTED: Pan panisc... 3 1 gi|1099186172|ref|XM_004060164.2| PREDICTED: Gorilla g... 4 1 gi|1351474314|ref|XM_002828787.4| PREDICTED: Pongo abe... 5 1 gi|1905997|gb|U90426.1|HSU90426 Human nuclear RNA heli... 6 1 gi|33875869|gb|BC001009.2| Homo sapiens DEAD (Asp-Glu-... 7 1 gi|10439504|dbj|AK026614.1| Homo sapiens cDNA: FLJ2296... 8 1 gi|795239725|ref|XM_011953207.1| PREDICTED: Colobus an... 9 1 gi|1411128774|ref|XM_025367317.1| PREDICTED: Theropith... 10 1 gi|1059109912|ref|XM_017851234.1| PREDICTED: Rhinopith... 11 1 gi|724815869|ref|XM_010361572.1| PREDICTED: Rhinopithe... 12 1 gi|1220191829|ref|XM_009193704.2| PREDICTED: Papio anu... 13 1 gi|982311930|ref|XM_005588244.2| PREDICTED: Macaca fas... 14 1 gi|1297694799|ref|XM_023229571.1| PREDICTED: Piliocolo... 15 1 gi|967496221|ref|XM_015123082.1| PREDICTED: Macaca mul... 16 1 gi|768000518|ref|XM_011527620.1| PREDICTED: Homo sapie... 17 1 gi|635036575|ref|XM_007995524.1| PREDICTED: Chlorocebu... 18 1 gi|795271240|ref|XM_011768647.1| PREDICTED: Macaca nem... 19 1 gi|795144436|ref|XM_011981779.1| PREDICTED: Mandrillus... 20 1 gi|194377853|dbj|AK301847.1| Homo sapiens cDNA FLJ5548... 21 1 gi|795433285|ref|XM_012094155.1| PREDICTED: Cercocebus... 22 1 gi|1367219254|ref|XM_016935304.2| PREDICTED: Pan trogl... 23 1 gi|795433280|ref|XM_012094154.1| PREDICTED: Cercocebus... 24 1 gi|1297694797|ref|XM_023229570.1| PREDICTED: Piliocolo... 25 1 gi|795239720|ref|XM_011953206.1| PREDICTED: Colobus an... 26 1 gi|1059109914|ref|XM_017851235.1| PREDICTED: Rhinopith... 27 1 gi|1220191830|ref|XM_021930788.1| PREDICTED: Papio anu... 28 1 gi|685606530|ref|XM_009193706.1| PREDICTED: Papio anub... 29 1 gi|1220191832|ref|XM_017952263.2| PREDICTED: Papio anu... ~~~ 47 1 gi|1044402864|ref|XM_017497619.1| PREDICTED: Cebus cap... 48 1 gi|1044402866|ref|XM_017497620.1| PREDICTED: Cebus cap... 49 1 gi|1044402868|ref|XM_017497621.1| PREDICTED: Cebus cap...
print(qresult[0])
Query: No definition line Hit: gi|308522777|ref|NM_005804.3| (1558) Homo sapiens DExD-box helicase 39A (DDX39A), transcript variant 1, mRNA HSPs: ---- -------- --------- ------ --------------- --------------------- # E-value Bit score Span Query range Hit range ---- -------- --------- ------ --------------- --------------------- 0 0 2810.93 1558 [0:1558] [0:1558]
print(qresult[1])
Query: No definition line Hit: gi|1367219251|ref|XM_016935303.2| (1530) PREDICTED: Pan troglodytes DExD-box helicase 39A (DDX39A), transcript... HSPs: ---- -------- --------- ------ --------------- --------------------- # E-value Bit score Span Query range Hit range ---- -------- --------- ------ --------------- --------------------- 0 0 2695.52 1519 [0:1519] [11:1530]
This lesson was just a small taste of what can be done with Biopython.
I strongly recommend looking at the Biopython 'cookbook' to get an idea of the wide range of things that you can do with it.
The lesson was based on previous material by Guy Allard, Wibowo Arindrarto and Martijn Vermaat.