Source of the materials: Biopython cookbook (Adapted)
New status: DraftRunning BLAST over the Internet
Parsing plain-text BLAST output
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world? But, of course, this section isn’t about how cool BLAST is, since we already know that. It is about the problem with BLAST – it can be really difficult to deal with the volume of data generated by large runs, and to automate BLAST runs in general.
Fortunately, the Biopython folks know this only too well, so they’ve developed lots of tools for dealing with BLAST and making things much easier. This section details how to use these tools and do useful things with them.
Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython. Firstly, running BLAST for your query sequence(s), and getting some output. Secondly, parsing the BLAST output in Python for further analysis.
Your first introduction to running BLAST was probably via the NCBI web-service. In fact, there are lots of ways you can run BLAST, which can be categorised in several ways. The most important distinction is running BLAST locally (on your own machine), and running BLAST remotely (on another machine, typically the NCBI servers). We’re going to start this chapter by invoking the NCBI online BLAST service from within a Python script.
NOTE: The following Chapter [chapter:searchio] describes
Bio.SearchIO
, an experimental module in Biopython. We intend this to
ultimately replace the older Bio.Blast
module, as it provides a more
general framework handling other related sequence searching tools as
well. However, until that is declared stable, for production code please
continue to use the Bio.Blast
module for dealing with NCBI BLAST.
We use the function qblast()
in the Bio.Blast.NCBIWWW
module to call
the online version of BLAST. This has three non-optional arguments:
The first argument is the blast program to use for the search, as a
lower case string. The options and descriptions of the programs are
available at
https://blast.ncbi.nlm.nih.gov/Blast.cgi.. Currently
qblast
only works with blastn, blastp, blastx, tblast and tblastx.
The second argument specifies the databases to search against. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
The third argument is a string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number.
The qblast
function also take a number of other option arguments which
are basically analogous to the different parameters you can set on the
BLAST web page. We’ll just highlight a few of them here:
The argument url_base sets the base URL for running BLAST over the internet. By default it connects to the NCBI, but one can use this to connect to an instance of NCBI BLAST running in the cloud. Please refer to the documentation for the qblast function for further details
The qblast
function can return the BLAST results in various
formats, which you can choose with the optional format_type
keyword: "HTML"
, "Text"
, "ASN.1"
, or "XML"
. The default is
"XML"
, as that is the format expected by the parser, described in
section [sec:parsing-blast] below.
The argument expect
sets the expectation or e-value threshold.
For more about the optional BLAST arguments, we refer you to the NCBI’s own documentation, or that built into Biopython:
from Bio.Blast import NCBIWWW
help(NCBIWWW.qblast)
Help on function qblast in module Bio.Blast.NCBIWWW: qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None) BLAST search using NCBI's QBLAST server or a cloud service provider. Supports all parameters of the old qblast API for Put and Get. Please note that NCBI uses the new Common URL API for BLAST searches on the internet (http://ncbi.github.io/blast-cloud/dev/api.html). Thus, some of the parameters used by this function are not (or are no longer) officially supported by NCBI. Although they are still functioning, this may change in the future. The Common URL API (http://ncbi.github.io/blast-cloud/dev/api.html) allows doing BLAST searches on cloud servers. To use this feature, please set ``url_base='http://host.my.cloud.service.provider.com/cgi-bin/blast.cgi'`` and ``format_object='Alignment'``. For more details, please see https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=CloudBlast Some useful parameters: - program blastn, blastp, blastx, tblastn, or tblastx (lower case) - database Which database to search against (e.g. "nr"). - sequence The sequence to search. - ncbi_gi TRUE/FALSE whether to give 'gi' identifier. - descriptions Number of descriptions to show. Def 500. - alignments Number of alignments to show. Def 500. - expect An expect value cutoff. Def 10.0. - matrix_name Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45). - filter "none" turns off filtering. Default no filtering - format_type "HTML", "Text", "ASN.1", or "XML". Def. "XML". - entrez_query Entrez query to limit Blast search - hitlist_size Number of hits to return. Default 50 - megablast TRUE/FALSE whether to use MEga BLAST algorithm (blastn only) - short_query TRUE/FALSE whether to adjust the search parameters for a short query sequence. Note that this will override manually set parameters like word size and e value. Turns off when sequence length is > 30 residues. Default: None. - service plain, psi, phi, rpsblast, megablast (lower case) This function does no checking of the validity of the parameters and passes the values to the server as is. More help is available at: https://ncbi.github.io/blast-cloud/dev/api.html
Note that the default settings on the NCBI BLAST website are not quite the same as the defaults on QBLAST. If you get different results, you’ll need to check the parameters (e.g., the expectation value threshold and the gap values).
For example, if you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query sequence, you can use:
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query argument:
from Bio.Blast import NCBIWWW
fasta_string = open("data/m_cold.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-18-5b69cf3d2fdc> in <module> 1 from Bio.Blast import NCBIWWW 2 fasta_string = open("data/m_cold.fasta").read() ----> 3 result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string) ~/anaconda3/lib/python3.8/site-packages/Bio/Blast/NCBIWWW.py in qblast(program, database, sequence, url_base, auto_format, composition_based_statistics, db_genetic_code, endpoints, entrez_query, expect, filter, gapcosts, genetic_code, hitlist_size, i_thresh, layout, lcase_mask, matrix_name, nucl_penalty, nucl_reward, other_advanced, perc_ident, phi_pattern, query_file, query_believe_defline, query_from, query_to, searchsp_eff, service, threshold, ungapped_alignment, word_size, short_query, alignments, alignment_view, descriptions, entrez_links_new_window, expect_low, expect_high, format_entrez_query, format_object, format_type, ncbi_gi, results_file, show_overview, megablast, template_type, template_length) 254 255 request = Request(url_base, message, {"User-Agent": "BiopythonClient"}) --> 256 handle = urlopen(request) 257 results = handle.read().decode() 258 ~/anaconda3/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context) 220 else: 221 opener = _opener --> 222 return opener.open(url, data, timeout) 223 224 def install_opener(opener): ~/anaconda3/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout) 523 524 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method()) --> 525 response = self._open(req, data) 526 527 # post-process response ~/anaconda3/lib/python3.8/urllib/request.py in _open(self, req, data) 540 541 protocol = req.type --> 542 result = self._call_chain(self.handle_open, protocol, protocol + 543 '_open', req) 544 if result: ~/anaconda3/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args) 500 for handler in handlers: 501 func = getattr(handler, meth_name) --> 502 result = func(*args) 503 if result is not None: 504 return result ~/anaconda3/lib/python3.8/urllib/request.py in https_open(self, req) 1391 1392 def https_open(self, req): -> 1393 return self.do_open(http.client.HTTPSConnection, req, 1394 context=self._context, check_hostname=self._check_hostname) 1395 ~/anaconda3/lib/python3.8/urllib/request.py in do_open(self, http_class, req, **http_conn_args) 1348 try: 1349 try: -> 1350 h.request(req.get_method(), req.selector, req.data, headers, 1351 encode_chunked=req.has_header('Transfer-encoding')) 1352 except OSError as err: # timeout error ~/anaconda3/lib/python3.8/http/client.py in request(self, method, url, body, headers, encode_chunked) 1238 encode_chunked=False): 1239 """Send a complete request to the server.""" -> 1240 self._send_request(method, url, body, headers, encode_chunked) 1241 1242 def _send_request(self, method, url, body, headers, encode_chunked): ~/anaconda3/lib/python3.8/http/client.py in _send_request(self, method, url, body, headers, encode_chunked) 1284 # default charset of iso-8859-1. 1285 body = _encode(body, 'body') -> 1286 self.endheaders(body, encode_chunked=encode_chunked) 1287 1288 def getresponse(self): ~/anaconda3/lib/python3.8/http/client.py in endheaders(self, message_body, encode_chunked) 1233 else: 1234 raise CannotSendHeader() -> 1235 self._send_output(message_body, encode_chunked=encode_chunked) 1236 1237 def request(self, method, url, body=None, headers={}, *, ~/anaconda3/lib/python3.8/http/client.py in _send_output(self, message_body, encode_chunked) 1004 msg = b"\r\n".join(self._buffer) 1005 del self._buffer[:] -> 1006 self.send(msg) 1007 1008 if message_body is not None: ~/anaconda3/lib/python3.8/http/client.py in send(self, data) 944 if self.sock is None: 945 if self.auto_open: --> 946 self.connect() 947 else: 948 raise NotConnected() ~/anaconda3/lib/python3.8/http/client.py in connect(self) 1407 server_hostname = self.host 1408 -> 1409 self.sock = self._context.wrap_socket(self.sock, 1410 server_hostname=server_hostname) 1411 ~/anaconda3/lib/python3.8/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session) 498 # SSLSocket class handles server_hostname encoding before it calls 499 # ctx._wrap_socket() --> 500 return self.sslsocket_class._create( 501 sock=sock, 502 server_side=server_side, ~/anaconda3/lib/python3.8/ssl.py in _create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session) 1038 # non-blocking 1039 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets") -> 1040 self.do_handshake() 1041 except (OSError, ValueError): 1042 self.close() ~/anaconda3/lib/python3.8/ssl.py in do_handshake(self, block) 1307 if timeout == 0.0 and block: 1308 self.settimeout(None) -> 1309 self._sslobj.do_handshake() 1310 finally: 1311 self.settimeout(timeout) KeyboardInterrupt:
We could also have read in the FASTA file as a SeqRecord
and then
supplied just the sequence itself:
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("data/m_cold.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
Supplying just the sequence means that BLAST will assign an identifier
for your sequence automatically. You might prefer to use the SeqRecord
object’s format method to make a FASTA string (which will include the
existing identifier):
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("data/m_cold.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
This approach makes more sense if you have your sequence(s) in a
non-FASTA file format which you can extract using Bio.SeqIO
(see
Chapter [5 - Sequence Input and Output](05 - Sequence Input and Output.ipynb).)
Whatever arguments you give the qblast()
function, you should get back
your results in a handle object (by default in XML format). The next
step would be to parse the XML output into Python objects representing
the search results (Section [sec:parsing-blast]), but you might want
to save a local copy of the output file first. I find this especially
useful when debugging my code that extracts info from the BLAST results
(because re-running the online search is slow and wastes the NCBI
computer time).
We need to be a bit careful since we can use result_handle.read()
to
read the BLAST output only once – calling result_handle.read()
again
returns an empty string.
with open("data/my_blast.xml", "w") as out_handle:
out_handle.write(result_handle.read())
result_handle.close()
After doing this, the results are in the file my_blast.xml
and the
original handle has had all its data extracted (so we closed it).
However, the parse
function of the BLAST parser (described
in [sec:parsing-blast]) takes a file-handle-like object, so we can
just open the saved file for input:
result_handle = open("data/my_blast.xml")
Now that we’ve got the BLAST results back into a handle again, we are ready to do something with them, so this leads us right into the parsing section (see Section [sec:parsing-blast] below). You may want to jump ahead to that now ….
Running BLAST locally (as opposed to over the internet, see Section [sec:running-www-blast]) has at least major two advantages:
Local BLAST may be faster than BLAST over the internet;
Local BLAST allows you to make your own database to search for sequences against.
Dealing with proprietary or unpublished sequence data can be another reason to run BLAST locally. You may not be allowed to redistribute the sequences, so submitting them to the NCBI as a BLAST query would not be an option.
Unfortunately, there are some major drawbacks too – installing all the bits and getting it setup right takes some effort:
Local BLAST requires command line tools to be installed.
Local BLAST requires (large) BLAST databases to be setup (and potentially kept up to date).
To further confuse matters there are several different BLAST packages available, and there are also other tools which can produce imitation BLAST output files, such as BLAT.
The “new” NCBI BLAST+ suite was released in 2009. This replaces the old NCBI “legacy” BLAST package (see below).
This section will show briefly how to use these tools from within Python. If you have already read or tried the alignment tool examples in Section [sec:alignment-tools] this should all seem quite straightforward. First, we construct a command line string (as you would type in at the command line prompt if running standalone BLAST by hand). Then we can execute this command from within Python.
For example, taking a FASTA file of gene nucleotide sequences, you might want to run a BLASTX (translation) search against the non-redundant (NR) protein database. Assuming you (or your systems administrator) has downloaded and installed the NR database, you might run:
blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5
This should run BLASTX against the NR database, using an expectation cut-off value of $0.001$ and produce XML output to the specified file (which we can then parse). On my computer this takes about six minutes - a good reason to save the output to a file so you can repeat any analysis as needed.
From within Biopython we can use the NCBI BLASTX wrapper from the
Bio.Blast.Applications
module to build the command line string, and
run it:
from Bio.Blast.Applications import NcbiblastxCommandline
help(NcbiblastxCommandline)
Help on class NcbiblastxCommandline in module Bio.Blast.Applications: class NcbiblastxCommandline(_NcbiblastMain2SeqCommandline) | NcbiblastxCommandline(cmd='blastx', **kwargs) | | Wrapper for the NCBI BLAST+ program blastx (nucleotide query, protein database). | | With the release of BLAST+ (BLAST rewritten in C++ instead of C), the NCBI | replaced the old blastall tool with separate tools for each of the searches. | This wrapper therefore replaces BlastallCommandline with option -p blastx. | | >>> from Bio.Blast.Applications import NcbiblastxCommandline | >>> cline = NcbiblastxCommandline(query="m_cold.fasta", db="nr", evalue=0.001) | >>> cline | NcbiblastxCommandline(cmd='blastx', query='m_cold.fasta', db='nr', evalue=0.001) | >>> print(cline) | blastx -query m_cold.fasta -db nr -evalue 0.001 | | You would typically run the command line with cline() or via the Python | subprocess module, as described in the Biopython tutorial. | | Method resolution order: | NcbiblastxCommandline | _NcbiblastMain2SeqCommandline | _Ncbiblast2SeqCommandline | _NcbiblastCommandline | _NcbibaseblastCommandline | Bio.Application.AbstractCommandline | builtins.object | | Methods defined here: | | __init__(self, cmd='blastx', **kwargs) | Initialize the class. | | ---------------------------------------------------------------------- | Methods inherited from Bio.Application.AbstractCommandline: | | __call__(self, stdin=None, stdout=True, stderr=True, cwd=None, env=None) | Execute command, wait for it to finish, return (stdout, stderr). | | Runs the command line tool and waits for it to finish. If it returns | a non-zero error level, an exception is raised. Otherwise two strings | are returned containing stdout and stderr. | | The optional stdin argument should be a string of data which will be | passed to the tool as standard input. | | The optional stdout and stderr argument may be filenames (string), | but otherwise are treated as a booleans, and control if the output | should be captured as strings (True, default), or ignored by sending | it to /dev/null to avoid wasting memory (False). If sent to a file | or ignored, then empty string(s) are returned. | | The optional cwd argument is a string giving the working directory | to run the command from. See Python's subprocess module documentation | for more details. | | The optional env argument is a dictionary setting the environment | variables to be used in the new process. By default the current | process' environment variables are used. See Python's subprocess | module documentation for more details. | | Default example usage:: | | from Bio.Emboss.Applications import WaterCommandline | water_cmd = WaterCommandline(gapopen=10, gapextend=0.5, | stdout=True, auto=True, | asequence="a.fasta", bsequence="b.fasta") | print("About to run: %s" % water_cmd) | std_output, err_output = water_cmd() | | This functionality is similar to subprocess.check_output(). In general | if you require more control over running the command, use subprocess | directly. | | When the program called returns a non-zero error level, a custom | ApplicationError exception is raised. This includes any stdout and | stderr strings captured as attributes of the exception object, since | they may be useful for diagnosing what went wrong. | | __repr__(self) | Return a representation of the command line object for debugging. | | e.g. | | >>> from Bio.Emboss.Applications import WaterCommandline | >>> cline = WaterCommandline(gapopen=10, gapextend=0.5) | >>> cline.asequence = "asis:ACCCGGGCGCGGT" | >>> cline.bsequence = "asis:ACCCGAGCGCGGT" | >>> cline.outfile = "temp_water.txt" | >>> print(cline) | water -outfile=temp_water.txt -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 | >>> cline | WaterCommandline(cmd='water', outfile='temp_water.txt', asequence='asis:ACCCGGGCGCGGT', bsequence='asis:ACCCGAGCGCGGT', gapopen=10, gapextend=0.5) | | __setattr__(self, name, value) | Set attribute name to value (PRIVATE). | | This code implements a workaround for a user interface issue. | Without this __setattr__ attribute-based assignment of parameters | will silently accept invalid parameters, leading to known instances | of the user assuming that parameters for the application are set, | when they are not. | | >>> from Bio.Emboss.Applications import WaterCommandline | >>> cline = WaterCommandline(gapopen=10, gapextend=0.5, stdout=True) | >>> cline.asequence = "a.fasta" | >>> cline.bsequence = "b.fasta" | >>> cline.csequence = "c.fasta" | Traceback (most recent call last): | ... | ValueError: Option name csequence was not found. | >>> print(cline) | water -stdout -asequence=a.fasta -bsequence=b.fasta -gapopen=10 -gapextend=0.5 | | This workaround uses a whitelist of object attributes, and sets the | object attribute list as normal, for these. Other attributes are | assumed to be parameters, and passed to the self.set_parameter method | for validation and assignment. | | __str__(self) | Make the commandline string with the currently set options. | | e.g. | | >>> from Bio.Emboss.Applications import WaterCommandline | >>> cline = WaterCommandline(gapopen=10, gapextend=0.5) | >>> cline.asequence = "asis:ACCCGGGCGCGGT" | >>> cline.bsequence = "asis:ACCCGAGCGCGGT" | >>> cline.outfile = "temp_water.txt" | >>> print(cline) | water -outfile=temp_water.txt -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5 | >>> str(cline) | 'water -outfile=temp_water.txt -asequence=asis:ACCCGGGCGCGGT -bsequence=asis:ACCCGAGCGCGGT -gapopen=10 -gapextend=0.5' | | set_parameter(self, name, value=None) | Set a commandline option for a program (OBSOLETE). | | Every parameter is available via a property and as a named | keyword when creating the instance. Using either of these is | preferred to this legacy set_parameter method which is now | OBSOLETE, and likely to be DEPRECATED and later REMOVED in | future releases. | | ---------------------------------------------------------------------- | Data descriptors inherited from Bio.Application.AbstractCommandline: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) | | ---------------------------------------------------------------------- | Data and other attributes inherited from Bio.Application.AbstractCommandline: | | parameters = None
blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr", evalue=0.001,
outfmt=5, out="opuntia.xml")
blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta', db='nr', evalue=0.001)
print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
# stdout, stderr = blastx_cline()
In this example there shouldn’t be any output from BLASTX to the
terminal, so stdout and stderr should be empty. You may want to check
the output file opuntia.xml
has been created.
As you may recall from earlier examples in the tutorial, the
opuntia.fasta
contains seven sequences, so the BLAST XML output should
contain multiple results. Therefore use Bio.Blast.NCBIXML.parse()
to
parse it as described below in Section [sec:parsing-blast].
NCBI BLAST+ (written in C++) was first released in 2009 as a replacement
for the original NCBI “legacy” BLAST (written in C) which is no longer
being updated. There were a lot of changes – the old version had a
single core command line tool blastall
which covered multiple
different BLAST search types (which are now separate commands in
BLAST+), and all the command line options were renamed. Biopython’s
wrappers for the NCBI “legacy” BLAST tools have been deprecated and will
be removed in a future release. To try to avoid confusion, we do not
cover calling these old tools from Biopython in this tutorial.
You may also come across Washington University
BLAST (WU-BLAST), and its successor, Advanced
Biocomputing BLAST (AB-BLAST, released in
2009, not free/open source). These packages include the command line
tools wu-blastall
and ab-blastall
, which mimicked blastall
from
the NCBI “legacy” BLAST suite. Biopython does not currently provide
wrappers for calling these tools, but should be able to parse any NCBI
compatible output from them.
As mentioned above, BLAST can generate output in various formats, such as XML, HTML, and plain text. Originally, Biopython had parsers for BLAST plain text and HTML output, as these were the only output formats offered at the time. Unfortunately, the BLAST output in these formats kept changing, each time breaking the Biopython parsers. Our HTML BLAST parser has been removed, but the plain text BLAST parser is still available (see Section [sec:parsing-blast-deprecated]). Use it at your own risk, it may or may not work, depending on which BLAST version you’re using.
As keeping up with changes in BLAST became a hopeless endeavor, especially with users running different BLAST versions, we now recommend to parse the output in XML format, which can be generated by recent versions of BLAST. Not only is the XML output more stable than the plain text and HTML output, it is also much easier to parse automatically, making Biopython a whole lot more stable.
You can get BLAST output in XML format in various ways. For the parser, it doesn’t matter how the output was generated, as long as it is in the XML format.
You can use Biopython to run BLAST over the internet, as described in section [sec:running-www-blast].
You can use Biopython to run BLAST locally, as described in section [sec:running-local-blast].
You can do the BLAST search yourself on the NCBI site through your web browser, and then save the results. You need to choose XML as the format in which to receive the results, and save the final BLAST page you get (you know, the one with all of the interesting results!) to a file.
You can also run BLAST locally without using Biopython, and save the output in a file. Again, you need to choose XML as the format in which to receive the results.
The important point is that you do not have to use Biopython scripts to
fetch the data in order to be able to parse it. Doing things in one of
these ways, you then need to get a handle to the results. In Python, a
handle is just a nice general way of describing input to any info source
so that the info can be retrieved using read()
and readline()
functions (see Section sec:appendix-handles).
If you followed the code above for interacting with BLAST through a
script, then you already have result_handle
, the handle to the BLAST
results. For example, using a GI number to do an online search:
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-11-c06ee70e4d61> in <module> 1 from Bio.Blast import NCBIWWW ----> 2 result_handle = NCBIWWW.qblast("blastn", "nt", "8332116") ~/anaconda3/lib/python3.8/site-packages/Bio/Blast/NCBIWWW.py in qblast(program, database, sequence, url_base, auto_format, composition_based_statistics, db_genetic_code, endpoints, entrez_query, expect, filter, gapcosts, genetic_code, hitlist_size, i_thresh, layout, lcase_mask, matrix_name, nucl_penalty, nucl_reward, other_advanced, perc_ident, phi_pattern, query_file, query_believe_defline, query_from, query_to, searchsp_eff, service, threshold, ungapped_alignment, word_size, short_query, alignments, alignment_view, descriptions, entrez_links_new_window, expect_low, expect_high, format_entrez_query, format_object, format_type, ncbi_gi, results_file, show_overview, megablast, template_type, template_length) 244 wait = qblast._previous + delay - current 245 if wait > 0: --> 246 time.sleep(wait) 247 qblast._previous = current + wait 248 else: KeyboardInterrupt:
If instead you ran BLAST some other way, and have the BLAST output (in
XML format) in the file my_blast.xml
, all you need to do is to open
the file for reading:
result_handle = open("data/my_blast.xml")
Now that we’ve got a handle, we are ready to parse the output. The code to parse it is really quite small. If you expect a single BLAST result (i.e., you used a single query):
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-8-13d8abd7fcf0> in <module> 1 from Bio.Blast import NCBIXML ----> 2 blast_record = NCBIXML.read(result_handle) NameError: name 'result_handle' is not defined
or, if you have lots of results (i.e., multiple query sequences):
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
Just like Bio.SeqIO
and Bio.AlignIO
(see
Chapters [chapter:Bio.SeqIO] and [chapter:Bio.AlignIO]), we have a
pair of input functions, read
and parse
, where read
is for when
you have exactly one object, and parse
is an iterator for when you can
have lots of objects – but instead of getting SeqRecord
or
MultipleSeqAlignment
objects, we get BLAST record objects.
To be able to handle the situation where the BLAST file may be huge,
containing thousands of results, NCBIXML.parse()
returns an iterator.
In plain English, an iterator allows you to step through the BLAST
output, retrieving BLAST records one by one for each BLAST search
result:
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
blast_record = next(blast_records)
print(blast_record.database_sequences)
# # ... do something with blast_record
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-7-eb6d7c6257b2> in <module> 1 from Bio.Blast import NCBIXML ----> 2 blast_records = NCBIXML.parse(result_handle) 3 blast_record = next(blast_records) 4 print(blast_record.database_sequences) 5 # # ... do something with blast_record NameError: name 'result_handle' is not defined
Or, you can use a for
-loop. Note though that you can step through the BLAST records only once. Usually, from each BLAST record you would save the information that you are interested in. If you want to save all returned BLAST records, you can convert the iterator into a list:
for blast_record in blast_records:
#Do something with blast_records
blast_records = list(blast_records)
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-6-da239f64a171> in <module> ----> 1 for blast_record in blast_records: 2 #Do something with blast_records 3 blast_records = list(blast_records) NameError: name 'blast_records' is not defined
blast_records = list(blast_records)
Now you can access each BLAST record in the list with an index as usual. If your BLAST file is huge though, you may run into memory problems trying to save them all in a list.
Usually, you’ll be running one BLAST search at a time. Then, all you need to do is to pick up the first (and only) BLAST record in blast_records:
from Bio.Blast import NCBIXML
blast_records = NCBIXML.parse(result_handle)
I guess by now you’re wondering what is in a BLAST record.
A BLAST Record contains everything you might ever want to extract from the BLAST output. Right now we’ll just show an example of how to get some info out of the BLAST report, but if you want something in particular that is not described here, look at the info on the record class in detail, and take a gander into the code or automatically generated documentation – the docstrings have lots of good info about what is stored in each piece of information.
To continue with our example, let’s just print out some summary info about all hits in our blast report greater than a particular threshold. The following code does this:
E_VALUE_THRESH = 0.04
from Bio.Blast import NCBIXML
result_handle = open("data/my_blast.xml", "r")
blast_records = NCBIXML.parse(result_handle)
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print("****Alignment****")
print("sequence:", alignment.title)
print("length:", alignment.length)
print("e value:", hsp.expect)
print(hsp.query[0:75] + "...")
print(hsp.match[0:75] + "...")
print(hsp.sbjct[0:75] + "...")
a membrane protein 2 (LOC108995251), transcript variant X2, mRNA length: 909 e value: 2.7506e-103 AATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATT---------GGCCGTGGCTAATATGATCGA... ||||||||| ||| | | || ||||||||||||||||||| |||| ||| || |||||||... AATGGGGAG-GAA--GGATTATTTGGCCATGAAAACTGATCCGGCCACGGCCACGGCCACGGCGGATTTGATCGA... ****Alignment**** sequence: gi|1882610309|ref|XM_018970776.2| PREDICTED: Juglans regia cold-regulated 413 plasma membrane protein 2 (LOC108995251), transcript variant X1, mRNA length: 1025 e value: 2.7506e-103 AATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATT---------GGCCGTGGCTAATATGATCGA... ||||||||| ||| | | || ||||||||||||||||||| |||| ||| || |||||||... AATGGGGAG-GAA--GGATTATTTGGCCATGAAAACTGATCCGGCCACGGCCACGGCCACGGCGGATTTGATCGA... ****Alignment**** sequence: gi|1389549632|ref|XM_006466626.3| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2 (LOC102620025), transcript variant X5, mRNA length: 880 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1389549631|ref|XM_006466625.2| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2 (LOC102620025), transcript variant X4, mRNA length: 946 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1389549630|ref|XM_006466624.3| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2 (LOC102620025), transcript variant X3, mRNA length: 956 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1389549629|ref|XM_006466623.3| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2 (LOC102620025), transcript variant X2, mRNA length: 1014 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1389549624|ref|XM_025094967.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2 (LOC102620025), transcript variant X1, mRNA length: 978 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1350315641|ref|XM_024180293.1| PREDICTED: Citrus clementina cold-regulated 413 plasma membrane protein 2 (LOC18037141), transcript variant X4, mRNA length: 868 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1350315638|ref|XM_006425719.2| PREDICTED: Citrus clementina cold-regulated 413 plasma membrane protein 2 (LOC18037141), transcript variant X3, mRNA length: 893 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1350315636|ref|XM_006425716.2| PREDICTED: Citrus clementina cold-regulated 413 plasma membrane protein 2 (LOC18037141), transcript variant X2, mRNA length: 881 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1350315634|ref|XM_006425717.2| PREDICTED: Citrus clementina cold-regulated 413 plasma membrane protein 2 (LOC18037141), transcript variant X1, mRNA length: 952 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence: gi|1204884098|ref|XM_021445554.1| PREDICTED: Herrania umbratica cold-regulated 413 plasma membrane protein 2-like (LOC110429488), mRNA length: 905 e value: 2.11466e-98 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| ||| | || ||||| |||||||| |||| | || | | || ||||| ||| ||||... AAATGGGGAGA---ATGGACTATTTGGCTATGAAAACAGATCCTGTAGCAGAAG---ATTTGATCAGTTCTGATA... ****Alignment**** sequence: gi|1227938481|ref|XM_022049453.1| PREDICTED: Carica papaya cold-regulated 413 plasma membrane protein 2-like (LOC110820077), mRNA length: 1009 e value: 8.99177e-97 AGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCG... ||||||||||||| ||| | || ||||| ||||| |||||||| |||| ||| || | ||| ||| |... AGAAAATGGGGAGG---ATGGAATATTTGGCTATGAAGACTGATCA---GGCCACTGCTGATCTCATCACTTCTG... ****Alignment**** sequence: gi|1063463253|ref|XM_007047033.2| PREDICTED: Theobroma cacao cold-regulated 413 plasma membrane protein 2 (LOC18611025), transcript variant X2, mRNA length: 1071 e value: 3.8234e-95 TGTGAACAGA-AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGAT... || |||| || ||||||||||| ||| | || ||||| |||||||| |||| | || | | || ||||... TGAGAACTGAGAAATGGGGAGA---ATGGACTATTTGGCTATGAAAACAGATCCTGTAGCAGAAG---ATTTGAT... ****Alignment**** sequence: gi|1063463252|ref|XM_007047032.2| PREDICTED: Theobroma cacao cold-regulated 413 plasma membrane protein 2 (LOC18611025), transcript variant X1, mRNA length: 1065 e value: 3.8234e-95 TGTGAACAGA-AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGAT... || |||| || ||||||||||| ||| | || ||||| |||||||| |||| | || | | || ||||... TGAGAACTGAGAAATGGGGAGA---ATGGACTATTTGGCTATGAAAACAGATCCTGTAGCAGAAG---ATTTGAT... ****Alignment**** sequence: gi|1269881407|ref|XM_022895605.1| PREDICTED: Durio zibethinus cold-regulated 413 plasma membrane protein 2 (LOC111300020), transcript variant X3, mRNA length: 1069 e value: 1.3345e-94 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| ||| |||| ||||| ||||||||||||| | || | | ||| ||||| ||| ||||... AAATGGGGAGA---ATGGAGTATTTGGCTATGAAAACTGATCCTGTAGCTGAAG--AAT-TGATCAGTTCTGATA... ****Alignment**** sequence: gi|1269881405|ref|XM_022895604.1| PREDICTED: Durio zibethinus cold-regulated 413 plasma membrane protein 2 (LOC111300020), transcript variant X2, mRNA length: 1091 e value: 1.3345e-94 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| ||| |||| ||||| ||||||||||||| | || | | ||| ||||| ||| ||||... AAATGGGGAGA---ATGGAGTATTTGGCTATGAAAACTGATCCTGTAGCTGAAG--AAT-TGATCAGTTCTGATA... ****Alignment**** sequence: gi|1269881403|ref|XM_022895603.1| PREDICTED: Durio zibethinus cold-regulated 413 plasma membrane protein 2 (LOC111300020), transcript variant X1, mRNA length: 1072 e value: 1.3345e-94 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| ||| |||| ||||| ||||||||||||| | || | | ||| ||||| ||| ||||... AAATGGGGAGA---ATGGAGTATTTGGCTATGAAAACTGATCCTGTAGCTGAAG--AAT-TGATCAGTTCTGATA... ****Alignment**** sequence: gi|1882636119|ref|XM_018974650.2| PREDICTED: Juglans regia cold-regulated 413 plasma membrane protein 2-like (LOC108998174), mRNA length: 1015 e value: 1.98057e-92 AATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATAT... ||||||||| ||||| || ||||| ||||||||||||| ||| ||| | || |||||| || |||||... AATGGGGAGG---ATGAATTATTTGGCTATGAAAACTGATCC---GGCAATGGATGATTTGATCGGCTCTGATAT... ****Alignment**** sequence: gi|1187397285|gb|KX009413.1| Santalum album COR413-PM2 mRNA, complete cds length: 837 e value: 6.91287e-92 AATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATAT... ||||||||| ||| | | ||||||||||||||| |||| ||||| | || ||||| ||||||| ||... AATGGGGAGG---ATGGATTTCTTGGCCATGAAAACAGATCCCGCGGCCGCCG---ATTTGATCAATTCCGACAT... ****Alignment**** sequence: gi|1079253150|ref|XM_009343631.2| PREDICTED: Pyrus x bretschneideri cold-regulated 413 plasma membrane protein 2-like (LOC103933927), mRNA length: 861 e value: 3.58095e-89 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| ||||| |||||||| ||||| || || ||| | | ||| ||||||| |||||| | | ||| ||| ||... TGATAGATTCAGATATCAAAGAGCTCAAGATTGCAGCCAAGAGACTCATCAGTGATGCCACCAAGCTTGGTGGTT... ****Alignment**** sequence: gi|1079253149|ref|XM_009343644.2| PREDICTED: Pyrus x bretschneideri cold-regulated 413 plasma membrane protein 2-like (LOC103933943), mRNA length: 876 e value: 3.58095e-89 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| ||||| |||||||| ||||| || || ||| | | ||| ||||||| |||||| | | ||| ||| ||... TGATAGATTCAGATATCAAAGAGCTCAAGATTGCAGCCAAGAGACTCATCAGTGATGCCACCAAGCTTGGTGGTT... ****Alignment**** sequence: gi|1079239703|ref|XM_009378191.2| PREDICTED: Pyrus x bretschneideri cold-regulated 413 plasma membrane protein 2 (LOC103965177), mRNA length: 885 e value: 3.58095e-89 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| ||||| |||||||| ||||| || || ||| | | ||| ||||||| |||||| | | ||| ||| ||... TGATAGATTCAGATATCAAAGAGCTCAAGATTGCAGCCAAGAGACTCATCAGTGATGCCACCAAGCTTGGTGGTT... ****Alignment**** sequence: gi|1350280614|ref|XM_024170292.1| PREDICTED: Morus notabilis cold-regulated 413 plasma membrane protein 2 (LOC21394987), mRNA length: 1020 e value: 1.24988e-88 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... |||||||||| || || |||||||||||||| || | | ||| |||| || |||| |||| ||||... AAATGGGGAGGGAT------TATTTGGCCATGAAAACGGACCCA---GCCACGGCTGATTTGATAAATTCTGATA... ****Alignment**** sequence: gi|743838297|ref|XM_011027373.1| PREDICTED: Populus euphratica cold-regulated 413 plasma membrane protein 2 (LOC105126500), transcript variant X2, mRNA length: 1132 e value: 1.24988e-88 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| ||| ||| ||| ||||| |||||| | | | |||||| | || |||||||||... AAAATGGGGAGG---ATGGAGTTTTTGAAGATGAAGACTGATGATGAAGTCAGCGCTAATTTAATTGATTCCGAT... ****Alignment**** sequence: gi|743838293|ref|XM_011027372.1| PREDICTED: Populus euphratica cold-regulated 413 plasma membrane protein 2 (LOC105126500), transcript variant X1, mRNA length: 980 e value: 1.24988e-88 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| ||| ||| ||| ||||| |||||| | | | |||||| | || |||||||||... AAAATGGGGAGG---ATGGAGTTTTTGAAGATGAAGACTGATGATGAAGTCAGCGCTAATTTAATTGATTCCGAT... ****Alignment**** sequence: gi|1768569081|ref|XM_031406607.1| PREDICTED: Pistacia vera cold-regulated 413 plasma membrane protein 2-like (LOC116120644), mRNA length: 982 e value: 1.52266e-87 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATT-GGCCGTGGCTAATATGATCGATTCCGA... ||||||||||| ||| | || ||| ||||||||||| |||| || ||| | |||| | || ||... AAAATGGGGAGG---ATGGATTATCTGGGAATGAAAACTGA-CAATCAGGTTACTGCTGAGGTGATTAACTCTGA... ****Alignment**** sequence: gi|1216950057|ref|XM_021815585.1| PREDICTED: Hevea brasiliensis cold-regulated 413 plasma membrane protein 2-like (LOC110658100), transcript variant X2, mRNA length: 1073 e value: 1.52266e-87 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... |||||||||| ||| |||||||| |||| ||||||||| | | ||| || ||||| | || ||| ... AAATGGGGAGG---ATGGAGTACTTGAAAATGAGTACTGATCAAGTACC---GGCCGATTTGATCAAGTCTGATC... ****Alignment**** sequence: gi|1216950055|ref|XM_021815584.1| PREDICTED: Hevea brasiliensis cold-regulated 413 plasma membrane protein 2-like (LOC110658100), transcript variant X1, mRNA length: 1024 e value: 1.52266e-87 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... |||||||||| ||| |||||||| |||| ||||||||| | | ||| || ||||| | || ||| ... AAATGGGGAGG---ATGGAGTACTTGAAAATGAGTACTGATCAAGTACC---GGCCGATTTGATCAAGTCTGATC... ****Alignment**** sequence: gi|1375871125|ref|XM_024605027.1| PREDICTED: Populus trichocarpa cold-regulated 413 plasma membrane protein 2 (LOC18101203), mRNA length: 1130 e value: 5.31461e-87 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| ||| ||| ||| ||||| |||||| | | | |||||| | || || ||||||... AAAATGGGGAGG---ATGGAGTTTTTGAAGATGAAGACTGATGATGAAGTCAGCGCTAATTTAATTGAGTCCGAT... ****Alignment**** sequence: gi|1585724761|ref|XM_028202722.1| PREDICTED: Camellia sinensis cold-regulated 413 plasma membrane protein 2-like (LOC114262355), mRNA length: 910 e value: 1.85498e-86 AGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCG... ||||||||||||| ||||| |||| ||||| ||||| || ||||| ||| | | ||| ||||||... AGAAAATGGGGAGGAAAATGGAGTATTTGGCAATGAAGACCGATCATCCAGCCCCAACCCAATCGATGAATTCCG... ****Alignment**** sequence: gi|1860377401|ref|XM_035077206.1| PREDICTED: Populus alba cold-regulated 413 plasma membrane protein 2-like (LOC118063227), transcript variant X2, mRNA length: 916 e value: 6.47451e-86 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| ||| ||| ||| ||||| |||||| | | | | |||| | || || ||||||... AAAATGGGGAGG---ATGGAGTTTTTGAAGATGAAGACTGATGATGAAGTCAGCGGTAATTTAATTGAGTCCGAT... ****Alignment**** sequence: gi|1860377399|ref|XM_035077205.1| PREDICTED: Populus alba cold-regulated 413 plasma membrane protein 2-like (LOC118063227), transcript variant X1, mRNA length: 1109 e value: 6.47451e-86 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| ||| ||| ||| ||||| |||||| | | | | |||| | || || ||||||... AAAATGGGGAGG---ATGGAGTTTTTGAAGATGAAGACTGATGATGAAGTCAGCGGTAATTTAATTGAGTCCGAT... ****Alignment**** sequence: gi|1162571919|ref|XM_020568695.1| PREDICTED: Prunus persica cold-regulated 413 plasma membrane protein 2 (LOC18770198), transcript variant X2, mRNA length: 929 e value: 7.88757e-85 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| |||| || |||||||| || || || ||| | | || |||||||||||||| | || ||| ||| |... TGATAAATTCAGACATCAATGATCTCAAGATTGCAGCCAAGAAACTCATCAATGATGCCACTAAGCTTGGTGGGT... ****Alignment**** sequence: gi|1162571918|ref|XM_007202530.2| PREDICTED: Prunus persica cold-regulated 413 plasma membrane protein 2 (LOC18770198), transcript variant X1, mRNA length: 811 e value: 7.88757e-85 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| |||| || |||||||| || || || ||| | | || |||||||||||||| | || ||| ||| |... TGATAAATTCAGACATCAATGATCTCAAGATTGCAGCCAAGAAACTCATCAATGATGCCACTAAGCTTGGTGGGT... ****Alignment**** sequence: gi|1229761331|ref|XM_022277554.1| PREDICTED: Momordica charantia cold-regulated 413 plasma membrane protein 2-like (LOC111005887), mRNA length: 850 e value: 2.75303e-84 ATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATTACGGGT... |||| |||||||| ||||||||||| ||| | | |||||| | | |||| | | ||||||| | || ... ATTCTGATATCAACGAGCTTAAAATTGCAGCCACGAGGCTTCTTGAACATGCCACCAAGCTCGGTGGAAAGGGCC... ****Alignment**** sequence: gi|764593175|ref|XM_004300526.2| PREDICTED: Fragaria vesca subsp. vesca cold-regulated 413 plasma membrane protein 2 (LOC101313417), mRNA length: 1105 e value: 3.35388e-83 ATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAAT---ATGATCGATTCCGAT... |||||||| | || | || ||||| ||||||||||| | | | || ||| ||| |||| ||||||||... ATGGGGAGGG---TGGACTATTTGGCTATGAAAACTGACCCAGTTGC---GGCCAATGAGTTGATGAATTCCGAT... ****Alignment**** sequence: gi|1861285698|gb|MN544658.1| Populus simonii x Populus nigra cold temperature stress protein (WCOR413) mRNA, complete cds length: 639 e value: 3.35388e-83 GCTAATATGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTC... |||||| | || || |||||| |||||||||| || | || | | || |||||||| ||||| || ||| ... GCTAATTTAATTGAGTCCGATGTCAATGAGCTCAAGGTTGCTGCTAAGAAACTCATCAAGGATGCCGCTAAGCTT... ****Alignment**** sequence: gi|1104507484|ref|XM_002274845.4| PREDICTED: Vitis vinifera cold-regulated 413 plasma membrane protein 2 (LOC100267774), mRNA length: 893 e value: 1.17062e-82 TGAACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGA... |||| |||||||||||| ||| |||| |||| ||||||||||||| | | | ||| |||| |... TGAAACGAAAATGGGGAGG---ATGGAGTATCTGGCTATGAAAACTGATCCC--GAACCAACCCAAT-TGATTAA... ****Alignment**** sequence: gi|349709091|emb|FQ378501.1| Vitis vinifera clone SS0AEB13YG07 length: 834 e value: 1.17062e-82 TGAACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGA... |||| |||||||||||| ||| |||| |||| ||||||||||||| | | | ||| |||| |... TGAAACGAAAATGGGGAGG---ATGGAGTATCTGGCTATGAAAACTGATCCC--GAACCAACCCAAT-TGATTAA... ****Alignment**** sequence: gi|1585658262|ref|XM_028225077.1| PREDICTED: Camellia sinensis cold-regulated 413 plasma membrane protein 2-like (LOC114282398), mRNA length: 1176 e value: 1.17062e-82 AAATGGGGAG-AGAA--ATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCG... |||||||||| || | ||| ||| ||||||||||| ||||||| ||||| ||| | ||||| ||||||... AAATGGGGAGGAGTATGATGGGGTATTTGGCCATGAAGACTGATC---TGGCCACGGCCGAATTGATCAATTCCG... ****Alignment**** sequence: gi|1220094463|ref|XM_021978417.1| PREDICTED: Prunus avium cold-regulated 413 plasma membrane protein 2-like (LOC110773902), mRNA length: 896 e value: 4.08586e-82 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| |||| || |||||||| || || || ||| | | || |||||||| ||||| | || ||| ||| |... TGATAAATTCAGACATCAATGATCTCAAGATTGCAGCCAAGAAACTCATCAAAGATGCCACTAAGCTTGGTGGGT... ****Alignment**** sequence: gi|1220047144|ref|XM_021953815.1| PREDICTED: Prunus avium cold-regulated 413 plasma membrane protein 2-like (LOC110753022), mRNA length: 894 e value: 4.08586e-82 TGATCGATTCCGATATCAATGAGCTTAAAATGGCAACAATGAGGCTCATCAATGATGCTAGTATGCTCGGTCATT... |||| |||| || |||||||| || || || ||| | | || |||||||| ||||| | || ||| ||| |... TGATAAATTCAGACATCAATGATCTCAAGATTGCAGCCAAGAAACTCATCAAAGATGCCACTAAGCTTGGTGGGT... ****Alignment**** sequence: gi|1216289774|ref|XM_021774063.1| PREDICTED: Manihot esculenta cold-regulated 413 plasma membrane protein 2-like (LOC110627712), mRNA length: 988 e value: 4.08586e-82 GAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATT---GGCCGTGGCTAATATGATCGATTC... |||||||||||| || |||||||| |||| ||||||||| | ||||| || |||| | ||... GAAAATGGGGAGG---ATTGAGTACTTGAAAATGAGTACTGATCAAGTAAAGGCCG------ATTTGATTCAATC... ****Alignment**** sequence: gi|1847903348|ref|XM_034833917.1| PREDICTED: Vitis riparia cold-regulated 413 plasma membrane protein 2-like (LOC117917592), mRNA length: 824 e value: 1.42611e-81 TGAACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGA... |||| |||||||||||| ||| |||| |||| ||||||||||||| | || | ||| ||||| |... TGAAACGAAAATGGGGAGG---ATGGAGTATCTGGCTATGAAAACTGATCCC--GACCCAACCCAAT-TGATCAA... ****Alignment**** sequence: gi|1102738967|ref|XM_010256725.2| PREDICTED: Nelumbo nucifera cold-regulated 413 plasma membrane protein 2 (LOC104595819), mRNA length: 901 e value: 1.42611e-81 AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT... ||||||||||| || |||| |||||||||||||||||| || | | | |||| |||||| ... AAAATGGGGAGG---ATTCAGTATCTGGCCATGAAAACTGATCCGATGACAACCG---AGTTGATTAGTTCCGAC...
This will print out summary reports like the following: **Alignment** sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein alpha form mRNA, complete cds length: 783 e value: 0.034 tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc... ||||||||| | ||||||||||| || |||| || || |||||||| |||||| | | |||||||| ||| | |... tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...
Basically, you can do anything you want to with the info in the BLAST report once you have parsed it. This will, of course, depend on what you want to use it for, but hopefully this helps you get started on doing what you need to do!
An important consideration for extracting information from a BLAST
report is the type of objects that the information is stored in. In
Biopython, the parsers return Record
objects, either Blast
or
PSIBlast
depending on what you are parsing. These objects are defined
in Bio.Blast.Record
and are quite complete.
Here are my attempts at UML class diagrams for the Blast
and
PSIBlast
record classes. If you are good at UML and see
mistakes/improvements that can be made, please let me know. The Blast
class diagram is shown in the next figure
The PSIBlast record object is similar, but has support for the rounds that are used in the iteration steps of PSIBlast. The class diagram for PSIBlast is shown in the next figure.
You can run the standalone version of PSI-BLAST (the legacy NCBI command line tool blastpgp, or its replacement psiblast) using the wrappers in Bio.Blast.Applications module. At the time of writing, the NCBI does not appear to support tools running a PSI-BLAST search via the internet. Note that the Bio.Blast.NCBIXML parser can read the XML output from current versions of PSI-BLAST, but information like which sequences in each iteration is new or reused isn’t present in the XML file.
You can run the standalone version of RPS-BLAST (either the legacy NCBI command line tool rpsblast, or its replacement with the same name) using the wrappers in Bio.Blast.Applications module. At the time of writing, the NCBI does not appear to support tools running an RPS-BLAST search via the internet. You can use the Bio.Blast.NCBIXML parser to read the XML output from current versions of RPSBLAST.