How to restrict a BLAST search to include only a few protein sequences?

I am performing a BLAST search and I need to filter the output to a certain family of proteins.

Specifically, I need to get matches within the CYP152 family of the Cytochrome P450 proteins. I can parse the results to just P450 proteins fairly easily as the "hit def" field in the output will contain "Cytochrome P450" somewhere in there if it is in fact a P450 protein sequence, but the same is not true for CYP152.

Is there some way to do this? I am using ExPASy BLAST:

Thanks in advance

The gene identifiers that you get in BLAST output depend on the identifiers used when the BLAST database was constructed, so you may not find the exact identifier you are looking for.

To restrict the BLAST results to a smaller set of proteins or nucleic acids, you can choose either of two approaches:

  1. Make a database with the FASTA sequences of the proteins/nucleic acids that you want. You have to install BLAST and run it from the command line. Use makeblastdb to make the database, then align your queries against it. Note that your E-values will not be the same as those from a search against the bigger database, because the E-value scales with the size of the database searched.
  2. As you said in the question, you can parse the results afterwards. You would have to know which identifiers were used in the BLAST database that you searched against. ExPASy BLAST most probably uses UniProt IDs (there is an ID converter on the UniProt website to map UniProt IDs to other IDs). If you use NCBI BLAST (nr database), you would need to know the GenBank IDs.
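A rough sketch of both options in Python, in case it helps. The file names (`cyp152.fasta`, `cyp152_db`) and the accessions in the allow-list are placeholders I made up, not real CYP152 entries; you would collect the real accessions yourself (for example from UniProt). The filter assumes tabular output (`-outfmt 6`), where the subject ID is the second column, and it assumes UniProt-style IDs of the form `sp|ACCESSION|NAME`.

```python
def makeblastdb_command(fasta_path, db_name):
    """Build the command line for option 1: a protein database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]


def blastp_command(query_path, db_name, out_path):
    """Build a blastp command that writes tabular output (outfmt 6)."""
    return ["blastp", "-query", query_path, "-db", db_name,
            "-outfmt", "6", "-out", out_path]


def accession_of(subject_id):
    """Extract the accession from 'sp|ACCESSION|NAME'; otherwise return the ID unchanged."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) >= 3 else subject_id


def filter_hits(tabular_text, allowed_accessions):
    """Option 2: keep only the tabular hit rows whose subject accession is in the allow-list."""
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if accession_of(fields[1]) in allowed_accessions:
            kept.append(fields)
    return kept


if __name__ == "__main__":
    # The commands are only printed here; pass them to subprocess.run to execute them.
    print(" ".join(makeblastdb_command("cyp152.fasta", "cyp152_db")))
    print(" ".join(blastp_command("query.fasta", "cyp152_db", "hits.tsv")))
```

With option 1 every hit is already inside the family, so no post-filtering is needed; with option 2 you keep the statistics of the full database search and only discard rows.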

Improving pairwise comparison of protein sequences with domain co-occurrence

Comparing and aligning protein sequences is an essential task in bioinformatics. More specifically, local alignment tools like BLAST are widely used for identifying conserved protein sub-sequences, which likely correspond to protein domains or functional motifs. However, to limit the number of false positives, these tools are used with stringent sequence-similarity thresholds and hence can miss several hits, especially for species that are phylogenetically distant from reference organisms. A solution to this problem is to integrate additional contextual information into the procedure. Here, we propose to use domain co-occurrence to increase the sensitivity of pairwise sequence comparisons. Domain co-occurrence is a strong feature of proteins, since most protein domains tend to appear with a limited number of other domains on the same protein. We propose a method to take this information into account in a typical BLAST analysis and to construct new domain families on the basis of these results. We used Plasmodium falciparum as a case study to evaluate our method. The experimental findings showed an increase of 14% in the number of significant BLAST hits and an increase of 25% in the proteome area that can be covered with a domain. Our method identified 2240 new domains for which, in most cases, no model of the Pfam database could be linked. Moreover, our study of the quality of the new domains in terms of alignment and physicochemical properties shows that they are close to those of standard Pfam domains. Source code of the proposed approach and supplementary data are available at:


BLAST (Basic Local Alignment Search Tool) is used to perform sequence similarity searches. Most often this means that BLAST is used to search a sequence (either DNA or protein) against a database of other sequences (either all nucleotide or all protein) in order to identify similar sequences. BLAST has many different flavors and can not only search DNA against DNA or protein against protein but also can translate a nucleotide query and search it against a protein database as well as the other way around. It can also compute a “profile” for the query sequence and use that for further searches as well as search the query against a database of profiles. BLAST is available as a web service at the NCBI, as a stand-alone binary, and is built into other tools. It is an extremely versatile program and probably the most heavily used similarity search program in the world. BLAST runs on a multitude of different platforms that include Windows, MacOS, LINUX, and many flavors of UNIX. It is also under continuing development with new algorithmic innovations. Multiple references to BLAST can be found at

The version of BLAST in the NCBI C++ Toolkit was rewritten from scratch based upon the version in the C Toolkit that was originally introduced in 1997. A decision was made to break the code for the new version of BLAST into two different categories. There is the “core” code of BLAST that is written in vanilla C and does not use any part of the NCBI C or C++ Toolkits. There is also the “API” code that is written in C++ and takes full advantage of the tools provided by the NCBI C++ Toolkit. The reason to write the core part of the code in vanilla C was so that the same code could be used in the C Toolkit (to replace the 1997 version) as well as to make it possible for researchers interested in algorithmic development to work with the core of BLAST independently of any Toolkit. Even though the core part was written without the benefit of the C++ or C Toolkits, an effort was made to conform to the Programming Policies and Guidelines chapter of this book. Doxygen-style comments are used to allow API documentation to be automatically generated (see the BLAST Doxygen link). Both the core and API parts of BLAST can be found under algo/blast in the C++ Toolkit.

An attempt was made to isolate the user of the BLAST API (as exposed in algo/blast/api) from the core of BLAST, so that algorithmic enhancements or refactoring of that code would be transparent to the API programmer as far as that is possible. Since BLAST is continually under development and many of the developments involve new features, it is not always possible or desirable to isolate the API programmer from these changes. This chapter will focus on the API for the C++ Toolkit. A few different search classes will be discussed. These include the CLocalBlast class, typically used for searching a query (or queries) against a BLAST database; CRemoteBlast, used for sending searches to the NCBI servers; and CBl2Seq, useful for searching target sequences that have not been formatted as a BLAST database.


As shown in Table 1, ProSplicer uses material on 21,786 genes from ENSEMBL [5], a total of 2,311,460 sequences including protein, mRNA, and EST sequences, to investigate local sequence similarities that can reveal alternative splicing variants. The number of exon candidates generated by alignment tools is shown in Table 1: 442,077, 395,619, and 12,361,685 exon candidates are predicted by aligning protein sequences, mRNA sequences and EST sequences, respectively, against the genomic sequences. ProSplicer also takes mouse protein sequences into account to reveal the cross-species comparison of the alternative splicing variants of a gene. That is, the mouse protein sequences are aligned to human genomic sequences and matching blocks are generated to be the exon candidates.

Query interfaces

In ProSplicer, all the related evidence sequences, that is, mRNA, EST and protein sequences that are maintained in the database, are pre-aligned to the genomic gene sequences. All alternative splicing variants revealed by the alignment after the filtering phase are also stored. Both text and graphical information are provided in ProSplicer, as well as the query interfaces provided via the web.

By considering the alternative splicing forms of a gene provided in ProSplicer, an exon might be left out or selected after comparison to other protein, mRNA or EST sequences. The three major types of alternative splicing events, including exon skipping, alternative 5' splicing donor sites, and alternative 3' splicing acceptor sites [6] are included in the database. Figure 1 shows an example of three types of alternative splicing events. The three types of alternative splicing forms can be shown directly in the graphical user interface in ProSplicer, as shown in Figure 2.

Figure 1. Comparison of pairs of transcripts from the same gene showing three types of alternative splicing events. The shaded bars indicate the exon candidates; the thin lines indicate the intron regions.

Figure 2. An example ProSplicer analysis showing how the three types of alternative splicing event will be displayed.

Search tools

ProSplicer provides several keyword search criteria, such as Ensembl gene identification numbers, gene symbols or names, protein id, and UniGene id. Users can submit a gene symbol as a keyword and the database returns the query results containing the keyword. All the gene-related information, including supporting evidence (that is, mRNA, EST and protein sequences), is also provided in the interface. Protein id and UniGene id can also be submitted by the user, and the query result returns the genes that are supported by the query protein sequences or UniGene clusters.

Gene information

ProSplicer provides related reference links to other biological databases and sequences related to the genes selected. The related annotations and reference database links of a gene include Ensembl id numbers, gene symbols, genomic locations and gene descriptions. As shown in Figure 3, the available reference links include GO (Gene Ontology data) [7], HUGO (providing access to the list of currently approved human gene symbols) [8], GeneCard [9] (integrating human genes, their products and their involvement in diseases), LocusLink [10] (organizing information around genes to generate a central hub for accessing gene-specific information), RefSeq [11] (providing reference sequence standards for genomes, transcripts and proteins) and OMIM [12].

Figure 3. Gene information and links in ProSplicer.

Graphical splicing view

The splicing view consists of two parts - 'overview' and 'detailed view'. The overview interface provides a graphical view of the selected gene's location on the chromosome. Figure 4 shows the detailed view in ProSplicer. There are two graphical components in the detailed view. The first is an adjustment bar to scale and move the viewer along the chromosome. The second shows the alignment result of mRNA, EST and protein sequence against the gene genomic sequences to reveal the alternative splicing variants. The graphical interface provides the following functions.

Figure 4. The 'detailed view' graphical interface in ProSplicer.

Jumping to specific region. You can jump to a user-specified region of the genomic sequence (see A in Figure 4) where all the related sequences and alignment result are also shown in the detailed view.

Scaling the view. You can scale the view to 1/8, 1/4, 1/2, 2, 4 or 8 times its current window size (see B in Figure 4).

Moving the view. You can move to the left or right of the current view (see C in Figure 4).

Also shown in Figure 4 is the main graphical view of the alternative splicing view. This comprises basic gene information: gene id (D), gene symbol (E), and gene description (F). The items provided in the splicing view are: the quality of alignment (G), with the degree of similarity between matching blocks (that is, exon candidates) represented by different colors; the length of the selected gene region (H); and sequence identification (I). Each 'Sequence ID' of a nucleotide or protein sequence is hyperlinked to SWISS-PROT, GenBank and dbEST. When you click on an exon candidate (J), a new browsing window opens showing the alignment flat file. The different color fills of the exon blocks refer to the alignment quality as set out in G. When you click on an intron block, a new browsing window showing the alignment flat file opens. The display also includes tissue information, with different tissues represented in different colors, and species information on the source organism of the protein or mRNA sequences.

A comparison of existing alternative splicing databases and tools

Several alternative splicing databases, such as AsMamDB [13], ASDB [14] and SpliceDB [15], are constructed on the basis of genes whose annotations contain the keywords 'alternative splicing'. AsMamDB contains information about alternative splicing in several mammals. SpliceNest [16], SpliceDB, AsMamDB, and HASDB [17] map clustered ESTs onto human genomic DNA to compute gene structures and splice variants. PALS db [6] takes the longest mRNA sequence in each UniGene [18] cluster as the reference sequence, which is aligned with ESTs and mRNA sequences in the same cluster to predict alternative splicing sites. The BLAT server [19] is a BLAST-like alignment tool that aligns an input nucleotide sequence to human genomic sequences, mRNA, EST and protein sequences. BLAT builds an index of the database and then scans linearly through the query sequence for local alignments. It then stitches them together into a larger alignment. Finally, BLAT revisits small internal exons possibly missed at the first stage and, where feasible, adjusts large gap boundaries that have canonical splice sites. BLAT is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments. BLAT is very effective for doing alignments between mRNA and genomic DNA from the same species, and can reveal splicing variants from the alignment result. ProSplicer pre-aligns known and novel gene sequences to the available mRNA, EST and protein sequences. ProSplicer is useful when the user wants to find alternative splicing variants by inputting a gene. We briefly summarize the differences between ProSplicer and BLAT as follows.

First, researchers can input gene names in ProSplicer, as opposed to nucleotide sequences, in the query stage. Second, the methods of alignment and filtering of sequences are most likely to be very different. We describe our method more fully in the Materials and methods section. Third, in ProSplicer, links to various databases and functional information on particular genes (OMIM, RefSeq, GO, HUGO, and so on) are provided.

A comparison of several alternative splicing databases and tools is given in Table 2. The column 'Referenced sequence' indicates genomic sequences, or the longest mRNA sequence in UniGene clusters when used. The 'Types of sequence supported' column shows the materials, including proteins, mRNA or EST sequences, which are used to analyze and then investigate alternative splicing forms of genes. The alignment tool used in each approach is also shown. Whether the alternative splicing criterion for inclusion of genes has been determined through literature search or not is also given in Table 2.
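The first two stages of the BLAT strategy described above (index the database, then scan linearly through the query) can be illustrated with a minimal sketch. It is not BLAT's actual code: the word size `k`, the choice to index non-overlapping words of the database, and the function names are assumptions made for illustration, and the stitching and splice-site refinement stages are left out.

```python
from collections import defaultdict


def build_index(database, k):
    """Index the database: map each non-overlapping k-mer to its (name, position) occurrences."""
    index = defaultdict(list)
    for name, seq in database.items():
        # Step through the sequence k letters at a time (non-overlapping words).
        for start in range(0, len(seq) - k + 1, k):
            index[seq[start:start + k]].append((name, start))
    return index


def find_seeds(query, index, k):
    """Scan linearly through the query, looking up every overlapping k-mer in the index."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    toy_database = {"chr": "ACGTACGGTTCA"}  # toy genomic sequence
    index = build_index(toy_database, k=4)
    print(find_seeds("GGACGGTT", index, k=4))
```

Each seed is a triple of query position, database sequence name and database position; a fuller implementation would extend seeds and chain neighbouring ones into alignments.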

New ribosomal RNA BLAST databases available on the web BLAST service and for download

We have a curated set of ribosomal RNA (rRNA) reference sequences (Targeted Loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples (Table 1). To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences, which makes it convenient to quickly identify source organisms (Figure 1).

BLAST Service

I. Locating the BLAST Service

At the top of any PATRIC page, find the Services tab. Click on BLAST.

This will open up the BLAST landing page where researchers can do nucleotide or amino acid BLAST searches.

II. Loading a Sequence and Choosing a Type of BLAST

Cut and paste a sequence into the Sequence box. Depending upon the sequence, this will open the drop down box under Program, showing the types of BLAST available(1). Definitions of the types of BLAST searches are as follows:

BLASTN: searches nucleotide databases using a nucleotide query

BLASTP: searches protein databases using a protein query

BLASTX: searches protein databases using a translated nucleotide query

TBLASTX: searches translated nucleotide databases using a translated nucleotide query

TBLASTN: searches translated nucleotide subjects using a protein query

Note that the format for all BLAST submissions requires that the first line begin with >. If the first line lacks the >, the BLAST job will fail.
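The relationship between the query type and the programs listed above can be illustrated with a small sketch. This is not PATRIC's code: the function names are hypothetical, and the rule used to guess the query type (only the letters A, C, G, T, U and N means nucleotide) is an assumed heuristic.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

# Programs grouped by the kind of query they accept, following the definitions above.
PROGRAMS_BY_QUERY = {
    "nucleotide": ["BLASTN", "BLASTX", "TBLASTX"],
    "protein": ["BLASTP", "TBLASTN"],
}


def parse_single_fasta(text):
    """Return (header, sequence); reject input whose first line does not begin with '>'."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    return lines[0][1:], "".join(lines[1:]).upper()


def guess_query_type(sequence):
    """Assumed heuristic: a sequence made only of A, C, G, T, U, N is treated as nucleotide."""
    return "nucleotide" if set(sequence) <= NUCLEOTIDE_LETTERS else "protein"


def available_programs(fasta_text):
    """List the BLAST programs that make sense for the pasted query."""
    _, sequence = parse_single_fasta(fasta_text)
    return PROGRAMS_BY_QUERY[guess_query_type(sequence)]


if __name__ == "__main__":
    print(available_programs(">example_query\nACGTTGCAAC"))
```

Checking for the > header before submission catches the failure case described in the note above.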

Clicking on the algorithm of choice will close the drop down box and display the choice in the Program text box.

III. Selecting a Database

PATRIC provides a variety of databases that selected sequences can be compared to. If the sequence selected is a protein, the available databases are as follows:

The Reference or Representative Genome proteins (faa) includes those genomes that RefSeq has given special status(2). The reference genomes represent the highest quality dataset that is supported by curation by NCBI scientific staff, and the representative genomes are another high-quality selection that were identified at RefSeq by clustering genomes and applying weighting metrics that include consideration of species-level taxonomic classification (e.g., a preference for type strain) and assembly quality (e.g. a preference for complete genomes but WGS is allowed).

Transcriptomics Genomes proteins (faa) include all the proteins that are found in any of the microarray experiments that are included in PATRIC.

Specialty gene reference proteins (faa) contain all the genes used by PATRIC to tag genes that are of special interest. These include genes that have been identified as virulence factors, as important in antibiotic resistance or susceptibility, as homologs of human genes, or as having been investigated as drug targets.

Search within a selected genome(s) allows researchers to choose specific genomes that they wish to BLAST against.

Search within selected genome group allows researchers to BLAST against any of the genome groups that they have created and stored in their workspace.

Search with selected taxon allows researchers to BLAST their sequence against any taxon level available in PATRIC.

If the sequence selected for BLAST analysis is nucleotide, the available databases are as follows:

The Reference or Representative Genome genes (fna), or fasta nucleic acid, includes those genomes that RefSeq has given special status(2). .fna is used generically to specify nucleic acids. The reference genomes represent the highest quality dataset that is supported by curation by NCBI scientific staff, and the representative genomes are another high-quality selection that were identified at RefSeq by clustering genomes and applying weighting metrics that include consideration of species-level taxonomic classification (e.g., a preference for type strain) and assembly quality (e.g. a preference for complete genomes but WGS is allowed). This will include non-coding sequences, like intergenic regions.

The Reference or Representative Genome features (ffn) is the FASTA nucleotide of gene regions, and this database contains all the coding regions across this special selection of genomes.

The Reference and Representative Genome features (frn) is the FASTA non-coding RNA, and includes all the non-coding RNA regions for a genome (tRNA, rRNA).

PATRIC 16sRNA genes (frn) includes all the 16s rRNA genes across all the genomes available in PATRIC.

Transcriptomic genomes (ffn) will BLAST against all the genome sequences that have expression data associated with them that are publicly available in PATRIC. This will include non-coding sequences, like intergenic regions.

Transcriptomics Genomes feature (ffn) will BLAST against all the coding sequences from the genomes that have expression data associated with them that are publicly available in PATRIC.

Plasmid contigs (fna) will BLAST against all the sequences identified as coming from plasmids that are available in PATRIC. This will include non-coding sequences, like intergenic regions.

Search within selected genomes allows researchers to choose specific genomes that they wish to BLAST against.

Search within selected genome group allows researchers to BLAST against any of the genome groups that they have created and stored in their workspace.

Search with selected taxon allows researchers to BLAST their sequence against any taxon level available in PATRIC.

IV. BLASTing Against Gene Features or Contigs

Depending upon query type, researchers will be able to choose to search entire genomes or limit the search to only features. When BLASTN, TBLASTN, or TBLASTX are selected, researchers can choose to search against either contigs or features. When BLASTP or BLASTX are selected, the search is limited to features.

V. Adjusting the BLAST

Once a database to BLAST against is selected, researchers have the option of further refining the BLAST job by using the Advanced Options.

Researchers can adjust both the number of hits returned, and the E value threshold. There are limits to the number of hits returned. To see the available number, click on the arrow at the end of the text box under Max Hits. This will open a drop down box that allows researchers to choose 1, 10, 50, 100 or 500 hits.
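The effect of these two options can be sketched as follows. This is an illustration only, not PATRIC's implementation: the hit records, the `evalue` field name and the choice to rank hits by ascending E value are assumptions, while the allowed Max Hits values are the ones listed above.

```python
ALLOWED_MAX_HITS = (1, 10, 50, 100, 500)  # choices offered in the Max Hits drop down box


def select_hits(hits, evalue_threshold, max_hits):
    """Keep hits at or below the E value threshold, best first, capped at max_hits."""
    if max_hits not in ALLOWED_MAX_HITS:
        raise ValueError(f"max_hits must be one of {ALLOWED_MAX_HITS}")
    passing = [hit for hit in hits if hit["evalue"] <= evalue_threshold]
    passing.sort(key=lambda hit: hit["evalue"])  # assumed ranking: lowest E value first
    return passing[:max_hits]


if __name__ == "__main__":
    toy_hits = [
        {"subject": "feature_A", "evalue": 1e-50},
        {"subject": "feature_B", "evalue": 0.5},
        {"subject": "feature_C", "evalue": 3e-12},
    ]
    for hit in select_hits(toy_hits, evalue_threshold=1e-5, max_hits=10):
        print(hit["subject"], hit["evalue"])
```

A stricter (smaller) E value threshold removes weak matches before the hit cap is applied, so the two settings interact.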

VI. Submitting the BLAST Job

Once the sequence has been uploaded, the program and database selected, and the BLAST parameters adjusted, the job can be started by clicking the Search button at the bottom of the page.

VII. Examining the BLAST Results

When the BLAST results are ready, the page will reload showing the name of the organism, the query and subject coverage, the score and the E value. Depending on the type of BLAST selected, researchers will also see locus tags, gene symbols and functional descriptions for the features, or information about the genomic contigs.

Clicking on a single check box in front of a specific return will do two things. It will populate the vertical green bar with all the possible downstream analysis tools or processes that can be deployed with that selection. With a single choice, those possibilities include

The ability to download information on the sequence

The fasta file (protein or nucleotide)

The ability to look up other identifiers linked to the gene with the ID Mapping tool, and to see which pathway includes the selected gene

The ability to either create a new group that includes the feature, or to add it to an existing group

A direct link to the genome landing page that the feature belongs to.

A direct link to the feature landing page for that gene or protein.

When a single result is selected, the information about that specific choice also appears beyond the green bar.

Clicking on multiple check boxes will also populate the vertical green bar with the downstream analysis tools or processes. These are similar to those enabled when a single return was selected, with a few differences that include:

The ability to generate a multiple sequence alignment (MSA)

The ability to go to a specific landing page that summarizes all the data across the selected features (Features icon)

The ability to go to a specific landing page that summarizes all the data across the genomes that contain the selected features (Genomes icon)

Some examples of downstream processes are demonstrated, specifically obtaining sequences, generating gene trees/multiple sequence alignments, and summarizing selected genes on a KEGG pathway map.

VIII. Submitting Another BLAST Job

At the top of the BLAST result page, researchers can click on the Edit form and resubmit button to initiate another BLAST job.

This will reload the page, showing the original parameters used for the first BLAST job. These can be adjusted, with a second job submitted by clicking the Search button at the bottom of the page.

Characterization of Tannase Protein Sequences of Bacteria and Fungi: An In Silico Study

The tannase protein sequences of 149 bacteria and 36 fungi were retrieved from the NCBI database. Of these, only the 77 bacterial and 31 fungal tannase sequences with distinct amino acid compositions were retained. These sequences were analysed for physical and chemical properties and subjected to superfamily search, multiple sequence alignment, phylogenetic tree construction and motif finding, in order to identify functional motifs and the evolutionary relationships among them. The superfamily search for these tannases revealed the occurrence of the proline iminopeptidase-like, biotin biosynthesis protein BioH, O-acetyltransferase, carboxylesterase/thioesterase 1, carbon–carbon bond hydrolase, haloperoxidase, prolyl oligopeptidase, C-terminal domain and mycobacterial antigens families and the alpha/beta hydrolase superfamily. Some bacterial and fungal sequences showed similarity with different families individually. The multiple sequence alignment of these tannase protein sequences showed conserved regions at different stretches, with maximum homology at amino acid residues 389–469 and 482–523, which could be used for designing degenerate primers or probes specific for tannase-producing bacterial and fungal species. The phylogenetic tree showed two different clusters, one containing only bacteria and another containing both fungi and bacteria, indicating some relationship between these different genera. In the second cluster, however, nearly all fungal species were grouped together, which indicates sequence-level similarity among fungal genera. Analysis of the distribution of fourteen motifs revealed that Motif 1, with a signature sequence of 29 amino acids, i.e. GCSTGGREALKQAQRWPHDYDGIIANNPA, was uniformly observed in 83.3 % of the studied tannase sequences, suggesting its participation in the structure and enzymatic function.


After initially studying physics, Gish obtained an A.B. degree in Biochemistry from University of California, Berkeley, and completed work for his Ph.D. degree in Molecular Biology at the same institution in 1988. [1]

Gish is primarily known for his contributions to NCBI BLAST, [4] [5] his creation of the BLAST Network Service and nr (non-redundant) databases, his 1996 release of the original gapped BLAST (WU-BLAST 2.0), and most recently his development and support of AB-BLAST. At Washington University in St. Louis, Gish also led the genome analysis group which annotated all finished human, mouse and rat genome data produced by the University's Genome Sequencing Center from 1995 through 2002.

As a graduate student, Gish applied the Quine-McCluskey algorithm to the analysis of splice site recognition sequences. In 1985, with a view toward rapid identification of restriction enzyme recognition sites in DNA, Gish developed a DFA function library in the C language. The idea to apply a finite-state machine to this problem had been suggested by fellow graduate student and BSD UNIX developer Mike Karels. Gish's DFA implementation was that of a Mealy machine architecture, which is more compact than an equivalent Moore machine and hence faster. Construction of the DFA was O(n), where n is the sum of the lengths of the query sequences. The DFA could then be used to scan subject sequences in a single pass with no backtracking in O(m) time, where m is the total length of the subject(s). The method of DFA construction was recognized later as being a consolidation of two algorithms, Algorithms 3 and 4 described by Alfred V. Aho and Margaret J. Corasick. [6]
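The idea of compiling a set of recognition sequences into a DFA and then scanning a subject in a single pass can be illustrated with a short sketch based on the Aho-Corasick construction. It is not Gish's C library: outputs are attached to states here (Moore-style) rather than to transitions (Mealy-style), the alphabet is assumed to be A, C, G and T, and the patterns in the usage example are simply the EcoRI and BamHI recognition sites.

```python
from collections import deque


def build_dfa(patterns, alphabet="ACGT"):
    """Build a complete transition table plus per-state outputs for a set of patterns."""
    goto = [{}]      # goto[state][symbol] -> next state
    outputs = [[]]   # outputs[state] -> patterns that end when this state is reached
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in goto[state]:
                goto.append({})
                outputs.append([])
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        outputs[state].append(pattern)

    fail = [0] * len(goto)
    queue = deque()
    for symbol in alphabet:
        if symbol in goto[0]:
            queue.append(goto[0][symbol])
        else:
            goto[0][symbol] = 0  # unmatched symbols keep the automaton at the start state
    while queue:
        state = queue.popleft()
        for symbol in alphabet:
            if symbol in goto[state]:
                child = goto[state][symbol]
                fail[child] = goto[fail[state]][symbol]
                outputs[child] = outputs[child] + outputs[fail[child]]
                queue.append(child)
            else:
                # Fill in the missing transition so scanning never needs to backtrack.
                goto[state][symbol] = goto[fail[state]][symbol]
    return goto, outputs


def scan(text, goto, outputs):
    """Scan the subject once, reporting (start position, pattern) for every match."""
    hits = []
    state = 0
    for position, symbol in enumerate(text):
        state = goto[state].get(symbol, 0)  # symbols outside the alphabet reset the state
        for pattern in outputs[state]:
            hits.append((position - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    goto, outputs = build_dfa(["GAATTC", "GGATCC"])
    print(scan("TTGAATTCGGATCCA", goto, outputs))
```

Each subject symbol triggers exactly one table lookup, which is what gives the O(m) scan; the table is built once per pattern set.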

While working for U.C. Berkeley in December 1986, Gish sped up the FASTP program [7] (later known as FASTA [8] ) of William R. Pearson and David J. Lipman by 2- to 3-fold without altering the results. When the performance modifications were communicated to Pearson and Lipman, Gish further suggested that a DFA (rather than a lookup table) would yield faster k-tuple identification and improve the overall speed of the program by perhaps as much as 10% in some cases; however, such marginal improvement even in the best case was deemed by the authors to not be worth the added code complexity. Gish also envisioned at this time a centralized search service, wherein all nucleotide sequences from GenBank would be maintained in memory to eliminate I/O bottlenecks (and stored in compressed form to conserve memory), with clients invoking FASTN searches remotely via the Internet.

Gish's earliest contributions to BLAST were made while working at the NCBI, starting in July 1989. Even in early prototypes BLAST was typically much faster than FASTA. Gish recognized the potential added benefit in this application of using a DFA for word-hit recognition. He morphed his earlier DFA code into a flexible form that he incorporated into all BLAST search modes. Others of his contributions to BLAST include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; parallel processing; memory-mapped I/O; the use of sentinel bytes and sentinel words at the start and end of sequences to improve the speed of word-hit extension; the original implementations of BLASTX, [9] TBLASTN [4] and TBLASTX (unpublished); the transparent use of external (plug-in) programs such as seg, xnu, and dust to mask low-complexity regions in query sequences at run time; the NCBI BLAST E-mail Service with optional public key-encrypted communications; the NCBI Experimental BLAST Network Service; and the NCBI non-redundant (nr) protein and nucleotide sequence databases, typically updated on a daily basis with all data from GenBank, Swiss-Prot, and the PIR. Gish developed the first BLAST API, which was used in EST [10] annotation and Entrez data production, as well as in the NCBI BLAST version 1.4 application suite (Gish, unpublished). Gish was also the creator of and project manager for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker). First opened to outside users in December 1989, the NCBI Experimental BLAST Network Service, running the latest BLAST software on SMP hardware against the latest releases of the major sequence databases, quickly established the NCBI as a convenient, one-stop shop for sequence similarity searching.

At Washington University in St. Louis, Gish revolutionized similarity searching by developing the first BLAST suite of programs to combine rapid gapped sequence alignment with statistical evaluation methods appropriate for gapped alignment scores. The resulting search programs were significantly more sensitive but only marginally slower than ungapped BLAST, due to novel application of the BLAST dropoff score X during gapped alignment extension. Sensitivity of gapped BLAST was further improved by the novel application of Karlin-Altschul Sum statistics [11] to the evaluation of multiple, gapped alignment scores in all BLAST search modes. Sum statistics were originally developed analytically for the evaluation of multiple, ungapped alignment scores. The empirical use of Sum statistics in the treatment of gapped alignment scores was validated in collaboration with Stephen Altschul, from 1994-1995. In May 1996, WU-BLAST version 2.0 with gapped alignments was publicly released in the form of a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after having forked in 1994). Little NIH funding was received for his WU-BLAST development, with an average of 20% FTE starting in November 1995, and ending shortly after the September 1997 release of the NCBI gapped BLAST (“blastall”). As an option to WU-BLAST, Gish implemented a faster, more memory-efficient and more sensitive two-hit BLAST algorithm than was used by the NCBI software for many years. In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. This was also the first time any BLAST package introduced a new database format transparently to existing users, without abandoning support for prior formats, as a result of abstracting the database I/O functions away from the data analysis functions. 
WU-BLAST with XDF was the first BLAST suite to support indexed retrieval of NCBI standard FASTA-format sequence identifiers (including the entire range of NCBI identifiers); the first to allow retrieval of individual sequences in part or in whole, natively, translated or reverse-complemented; and the first able to dump the entire contents of a BLAST database back into human-readable FASTA format. In 2000, unique support for reporting of links (consistent sets of HSPs, also called chains in some later software packages) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the expected longest intron in the species of interest), with the distance limitation entering into the calculation of E-values. Between 2001 and 2003, Gish improved the speed of the DFA code used in WU-BLAST. Gish also proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST); implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to aid analysis of segmented query sequences from shotgun sequencing assemblies; and directed use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking genome sequences for repetitive elements and low-complexity sequences (the MaskerAid [12] package for RepeatMasker). With doctoral student Miao Zhang, Gish directed development of EXALIN, [13] which significantly improved the accuracy of spliced alignment predictions by a novel approach that combined information from donor and acceptor splice site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally utilize the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.

In 2008, Gish founded Advanced Biocomputing, LLC, where he continues to improve and support the AB-BLAST package.


Horizontal gene transfer can be defined as the movement of genetic material between phylogenetically unrelated organisms by mechanisms other than parent to progeny inheritance. Any biological advantage provided to the recipient organism by the transferred DNA creates selective pressure for its retention in the host genome. A number of recent reviews describe several well-established pathways of horizontal transfer [1–4]. Evidence for the unexpectedly high frequency of horizontal transmission has spawned a major re-evaluation in scientific thinking about how taxonomic relationships should be modeled [4–9]. It is now considered a major factor in the process of environmental adaptation, for both individual species and entire microbial populations. Horizontal transfer has also been proposed to play a role in the emergence of novel human diseases, as well as determining their virulence [10, 11].

There is currently no single bioinformatics tool capable of systematically identifying all laterally acquired genes in an entire genome. Available methods for identifying horizontal transfer generally rely on finding anomalies in either nucleotide composition or phylogenetic relationships with orthologous proteins. Nucleotide content and phylogenetic relatedness methods have the advantage of being independent of each other, but often give completely different results. There is no 'gold standard' to determine which, if either, is correct, but it has been suggested that different methodologies may be detecting lateral transfer events of different relative ages [2, 12].

In addition to having good sensitivity and specificity, ideal tools for identifying horizontal transfer at the genomic level should be computationally efficient and automated. The current environment of rapid database expansion may require analyses to be re-performed frequently, in order to take advantage of both new genome sequences and new annotation information describing previously unknown protein functions. Re-analysis using updated data may provide new insights, or even change conclusions completely.

A variety of strategies have been used to predict horizontal gene transfer using nucleotide composition of coding sequences. Early methods flagged genes with atypical G + C content; later methods evaluate codon usage patterns as predictors of horizontal transfer [13–15]. A variety of so-called 'genomic signature' models have been proposed, using nucleotide patterns of varying lengths and codon position. These models have been analyzed both individually and in various combinations, using sliding windows, Bayesian classifiers, Markov models, and support vector machines [16–19].
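The simplest of these ideas, flagging genes whose G + C content is atypical for the genome, can be sketched in a few lines of Python. This is only an illustration and not the procedure of any cited method: the function names, the use of a z-score, and the cutoff of 2.0 are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def gc_fraction(seq: str) -> float:
    """Return the fraction of G or C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def flag_atypical_gc(genes: dict[str, str], z_cutoff: float = 2.0) -> list[str]:
    """Return names of genes whose G + C fraction deviates from the genome mean.

    Assumption for illustration: a gene is 'atypical' when its G + C fraction
    lies more than z_cutoff population standard deviations from the mean over
    all supplied genes. Published methods use more elaborate criteria.
    """
    values = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return []  # every gene has the same composition, nothing to flag
    return [name for name, gc in values.items() if abs(gc - mu) / sigma > z_cutoff]
```

A gene flagged this way is only a candidate: as the surrounding text notes, composition alone says nothing about where the gene came from, and some native genes have atypical composition for functional reasons.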

One limitation of nucleotide signature methods is that they can suggest that a particular gene is atypical, but provide no information as to where it might have originated. To discover this information, and to verify the validity of positive candidates, signature-based methods rely on subsequent validation by phylogenetic methods. These cross-checks have revealed many clear examples of both false positive and false negative predictions in the literature [20–23].

The fundamental source of error in predictions based on genomic signature methods is the assumption that a single, unique pattern can be applied to an organism's entire genome [24]. This assumption fails in cases where individual proteins require specialized, atypical amino acid sequences to support their biological function, causing their nucleotide composition to deviate substantially from the 'average' consensus for a particular organism. Ribosomal proteins, a well-known example of this situation, must often be manually removed from lists of horizontal transfer candidates generated by nucleotide-based identification methods [25].

The assumption of genomic uniformity is also incorrect in the case of eukaryotes that have historically acquired a large number of sequences through horizontal transfer from an internal symbiont, or from an organelle such as a mitochondrion or chloroplast. For example, the number of genes believed to have migrated from chloroplast to nucleus represents a substantial portion of the typical plant genome [26]. In this case, patterns of nucleotide composition should fall into at least two distinct classes, requiring multiple training sets to build successful models using machine learning algorithms. To avoid this complexity, many authors propose limiting application of their genomic signature methods to simple prokaryotic or archaeal systems.

Phylogenetic methods seek to identify horizontal transfer candidates by comparison to a baseline phylogenetic tree (or set of trees) for the host organism. Baseline trees are usually constructed using ribosomal RNA and/or a set of well-conserved, well-characterized protein sequences [27]. Each potential horizontal transfer candidate protein is then evaluated by building a new phylogenetic tree, based on its individual sequence, and comparing this tree to the overall baseline for the organism. Unexpectedness is usually defined as finding one or more nearest neighbors for the test sequence in disagreement with the baseline tree. More recently, a number of automated tree building methods have used statistical approaches to identify trees for individual genes that do not fit a consensus tree profile [28–32].

Although phylogenetic trees are generally considered the best available technique for determining the occurrence and direction of horizontal transfer, they have a number of known limitations. Analysts must choose appropriate algorithms, out-groups, and computational parameters to adjust for variability in evolutionary distance and mutation rates for individual data sets. Results may be inconclusive unless a sufficient number and diversity of orthologous sequences are available for the test sequence. In some cases, a single set of input data may support multiple different tree topologies, with no one solution clearly superior to the others. Building trees is especially challenging in cases where the component sequences are derived from organisms at widely varying evolutionary distances.

Perhaps the biggest drawback to using tree-based methods for identifying horizontal transfer candidates is that these methods are very computationally expensive and time consuming; it is currently impractical to perform them on large numbers of genomes, or to update results frequently as new information is added to underlying sequence databases. Even a relatively small prokaryotic genome requires building and analyzing thousands of individual phylogenetic trees. To manage this computational complexity, many authors exploring horizontal transfer events have been forced to limit their calculations to one or a few candidate sequences at a time.

More recently, semi-automated methods have become available for building multiple phylogenetic trees at once [33, 34]. These methods are suitable for application to whole genomes, and include screening routines to identify trees containing potential horizontal transfer candidates. However, to achieve reasonable sensitivity without an unacceptable false positive rate, these methods still require each candidate tree identified by the automated screening process to be manually evaluated. One recent publication described the automated creation of 3,723 trees, of which 1,384 were identified as containing potential horizontal candidates [35]. After all 1,384 candidate trees were inspected manually, approximately half were judged too poorly resolved to be useful in making a determination. Of the remaining trees, only 31 were ultimately selected as containing horizontally transferred proteins. Despite the Herculean effort involved in producing these data, the authors concluded that it was only a 'first look' at horizontal transfer, which would need to be repeated when more sequence data became available for closely related organisms.

Given the time and difficulty of creating phylogenetic trees from scratch, a tool that automatically coupled amino acid sequence data with known lineage information could avoid an enormous amount of repetitive effort in re-calculating well-established facts. It is, therefore, somewhat surprising that currently available methods do not generally take advantage of resources like the NCBI Taxonomy database, which links phylogenetic information for thousands of different species to millions of protein sequences. One notable exception has been the work of Koonin et al. [1], who searched for horizontal transfer in 31 bacterial and archaeal genomes by a combination of BLAST searches with semi-automated and manual screening techniques. To avoid false positive results, these authors felt it necessary to manually check every 'paradoxical' best hit, in many cases amounting to several hundred matches per microbial genome. While this strategy undoubtedly improved the quality of results presented, the extensive amount of time and labor required for manual inspection precludes applying the techniques used by these authors to larger eukaryotic genomes, or to the hundreds of new microbial genomes sequenced since 2001.

One potential problem in using taxonomy database information as a horizontal transfer identification tool is the difficulty of establishing reliable surrogate criteria for orthology, which might avoid the need for extensive re-building of phylogenetic trees. It is well known that 'top hit' sequence alignments identified by the BLAST search algorithm do not necessarily return the phylogenetically most appropriate match [36]. In addition to incorrect ranking of BLAST matches, other difficulties to be overcome include differences in BLAST score significance due to mutation rate variability, unequal representation of different taxa in source databases, and potential gene loss from closely related species [37]. Finally, any detection system dependent on identifying phylogenetically distant matches may sacrifice sensitivity in detecting horizontal transfer between closely related organisms.

To address these issues, the DarkHorse algorithm combines a probability-based, lineage-weighted selection method with a novel filtering approach that is both configurable for phylogenetic granularity, and adjustable for wide variations in protein sequence conservation and external database representation. It provides a rapid, systematic, computationally efficient solution for predicting the likelihood of horizontally transferred genes on a genome-wide basis. Results can be used to characterize an organism's historical profile of horizontal transfer activity, density of database coverage for related species, and individual proteins least likely to have been vertically inherited. The method is applicable to genomes with non-uniform compositional properties, which would otherwise be intractable to genomic signature analysis. Because the procedure is both rapid and automated, it can be performed as often as necessary to update existing analyses. Thus, it is particularly useful as a screening tool for analyzing draft genome sequences, as well as for application to organisms where the number of database sequences available for taxonomic relatives is changing rapidly. Promising results can then be prioritized and analyzed in more depth using independent criteria, such as nucleotide composition, manual construction of phylogenetic trees, syntenic neighbor analysis, or other more detailed, labor-intensive methods.

Result Formats:

HTML hypertext

Normal text

RANK, STATUS, SCORE, E-VALUE, PROGRAM, Gap Penalties (Existence), Gap Penalty (Extension), EMPTY, EMPTY, MATRIX, TEMPFILENAME, QUERY LENGTH, empty, QUERY NAME, DATASET, Target length, empty, DESCRIPTION, empty, empty, empty, empty, empty, empty, empty, empty, empty, Identities, Positives, Gaps, Percentage ratio of identical matches to the length of the alignment, Percentage ratio of identical matches to the length of the query, unknown, unknown, Percentage ratio of identical matches to the length of the target, unknown, unknown, Query Start, Query End, Target Start, Target End, empty, QUERY NT, COMPARISON, TARGET NT
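
A field list like the one above can be turned into named records with a short Python sketch. Several things here are assumptions rather than documented behaviour: that each hit is one line, that fields are tab-separated, and that the 45 columns appear exactly in the order listed. The short column names and both function names are our own shorthand for this example.

```python
# Column labels in the order of the field list above. None marks columns
# the list describes as empty or unknown. The short names are hypothetical.
COLUMNS = [
    "rank", "status", "score", "e_value", "program",
    "gap_existence", "gap_extension", None, None, "matrix",
    "tempfilename", "query_length", None, "query_name", "dataset",
    "target_length", None, "description",
    None, None, None, None, None, None, None, None, None,
    "identities", "positives", "gaps",
    "pct_identity_alignment", "pct_identity_query", None, None,
    "pct_identity_target", None, None,
    "query_start", "query_end", "target_start", "target_end",
    None, "query_nt", "comparison", "target_nt",
]


def parse_record(line: str, sep: str = "\t") -> dict[str, str]:
    """Split one result line and keep only the named columns.

    The separator is an assumption; pass a different `sep` if the real
    output uses another delimiter.
    """
    values = line.rstrip("\n").split(sep)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return {name: value for name, value in zip(COLUMNS, values) if name is not None}


def hits_matching(lines, keyword: str, sep: str = "\t") -> list[dict[str, str]]:
    """Return parsed records whose description contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    records = (parse_record(line, sep) for line in lines if line.strip())
    return [rec for rec in records if keyword in rec["description"].lower()]
```

For example, `hits_matching(open("results.txt"), "cytochrome p450")` would keep only hits whose DESCRIPTION mentions that phrase, where `results.txt` is a hypothetical file holding one record per line.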